Access to buckets on AWS and GCP from local computers #22
Comments
Would this work for someone on an HPC system as well? If so, that might be a solution to the ticket I opened today (not sure how to link those TBH).
I lack experience with HPC systems, but is the difference between "your computer" and "an HPC system" that you only have terminal access, as compared to the ability to open a browser etc? Then yes, I think that is the answer you seek. You can still extract temporary cloud credentials from a hub at 2i2c; this ought to be independent of where you extract them to. And then, these can be used from a terminal on an HPC system using the CLI.
That sounds good. Is there a preliminary implementation of this? We have a few time-sensitive tasks which include some form of "upload from HPC" step. Happy to test drive stuff.
If you can verify this workflow @jbusecke, it would be helpful!
Note that the token lasts for one hour, and that if you re-run the print-access-token command, it will rely on a previously cached token I think, so it will be one hour since the initial generation unless you clear the cache from somewhere in the home folder.
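For reference, a minimal sketch of that extraction step (assuming `gcloud` is installed and already authenticated on the hub server; the `token.txt` filename is just the example used later in this thread):

```python
# Sketch: grab a fresh access token on the hub and record when it was generated,
# since it is only valid for roughly one hour from (cached) generation.
import datetime
import subprocess

token = subprocess.run(
    ["gcloud", "auth", "print-access-token"],
    capture_output=True, text=True, check=True,
).stdout.strip()

with open("token.txt", "w") as f:
    f.write(token)

print("token written at", datetime.datetime.now().isoformat())
```

The resulting `token.txt` can then be copied to the HPC system or local machine (e.g. with `scp`) and used as in the snippets below.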
Just to confirm:
This would be on a running server on the hub? And installation is via these instructions?
@jbusecke yep! But I think on the pangeo-data images, they are probably already installed. They're also available from conda if you prefer: https://anaconda.org/conda-forge/google-cloud-sdk
I just tested this.
Ok here are the steps I took:

```python
from google.cloud import storage
from google.oauth2.credentials import Credentials

# import an access token
# - option 1: read an access token from a file
with open("token.txt") as f:
    access_token = f.read().strip()

# setup a storage client using credentials
credentials = Credentials(access_token)
storage_client = storage.Client(credentials=credentials)
```

and got this warning:
```python
# test the storage client by trying to list content in a google storage bucket
bucket_name = "leap-scratch/jbusecke"  # don't include gs:// here
blobs = list(storage_client.list_blobs(bucket_name))
print(len(blobs))
```

which got me a 404 error:
```
---------------------------------------------------------------------------
NotFound                                  Traceback (most recent call last)
Cell In[3], line 3
      1 # test the storage client by trying to list content in a google storage bucket
      2 bucket_name = "leap-scratch/jbusecke" # don't include gs:// here
----> 3 blobs = list(storage_client.list_blobs(bucket_name))
      4 print(len(blobs))

File ~/miniconda/envs/test-gcs-token/lib/python3.10/site-packages/google/api_core/page_iterator.py:208, in Iterator._items_iter(self)
File ~/miniconda/envs/test-gcs-token/lib/python3.10/site-packages/google/api_core/page_iterator.py:244, in Iterator._page_iter(self, increment)
File ~/miniconda/envs/test-gcs-token/lib/python3.10/site-packages/google/api_core/page_iterator.py:373, in HTTPIterator._next_page(self)
File ~/miniconda/envs/test-gcs-token/lib/python3.10/site-packages/google/api_core/page_iterator.py:432, in HTTPIterator._get_next_page_response(self)
File ~/miniconda/envs/test-gcs-token/lib/python3.10/site-packages/google/cloud/storage/_http.py:72, in Connection.api_request(self, *args, **kwargs)
File ~/miniconda/envs/test-gcs-token/lib/python3.10/site-packages/google/api_core/retry.py:349, in Retry.__call__.<locals>.retry_wrapped_func(*args, **kwargs)
File ~/miniconda/envs/test-gcs-token/lib/python3.10/site-packages/google/api_core/retry.py:191, in retry_target(target, predicate, sleep_generator, timeout, on_error, **kwargs)
File ~/miniconda/envs/test-gcs-token/lib/python3.10/site-packages/google/cloud/_http/__init__.py:494, in JSONConnection.api_request(self, method, path, query_params, data, content_type, headers, api_base_url, api_version, expect_json, _target_object, timeout, extra_api_info)

NotFound: 404 GET https://storage.googleapis.com/storage/v1/b/leap-scratch/jbusecke/o?projection=noAcl&prettyPrint=false: Not Found
```

Am I using the url path wrong here?
@jbusecke try the bucket name as just `leap-scratch`, without the path.
I think you can also use the environment variable
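To make the bucket-name suggestion above concrete, a minimal sketch (assuming the intent is to pass only the bucket name and move the rest of the path into a prefix):

```python
# Sketch: list objects under a path by splitting the bucket name from the object prefix.
# storage_client is the client constructed in the snippet above.
bucket_name = "leap-scratch"   # just the bucket, no gs:// and no path
prefix = "jbusecke/"           # the path inside the bucket goes here

blobs = list(storage_client.list_blobs(bucket_name, prefix=prefix))
print(len(blobs))
```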
Yay! That worked.
As in exporting that on my local machine? I suppose that for many of the workflows we would want to have a notebook/script on the HPC cluster which creates an xarray object from e.g. many netcdfs and then writes a zarr store directly to the bucket (unless this is not a recommended workflow). Is there a way to use this token with gcsfs? I just tried naively:

```python
fs = gcsfs.GCSFileSystem(token=access_token)
```

which errors and then prints the token 😱, which is not ideal.
Looking at https://gcsfs.readthedocs.io/en/latest/#credentials, it looks like you can pass the `credentials` object directly as the `token`.
Amazing. To wrap up, here is what I did (reusing the `credentials` object from above):

```python
import gcsfs
import xarray as xr

fs = gcsfs.GCSFileSystem(token=credentials)
ds = xr.DataArray([1]).to_dataset(name='test')
mapper = fs.get_mapper('leap-scratch/jbusecke/test_offsite_upload.zarr')
ds.to_zarr(mapper)
```
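If useful, a quick sanity check that the write landed (same session, token still valid; reuses `xr` and `mapper` from the snippet above):

```python
# Sketch: read the freshly written zarr store back and print its contents.
ds_roundtrip = xr.open_zarr(mapper)
print(ds_roundtrip)
```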
This is awesome! Thanks. I will try this tomorrow with a collaborator. One last question: the collaborator should extract the token from their account, correct?
I anticipate the 1-hour limit will become a bottleneck for larger datasets in the future. If that could be relaxed somehow, I believe that would be very useful.
I'm not confident you get shut down if the token expires - the token can't be checked at every byte sent etc - so when is it checked? Is it checked in between each object uploaded, for example? @jbusecke if you come to practical conclusions about this, that's also very relevant to capture in documentation! I think it's likely that if a very large object is being copied, that large object gets copied all the way even if it takes 2 hours.
Also, is it better to not assign the token to a variable for security reasons? E.g.

```python
with open("token.txt") as f:
    # setup a storage client using credentials
    credentials = Credentials(f.read().strip())
```

Then again this only lives for 1 hour, so the risk is not particularly high I guess. Another comment re security: I noticed that with this credential I can also delete files. I did
Do you think this is also valid if the zarr store is written in many small chunks (~100-200MB) in a streaming fashion rather than uploading a large gzip file? I guess this will be a good test to perform.
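A rough sketch of such a test, in case it helps (the sizes, variable names, and target path are made up; it assumes dask is available and `fs` is the authenticated `GCSFileSystem` from above):

```python
# Sketch: write a dataset split into many ~160 MB chunks and watch whether the
# upload fails once the one-hour token expires mid-write.
import dask.array as da
import xarray as xr

# ~40 lazily generated chunks of ~160 MB each (10 * 2_000_000 * 8 bytes per chunk)
data = da.random.random((400, 2_000_000), chunks=(10, 2_000_000))
ds = xr.DataArray(data, dims=("time", "x"), name="test").to_dataset()

mapper = fs.get_mapper("leap-scratch/jbusecke/test_chunked_upload.zarr")
ds.to_zarr(mapper)  # observe whether this aborts once the token expires
```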
I will absolutely write up some docs once we have prototyped this. I assume this should go into the 2i2c docs (with LEAP linking there from our docs)?
@jbusecke I'd like this issue to stay scoped to how to extract short-lived credentials matching those provided to you as a user on the user server provided by the hub. As a separate matter, one can consider whether it's feasible for 2i2c to help provide read-only credentials to some users and read/write to others, but that is an additional, unrelated customization on top of the credentials provided to the user server in the first place.
I'd like these docs to live in scottyhq/jupyter-cloud-scoped-creds as a project, without assumptions of coupling to 2i2c or similar. I've also proposed that it's a project we help get into the jupyterhub GitHub org in the long run.
No clue!
Sounds good @consideRatio. I'll report back how our testing goes tomorrow.
Hey everyone, I think that my suspicion about the short validity of the token turns out to be a problem here. @jerrylin96 got an error, and I suspect every chunk written requires a valid authentication, and thus most of the datasets we are (and will be) using would require an access token that is valid for a longer time. @consideRatio is it possible to configure the time the token is valid?
I don't really think there's a way to make that token have a longer duration. I think instead, we should make a separate service account and try to securely provide credentials for that.
👍 on adding a specific service account to address this, short term at least. I remember reading about ways of getting longer durations for the tokens for AWS and/or GCP, but it required cloud configuration to allow for it, combined with explicit configuration in the request.
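If a dedicated service account ends up being the route taken, a minimal sketch of how its key could be used from the HPC side with gcsfs (the key filename is hypothetical; gcsfs accepts a path to a service-account JSON key via `token=`):

```python
# Sketch: authenticate gcsfs with a (hypothetical) service-account key file
# instead of a one-hour access token.
import gcsfs

fs = gcsfs.GCSFileSystem(token="leap-uploader-key.json")  # path to the JSON key
print(fs.ls("leap-scratch/jbusecke"))
```

Note that, unlike the short-lived token, such a key does not expire on its own and would need to be stored and shared carefully.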
Thanks for the update! Getting this figured out would unblock a bunch of people here at LEAP who want to upload and share datasets with the project (this will definitely also accelerate science) and is thus very high on my internal priority list. If there is any way I can help with this, please let me know.
GCS allows individual Google Users as well as Google Groups to have permissions to read / write to GCS buckets (unlike AWS). We can use this to allow community leaders to manage who can read and write to GCS buckets from outside the cloud by managing membership in a Google Group! In this commit, we set up the persistent buckets of the LEAP hubs to have this functionality. Access is managed via a Google Group - I have temporarily created this under the 2i2c org and invited Julius (the community champion) as an administrator. But perhaps it should just be created as a regular Google Group. Using groups here means managing this access does not require any 2i2c engineering work. Future work would probably fold the separate variable we have for determining whether a bucket is publicly accessible into this attribute as well. Ref https://github.com/2i2c-org/infrastructure/issues/2096
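For reference, a binding like that can also be expressed through the google-cloud-storage IAM API; a minimal sketch (the bucket name, group address, and role are assumptions, not the actual 2i2c configuration):

```python
# Sketch: grant a Google Group write access to a bucket via an IAM policy binding.
from google.cloud import storage

client = storage.Client()  # requires credentials with permission to edit bucket IAM
bucket = client.bucket("leap-persistent")  # hypothetical bucket name

policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append(
    {
        "role": "roles/storage.objectAdmin",
        "members": {"group:leap-bucket-writers@example.org"},  # hypothetical group
    }
)
bucket.set_iam_policy(policy)
```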
We provide storage buckets for users to write to and set up credentials for them within the user servers started on the hubs. But what if they want to upload something to those buckets from their local computer or similar - then how do they acquire permissions to do so?
@scottyhq has developed scottyhq/jupyter-cloud-scoped-creds, which currently has support for AWS S3 buckets but not for GCP buckets.
Work items
Resolved by Add subcommand for GCP scottyhq/jupyter-cloud-scoped-creds#2 (comment).
This should be documented as well, and it would be very good if the credentials are just temporary rather than long-term reusable.
We rely on GCP's workload identity, and AWS's IRSA, which are mechanisms to couple a k8s ServiceAccount with cloud provider credentials.
Resolved by Add subcommand for GCP scottyhq/jupyter-cloud-scoped-creds#2 (comment) - they are "short lived" and valid for one hour.
User requests
Maybe related
Related
jupyter-cloud-creds? scottyhq/jupyter-cloud-scoped-creds#5
aws or gcloud CLI's are missing scottyhq/jupyter-cloud-scoped-creds#6