Access to buckets on AWS and GCP from local computers #22
Comments
Would this work for someone on an HPC system as well? If so, that might be a solution to the ticket I opened today (not sure how to link those TBH).
I lack experience with HPC systems, but is the difference between "your computer" and "an HPC system" that you only have terminal access, as compared to the ability to open a browser etc? Then yes, I think that is the answer you seek. You can still extract temporary cloud credentials from a hub at 2i2c; this ought to be independent of where you extract them to. And then, these can be used from a terminal on an HPC system using the CLI.
That sounds good. Is there a preliminary implementation of this? We have a few time-sensitive tasks which include some form of "upload from HPC" step. Happy to test drive stuff.
If you can verify this workflow @jbusecke, it would be helpful!
Note that the token lasts for one hour, and that if you re-run the print-access-token command, it will rely on a previously cached token I think, so it will be one hour since the initial generation unless you clear the cache from somewhere in the home folder.
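For reference, a minimal sketch of that extraction step (assuming `gcloud` is installed and already authenticated on the hub server; the `token.txt` filename is just the example used later in this thread):

```python
# Sketch: grab a fresh access token on the hub and record when it was generated,
# since it is only valid for roughly one hour from (cached) generation.
import datetime
import subprocess

token = subprocess.run(
    ["gcloud", "auth", "print-access-token"],
    capture_output=True, text=True, check=True,
).stdout.strip()

with open("token.txt", "w") as f:
    f.write(token)

print("token written at", datetime.datetime.now().isoformat())
```

The resulting `token.txt` can then be copied to the HPC system or local machine (e.g. with `scp`) and used as in the snippets below.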
Just to confirm:
This would be on a running server on the hub? And installation is via these instructions?
@jbusecke yep! But I think on the pangeo-data images, they are probably already installed. They're also available from conda if you prefer: https://anaconda.org/conda-forge/google-cloud-sdk
I just tested this.
Ok here are the steps I took:

```python
from google.cloud import storage
from google.oauth2.credentials import Credentials

# import an access token
# - option 1: read an access token from a file
with open("token.txt") as f:
    access_token = f.read().strip()

# setup a storage client using credentials
credentials = Credentials(access_token)
storage_client = storage.Client(credentials=credentials)
```

and got this warning:
```python
# test the storage client by trying to list content in a google storage bucket
bucket_name = "leap-scratch/jbusecke"  # don't include gs:// here
blobs = list(storage_client.list_blobs(bucket_name))
print(len(blobs))
```

which got me a 404 error:
```
---------------------------------------------------------------------------
NotFound                                  Traceback (most recent call last)
Cell In[3], line 3
      1 # test the storage client by trying to list content in a google storage bucket
      2 bucket_name = "leap-scratch/jbusecke" # don't include gs:// here
----> 3 blobs = list(storage_client.list_blobs(bucket_name))
      4 print(len(blobs))

File ~/miniconda/envs/test-gcs-token/lib/python3.10/site-packages/google/api_core/page_iterator.py:208, in Iterator._items_iter(self)
File ~/miniconda/envs/test-gcs-token/lib/python3.10/site-packages/google/api_core/page_iterator.py:244, in Iterator._page_iter(self, increment)
File ~/miniconda/envs/test-gcs-token/lib/python3.10/site-packages/google/api_core/page_iterator.py:373, in HTTPIterator._next_page(self)
File ~/miniconda/envs/test-gcs-token/lib/python3.10/site-packages/google/api_core/page_iterator.py:432, in HTTPIterator._get_next_page_response(self)
File ~/miniconda/envs/test-gcs-token/lib/python3.10/site-packages/google/cloud/storage/_http.py:72, in Connection.api_request(self, *args, **kwargs)
File ~/miniconda/envs/test-gcs-token/lib/python3.10/site-packages/google/api_core/retry.py:349, in Retry.__call__.<locals>.retry_wrapped_func(*args, **kwargs)
File ~/miniconda/envs/test-gcs-token/lib/python3.10/site-packages/google/api_core/retry.py:191, in retry_target(target, predicate, sleep_generator, timeout, on_error, **kwargs)
File ~/miniconda/envs/test-gcs-token/lib/python3.10/site-packages/google/cloud/_http/__init__.py:494, in JSONConnection.api_request(self, method, path, query_params, data, content_type, headers, api_base_url, api_version, expect_json, _target_object, timeout, extra_api_info)

NotFound: 404 GET https://storage.googleapis.com/storage/v1/b/leap-scratch/jbusecke/o?projection=noAcl&prettyPrint=false: Not Found
```

Am I using the url path wrong here?
@jbusecke try the bucket name as just `leap-scratch`, without the path.
I think you can also use the environment variable
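To make the bucket-name suggestion above concrete, a minimal sketch (assuming the intent is to pass only the bucket name and move the rest of the path into a prefix):

```python
# Sketch: list objects under a path by splitting the bucket name from the object prefix.
# storage_client is the client constructed in the snippet above.
bucket_name = "leap-scratch"   # just the bucket, no gs:// and no path
prefix = "jbusecke/"           # the path inside the bucket goes here

blobs = list(storage_client.list_blobs(bucket_name, prefix=prefix))
print(len(blobs))
```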
Yay! That worked.
As in exporting that on my local machine? I suppose that for many of the workflows we would want to have a notebook/script on the HPC cluster which creates an xarray object from e.g. many netcdfs and then writes a zarr store directly to the bucket (unless this is not a recommended workflow). Is there a way to use this token with gcsfs? I just tried naively:

```python
fs = gcsfs.GCSFileSystem(token=access_token)
```

which errors and then prints the token 😱, which is not ideal.
Looking at https://gcsfs.readthedocs.io/en/latest/#credentials, it looks like you can pass the `credentials` object directly as the `token`.
Amazing. To wrap up, here is what I did (reusing the `credentials` object from above):

```python
import gcsfs
import xarray as xr

fs = gcsfs.GCSFileSystem(token=credentials)
ds = xr.DataArray([1]).to_dataset(name='test')
mapper = fs.get_mapper('leap-scratch/jbusecke/test_offsite_upload.zarr')
ds.to_zarr(mapper)
```
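If useful, a quick sanity check that the write landed (same session, token still valid; reuses `xr` and `mapper` from the snippet above):

```python
# Sketch: read the freshly written zarr store back and print its contents.
ds_roundtrip = xr.open_zarr(mapper)
print(ds_roundtrip)
```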
This is awesome! Thanks. I will try this tomorrow with a collaborator. One last question: the collaborator should extract the token from their account, correct?
I anticipate the 1-hour limit will become a bottleneck for larger datasets in the future. If that could be relaxed somehow, I believe that would be very useful.
I'm not confident you get shut down if the token expires - the token can't be checked at every byte sent etc - so when is it checked? Is it checked in between each object uploaded, for example? @jbusecke if you come to practical conclusions about this, that's also very relevant to capture in documentation! I think it's likely that if a very large object is being copied, that large object gets copied all the way even if it takes 2 hours.
Also, is it better to not assign the token to a variable for security reasons? E.g.

```python
with open("token.txt") as f:
    # setup a storage client using credentials
    credentials = Credentials(f.read().strip())
```

Then again this only lives for 1 hour, so the risk is not particularly high I guess. Another comment re security: I noticed that with this credential I can also delete files. I did
Do you think this is also valid if the zarr store is written in many small chunks (~100-200MB) in a streaming fashion rather than uploading a large gzip file? I guess this will be a good test to perform.
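A rough sketch of such a test, in case it helps (the sizes, variable names, and target path are made up; it assumes dask is available and `fs` is the authenticated `GCSFileSystem` from above):

```python
# Sketch: write a dataset split into many ~160 MB chunks and watch whether the
# upload fails once the one-hour token expires mid-write.
import dask.array as da
import xarray as xr

# ~40 lazily generated chunks of ~160 MB each (10 * 2_000_000 * 8 bytes per chunk)
data = da.random.random((400, 2_000_000), chunks=(10, 2_000_000))
ds = xr.DataArray(data, dims=("time", "x"), name="test").to_dataset()

mapper = fs.get_mapper("leap-scratch/jbusecke/test_chunked_upload.zarr")
ds.to_zarr(mapper)  # observe whether this aborts once the token expires
```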
I will absolutely write up some docs once we have prototyped this. I assume this should go into the 2i2c docs (with LEAP linking there from our docs)?
@jbusecke I'd like this issue to stay scoped to how to extract short-lived credentials matching those provided to you as a user on the user server provided by the hub. As a separate matter, one can consider whether it's feasible for 2i2c to help provide read-only credentials to some users and read/write to others, but that is an additional, unrelated customization on top of the credentials provided to the user server in the first place.
I'd like these docs to live in scottyhq/jupyter-cloud-scoped-creds as a project, without assumptions of coupling to 2i2c or similar. I've also proposed that it's a project we help get into the jupyterhub GitHub org in the long run.
No clue!
Sounds good @consideRatio. I'll report back how our testing goes tomorrow.
Hey everyone, I think that my suspicion about the short validity of the token turns out to be a problem here. @jerrylin96 got an error, and I suspect every chunk written requires a valid authentication, and thus most of the datasets we are (and will be) using would require an access token that is valid for a longer time. @consideRatio is it possible to configure the time the token is valid?
I don't really think there's a way to make that token have a longer duration. I think instead, we should make a separate service account and try to securely provide credentials for that.
👍 on adding a specific service account to address this, short term at least. I remember reading about ways of getting longer durations for the tokens for AWS and/or GCP, but it required cloud configuration to allow for it, combined with explicit configuration in the request.
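If a dedicated service account ends up being the route taken, a minimal sketch of how its key could be used from the HPC side with gcsfs (the key filename is hypothetical; gcsfs accepts a path to a service-account JSON key via `token=`):

```python
# Sketch: authenticate gcsfs with a (hypothetical) service-account key file
# instead of a one-hour access token.
import gcsfs

fs = gcsfs.GCSFileSystem(token="leap-uploader-key.json")  # path to the JSON key
print(fs.ls("leap-scratch/jbusecke"))
```

Note that, unlike the short-lived token, such a key does not expire on its own and would need to be stored and shared carefully.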
Thanks for the update! Getting this figured out would unblock a bunch of people here at LEAP who want to upload and share datasets with the project (this will definitely also accelerate science) and is thus very high on my internal priority list. If there is any way I can help with this, please let me know.
GCS allows individual Google Users as well as Google Groups to have permissions to read / write to GCS buckets (unlike AWS). We can use this to allow community leaders to manage who can read and write to GCS buckets from outside the cloud by managing membership in a Google Group! In this commit, we set up the persistent buckets of the LEAP hubs to have this functionality. Access is managed via a Google Group - I have temporarily created this under the 2i2c org and invited Julius (the community champion) as an administrator. But perhaps it should just be created as a regular Google Group. Using groups here means managing this access does not require any 2i2c engineering work. Future work would probably fold the separate variable we have for determining whether a bucket is publicly accessible into this attribute as well. Ref https://github.com/2i2c-org/infrastructure/issues/2096
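For reference, a binding like that can also be expressed through the google-cloud-storage IAM API; a minimal sketch (the bucket name, group address, and role are assumptions, not the actual 2i2c configuration):

```python
# Sketch: grant a Google Group write access to a bucket via an IAM policy binding.
from google.cloud import storage

client = storage.Client()  # requires credentials with permission to edit bucket IAM
bucket = client.bucket("leap-persistent")  # hypothetical bucket name

policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append(
    {
        "role": "roles/storage.objectAdmin",
        "members": {"group:leap-bucket-writers@example.org"},  # hypothetical group
    }
)
bucket.set_iam_policy(policy)
```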
We provide storage buckets for users to write to and set up credentials for them within the user servers started on the hubs. But what if they want to upload something to those buckets from their local computer or similar - then how do they acquire permissions to do so?
@scottyhq has developed scottyhq/jupyter-cloud-scoped-creds, which currently has support for AWS S3 buckets but not for GCP buckets.
Work items
Resolved by Add subcommand for GCP scottyhq/jupyter-cloud-scoped-creds#2 (comment).
This should be documented as well, and it would be very good if the credentials are just temporary rather than long-term reusable.
We rely on GCP's workload identity, and AWS's IRSA, which are mechanisms to couple a k8s ServiceAccount with cloud provider credentials.
Resolved by Add subcommand for GCP scottyhq/jupyter-cloud-scoped-creds#2 (comment) - they are "short lived" and valid for one hour.
User requests
Maybe related
Related
jupyter-cloud-creds? scottyhq/jupyter-cloud-scoped-creds#5
aws or gcloud CLI's are missing scottyhq/jupyter-cloud-scoped-creds#6