Access to buckets on AWS and GCP from local computers #22

Open
2 of 6 tasks
consideRatio opened this issue Jan 26, 2023 · 24 comments

Comments

@consideRatio

consideRatio commented Jan 26, 2023

We provide storage buckets for users to write to, and set up credentials for them within the user servers started on the hubs. But what if they want to upload something to those buckets from their local computer or similar? How do they acquire permissions to do so?

@scottyhq has developed scottyhq/jupyter-cloud-scoped-creds, which currently supports AWS S3 buckets but not GCP buckets.

Work items

User requests

Maybe related

Related

@consideRatio consideRatio changed the title from "Access to buckets on AWS and GCP" to "Access to buckets on AWS and GCP from local computers" Jan 27, 2023
@consideRatio consideRatio self-assigned this Jan 28, 2023
@jbusecke

jbusecke commented Feb 2, 2023

Would this work for someone on an HPC system as well? If so, that might be a solution to the ticket I opened today (not sure how to link those TBH).

@consideRatio
Author

I lack experience with HPC systems, but is the difference between "your computer" and "an HPC system" that you only have terminal access, without the ability to open a browser etc.?

If so, I think the answer is yes. You can still extract temporary cloud credentials from a hub at 2i2c, and this ought to be independent of where you extract them to. They can then be used from a terminal on an HPC system using the aws or gcloud CLI, or, for example, Google's Python cloud storage client.

@jbusecke

That sounds good. Is there a preliminary implementation of this? We have a few time-sensitive tasks which include some form of "upload from HPC" task. Happy to test drive stuff.

@consideRatio
Author

consideRatio commented Feb 13, 2023

Is there a preliminary implementation of this? We have a few time-sensitive tasks which include some form of "upload from HPC" task. Happy to test drive stuff.

If you can verify this workflow @jbusecke, it would be helpful!

  1. Install gcloud (google-cloud-sdk) in the user image you use if it's not already installed
  2. Start a user server, enter a terminal, and run: gcloud auth print-access-token
  3. Use the generated token as described in Add subcommand for GCP scottyhq/jupyter-cloud-scoped-creds#2 (comment) from an HPC terminal with gcloud storage cp or similar.

Note that the token lasts for one hour. I think re-running the print-access-token command relies on a cached token, so the hour counts from the initial generation unless you clear the cache, which lives somewhere in the home folder.
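
To keep track of the one-hour lifetime mentioned in the note above, a small helper can record when a token was first generated and estimate how long it stays valid. This is only a minimal sketch: the one-hour figure comes from the note, while `TOKEN_LIFETIME`, `seconds_remaining`, `is_usable`, and the five-minute safety margin are hypothetical names and values chosen for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: access tokens last about one hour from their initial generation.
TOKEN_LIFETIME = timedelta(hours=1)


def seconds_remaining(issued_at, now=None, lifetime=TOKEN_LIFETIME):
    """Return the estimated seconds before the token expires (never negative)."""
    now = now or datetime.now(timezone.utc)
    remaining = (issued_at + lifetime - now).total_seconds()
    return max(0.0, remaining)


def is_usable(issued_at, now=None, margin=timedelta(minutes=5)):
    """Treat the token as unusable once it is within `margin` of its estimated expiry."""
    return seconds_remaining(issued_at, now) > margin.total_seconds()
```

Because of the caching behavior described above, `issued_at` should be the time the token was first generated, not the time the command was last re-run.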

@jbusecke

Just to confirm:

Install gcloud (google-cloud-sdk) in the user image you use if it's not already installed

This would be on a running server on the hub? And installation is via these instructions?

@yuvipanda
Member

@jbusecke yep! But I think on the pangeo-data images, they are probably already installed. They're also available from conda if you prefer https://anaconda.org/conda-forge/google-cloud-sdk

@jbusecke

But I think on the pangeo-data images, they are probably already installed.

I just tested gcloud --help and got bash: gcloud: command not found. I believe this means I have to install it? I'll try the conda route.

@jbusecke

Ok here are the steps I took:

  1. Installed google-cloud-sdk on my running server with mamba install google-cloud-sdk
  2. Generated token with gcloud auth print-access-token
  3. Copied that token into a local text file, token.txt, on my laptop
  4. On my laptop I ran
from google.cloud import storage
from google.oauth2.credentials import Credentials

# import an access token
# - option 1: read an access token from a file
with open("token.txt") as f:
    access_token = f.read().strip()

# setup a storage client using credentials
credentials = Credentials(access_token)
storage_client = storage.Client(credentials=credentials)

and got this warning:

/Users/juliusbusecke/miniconda/envs/test-gcs-token/lib/python3.10/site-packages/google/auth/_default.py:83: UserWarning: Your application has authenticated using end user credentials from Google Cloud SDK without a quota project. You might receive a "quota exceeded" or "API not enabled" error. We recommend you rerun `gcloud auth application-default login` and make sure a quota project is added. Or you can use service accounts instead. For more information about service accounts, see https://cloud.google.com/docs/authentication/
  warnings.warn(_CLOUD_SDK_CREDENTIALS_WARNING)
  5. I then tried to list my LEAP scratch bucket:
# test the storage client by trying to list content in a google storage bucket
bucket_name = "leap-scratch/jbusecke"  # don't include gs:// here
blobs = list(storage_client.list_blobs(bucket_name))
print(len(blobs))

which got me a 404 error

---------------------------------------------------------------------------
NotFound                                  Traceback (most recent call last)
Cell In[3], line 3
      1 # test the storage client by trying to list content in a google storage bucket
      2 bucket_name = "leap-scratch/jbusecke"  # don't include gs:// here
----> 3 blobs = list(storage_client.list_blobs(bucket_name))
      4 print(len(blobs))

File ~/miniconda/envs/test-gcs-token/lib/python3.10/site-packages/google/api_core/page_iterator.py:208, in Iterator._items_iter(self)
206 def _items_iter(self):
207 """Iterator for each item returned."""
--> 208 for page in self._page_iter(increment=False):
209 for item in page:
210 self.num_results += 1

File ~/miniconda/envs/test-gcs-token/lib/python3.10/site-packages/google/api_core/page_iterator.py:244, in Iterator._page_iter(self, increment)
232 def _page_iter(self, increment):
233 """Generator of pages of API responses.
234
235 Args:
(...)
242 Page: each page of items from the API.
243 """
--> 244 page = self._next_page()
245 while page is not None:
246 self.page_number += 1

File ~/miniconda/envs/test-gcs-token/lib/python3.10/site-packages/google/api_core/page_iterator.py:373, in HTTPIterator._next_page(self)
366 """Get the next page in the iterator.
367
368 Returns:
369 Optional[Page]: The next page in the iterator or :data:`None` if
370 there are no pages left.
371 """
372 if self._has_next_page():
--> 373 response = self._get_next_page_response()
374 items = response.get(self._items_key, ())
375 page = Page(self, items, self.item_to_value, raw_page=response)

File ~/miniconda/envs/test-gcs-token/lib/python3.10/site-packages/google/api_core/page_iterator.py:432, in HTTPIterator._get_next_page_response(self)
430 params = self._get_query_params()
431 if self._HTTP_METHOD == "GET":
--> 432 return self.api_request(
433 method=self._HTTP_METHOD, path=self.path, query_params=params
434 )
435 elif self._HTTP_METHOD == "POST":
436 return self.api_request(
437 method=self._HTTP_METHOD, path=self.path, data=params
438 )

File ~/miniconda/envs/test-gcs-token/lib/python3.10/site-packages/google/cloud/storage/_http.py:72, in Connection.api_request(self, *args, **kwargs)
70 if retry:
71 call = retry(call)
---> 72 return call()

File ~/miniconda/envs/test-gcs-token/lib/python3.10/site-packages/google/api_core/retry.py:349, in Retry.__call__.<locals>.retry_wrapped_func(*args, **kwargs)
345 target = functools.partial(func, *args, **kwargs)
346 sleep_generator = exponential_sleep_generator(
347 self._initial, self._maximum, multiplier=self._multiplier
348 )
--> 349 return retry_target(
350 target,
351 self._predicate,
352 sleep_generator,
353 self._timeout,
354 on_error=on_error,
355 )

File ~/miniconda/envs/test-gcs-token/lib/python3.10/site-packages/google/api_core/retry.py:191, in retry_target(target, predicate, sleep_generator, timeout, on_error, **kwargs)
189 for sleep in sleep_generator:
190 try:
--> 191 return target()
193 # pylint: disable=broad-except
194 # This function explicitly must deal with broad exceptions.
195 except Exception as exc:

File ~/miniconda/envs/test-gcs-token/lib/python3.10/site-packages/google/cloud/_http/__init__.py:494, in JSONConnection.api_request(self, method, path, query_params, data, content_type, headers, api_base_url, api_version, expect_json, _target_object, timeout, extra_api_info)
482 response = self._make_request(
483 method=method,
484 url=url,
(...)
490 extra_api_info=extra_api_info,
491 )
493 if not 200 <= response.status_code < 300:
--> 494 raise exceptions.from_http_response(response)
496 if expect_json and response.content:
497 return response.json()

NotFound: 404 GET https://storage.googleapis.com/storage/v1/b/leap-scratch/jbusecke/o?projection=noAcl&prettyPrint=false: Not Found

Am I using the url path wrong here?

@yuvipanda
Member

@jbusecke try the bucket name as just leap-scratch?
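
The 404 in the traceback comes from the request URL treating "leap-scratch/jbusecke" as a bucket name, whereas `list_blobs` expects the bucket name alone. A minimal sketch of one way to keep using "bucket/folder" style paths is to split them first and pass the folder as a `prefix`; `split_gcs_path` and `list_blob_names` are hypothetical helper names, and `storage_client` is assumed to be a `google.cloud.storage.Client` like the one created earlier in the thread.

```python
def split_gcs_path(path):
    """Split 'bucket/some/prefix' (optionally starting with gs://) into (bucket, prefix)."""
    if path.startswith("gs://"):
        path = path[len("gs://"):]
    bucket, _, prefix = path.partition("/")
    return bucket, prefix


def list_blob_names(storage_client, path):
    """List object names under a path using a google.cloud.storage.Client."""
    bucket, prefix = split_gcs_path(path)
    # list_blobs takes the bucket name alone; the folder goes in `prefix`.
    blobs = storage_client.list_blobs(bucket, prefix=prefix or None)
    return [blob.name for blob in blobs]
```

With this, `list_blob_names(storage_client, "leap-scratch/jbusecke")` would query the `leap-scratch` bucket and only return objects whose names start with `jbusecke`.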

@yuvipanda
Member

I think you can also use the environment variable CLOUDSDK_AUTH_ACCESS_TOKEN, and then use regular gsutil commands to access storage.
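
A minimal sketch of the environment variable approach, driven from Python so the token does not have to be exported in the shell profile. It assumes the Cloud SDK is installed locally and uses `gcloud storage ls`; whether plain gsutil honors the variable in the same way is not confirmed in this thread. `gcloud_env` and `list_bucket` are hypothetical helper names.

```python
import os
import subprocess


def gcloud_env(access_token, base_env=None):
    """Return a copy of the environment with the access token set for the Cloud SDK."""
    env = dict(os.environ if base_env is None else base_env)
    env["CLOUDSDK_AUTH_ACCESS_TOKEN"] = access_token
    return env


def list_bucket(bucket_url, access_token):
    """Run `gcloud storage ls` against bucket_url with the token (requires gcloud locally)."""
    result = subprocess.run(
        ["gcloud", "storage", "ls", bucket_url],
        env=gcloud_env(access_token),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

Setting the variable only for the child process keeps the token out of the parent shell's environment and history.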

@jbusecke

jbusecke commented Feb 13, 2023

@jbusecke try the bucket name as just leap-scratch?

Yay! That worked.

use the environment variable CLOUDSDK_AUTH_ACCESS_TOKEN,

As in exporting that on my local machine?

I suppose that for many of the workflows we would want to have a notebook/script on the HPC cluster which creates an xarray object from e.g. many netcdfs and then write a zarr store directly to the bucket (unless this is not a recommended workflow). Is there a way to use this token with gcsfs? I just tried naively:

fs = gcsfs.GCSFileSystem(token=access_token)

which errors with

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[4], line 1
----> 1 fs = gcsfs.GCSFileSystem(token=access_token)

File ~/miniconda/envs/test-gcs-token/lib/python3.10/site-packages/fsspec/spec.py:76, in _Cached.__call__(cls, *args, **kwargs)
     74     return cls._cache[token]
     75 else:
---> 76     obj = super().__call__(*args, **kwargs)
     77     # Setting _fs_token here causes some static linters to complain.
     78     obj._fs_token_ = token

File ~/miniconda/envs/test-gcs-token/lib/python3.10/site-packages/gcsfs/core.py:305, in GCSFileSystem.__init__(self, project, access, token, block_size, consistency, cache_timeout, secure_serialize, check_connection, requests_timeout, requester_pays, asynchronous, session_kwargs, loop, timeout, endpoint_url, default_location, version_aware, **kwargs)
    299 if check_connection:
    300     warnings.warn(
    301         "The `check_connection` argument is deprecated and will be removed in a future release.",
    302         DeprecationWarning,
    303     )
--> 305 self.credentials = GoogleCredentials(project, access, token)

File ~/miniconda/envs/test-gcs-token/lib/python3.10/site-packages/gcsfs/credentials.py:50, in GoogleCredentials.__init__(self, project, access, token, check_credentials)
     48 self.lock = threading.Lock()
     49 self.token = token
---> 50 self.connect(method=token)
     52 if check_credentials:
     53     warnings.warn(
     54         "The `check_credentials` argument is deprecated and will be removed in a future release.",
     55         DeprecationWarning,
     56     )

File ~/miniconda/envs/test-gcs-token/lib/python3.10/site-packages/gcsfs/credentials.py:226, in GoogleCredentials.connect(self, method)
    207 """
    208 Establish session token. A new token will be requested if the current
    209 one is within 100s of expiry.
   (...)
    215     If None, will try sequence of methods.
    216 """
    217 if method not in [
    218     "google_default",
    219     "cache",
   (...)
    224     None,
    225 ]:
--> 226     self._connect_token(method)
    227 elif method is None:
    228     for meth in ["google_default", "cache", "cloud", "anon"]:

File ~/miniconda/envs/test-gcs-token/lib/python3.10/site-packages/gcsfs/credentials.py:147, in GoogleCredentials._connect_token(self, token)
    145 if isinstance(token, str):
    146     if not os.path.exists(token):
--> 147         raise FileNotFoundError(token)
    148     try:
    149         # is this a "service" token?
    150         self._connect_service(token)

FileNotFoundError: 

and then prints the token 😱, which is not ideal

@yuvipanda
Member

Looking at https://gcsfs.readthedocs.io/en/latest/#credentials, looks like you can pass the Credentials object with the token in it rather than the string.

@jbusecke

Amazing. To wrap up what I did:
Steps 1-4 as above.

Then

import gcsfs
import xarray as xr
fs = gcsfs.GCSFileSystem(token=credentials)
ds = xr.DataArray([1]).to_dataset(name='test')
mapper = fs.get_mapper('leap-scratch/jbusecke/test_offsite_upload.zarr')
ds.to_zarr(mapper)

and I confirmed that the zarr store was written.

@jbusecke

This is awesome! Thanks.

I will try this tomorrow with a collaborator. One last question. The collaborator should extract the token from their account, correct?

@jbusecke

I anticipate the 1 hour limit to become a bottleneck for larger datasets in the future. If that could be relaxed somehow in the future I believe that would be very useful.

@consideRatio
Author

I anticipate the 1 hour limit to become a bottleneck for larger datasets in the future. If that could be relaxed somehow in the future I believe that would be very useful.

I'm not confident that you get shut down when the token expires. The token can't be checked at every byte sent, so when is it checked? Is it checked between each object uploaded by, for example, gsutil, or between each request made?

@jbusecke if you come to practical conclusions about this, that's also very relevant to capture in documentation! I think it's likely that if a very large object is being copied, that large object gets copied all the way even if it takes 2 hours.

@jbusecke

Also is it better to not assign the token to a variable for security reasons?

E.g.

with open("token.txt") as f:
    # setup a storage client using credentials
    credentials = Credentials(f.read().strip())

Then again this only lives for 1 hour, so the risk is not particularly high I guess.

Another comment re security: I noticed that with this credential I can also delete files. I did fs.rm('leap-scratch/jbusecke/test_offsite_upload.zarr', recursive=True). Wondering if there is a way for write/read only permissions to avoid mishaps for novel users.
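
On the question of keeping the token out of sight: a minimal sketch of a reader that never echoes the token in error messages, and that hands gcsfs or the storage client a `Credentials` object rather than a string (the string form is what caused the token to appear in the `FileNotFoundError` earlier). `read_token` and `make_credentials` are hypothetical helper names; this does not make the token any more secret in memory, it only keeps it out of notebook output and tracebacks.

```python
from pathlib import Path


def read_token(path):
    """Read an access token from a file without echoing it in error messages."""
    token = Path(path).read_text().strip()
    if not token:
        raise ValueError(f"No token found in {path}")
    if any(ch.isspace() for ch in token):
        raise ValueError(f"{path} should contain a single token with no whitespace")
    return token


def make_credentials(path):
    """Build google.oauth2 Credentials straight from the token file."""
    # Imported lazily so read_token can be used without google-auth installed.
    from google.oauth2.credentials import Credentials

    return Credentials(read_token(path))
```

The result of `make_credentials("token.txt")` can be passed as `credentials=` to `storage.Client` or as `token=` to `gcsfs.GCSFileSystem`, as done elsewhere in this thread.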

@jbusecke

I think it's likely that if a very large object is being copied, that large object gets copied all the way even if it takes 2 hours.

Do you think this is also valid if the zarr store is written in many small chunks (~100-200MB) in a streaming fashion rather than uploading a large gzip file? I guess this will be a good test to perform.

@jbusecke if you come to practical conclusions about this, that's also very relevant to capture in documentation!

I will absolutely write up some docs once we have prototyped this. I assume this should go into the 2i2c docs (with leap linking there from our docs?)

@consideRatio
Author

@jbusecke I'd like this issue to stay scoped to how to extract short-lived credentials matching those provided to you as a user on the user server provided by the hub.

In another issue, one can consider whether it's feasible for 2i2c to provide read-only credentials to some users and read/write credentials to others, but that is an additional, unrelated customization of the credentials provided to the user server in the first place.


I will absolutely write up some docs once we have prototyped this. I assume this should go into the 2i2c docs (with leap linking there from our docs?)

I'd like these docs to live in scottyhq/jupyter-cloud-scoped-creds as a project, without assumptions of coupling to 2i2c or similar. I've also proposed that it's a project we help get into the jupyterhub GitHub org in the long run.

Do you think this is also valid [...]

No clue!

@jbusecke

Sounds good @consideRatio. I'll report back how our testing goes tomorrow.

@jbusecke

jbusecke commented Mar 14, 2023

Hey everyone,
@jerrylin96 and I have successfully uploaded a test dataset from HPC to the persistent bucket according to the steps outlined above. 🎉

But I think that my suspicion about the short validity of the token

I think it's likely that if a very large object is being copied, that large object gets copied all the way even if it takes 2 hours.

Do you think this is also valid if the zarr store is written in many small chunks (~100-200MB) in a streaming fashion rather than uploading a large gzip file? I guess this will be a good test to perform.

turns out to be a problem here. @jerrylin96 got an Invalid Credentials 401 error after about 1 hr of uploading.


I suspect every chunk written requires a valid authentication and thus most of the datasets we are (and will be) using would require an access token that is valid for a longer time.
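
If every chunk write does need a valid token, one workaround to experiment with is reloading credentials between chunks once they get old. The following is a minimal sketch under strong assumptions: `write_chunk` and `load_credentials` are hypothetical callables supplied by the caller, the 45-minute threshold is arbitrary, and `load_credentials` must be able to return a fresh token (for example by re-reading a token file that someone updates from the hub). It does not extend the lifetime of any single token.

```python
import time


def write_chunks(chunks, write_chunk, load_credentials,
                 max_age_seconds=45 * 60, clock=time.monotonic):
    """Write chunks one by one, reloading credentials once they exceed max_age_seconds.

    Returns the number of times credentials were reloaded after the initial load.
    """
    credentials = load_credentials()
    loaded_at = clock()
    reloads = 0
    for chunk in chunks:
        # Refresh before the write, so a stale token is never used to start a chunk.
        if clock() - loaded_at > max_age_seconds:
            credentials = load_credentials()
            loaded_at = clock()
            reloads += 1
        write_chunk(chunk, credentials)
    return reloads
```

A single chunk that itself takes longer than the token lifetime is not covered by this sketch.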

@consideRatio is it possible to configure the time the token is valid?

@yuvipanda
Member

I don't really think there's a way to make that token have a longer duration.

I think instead, we should make a separate service account and try to securely provide credentials for that.

@consideRatio
Author

👍 on adding a specific service account to address this short term at least.

I remember reading about ways of getting longer durations for the tokens for AWS and/or GCP, but it required cloud configuration to allow for it, combined with explicit configuration in the request.

@jbusecke

Thanks for the update!

Getting this figured out would unlock a bunch of people here at LEAP to upload and share datasets with the project (this will definitely also accelerate science) and is thus very high on my internal priorities list.

If there is any way I can help with this, please let me know.

yuvipanda referenced this issue in yuvipanda/pilot-hubs Mar 22, 2023
GCS allows individual Google Users as well as Google Groups
to have permissions to read / write to GCS buckets (unlike AWS).
We can use this to allow community leaders to manage who can read
and write to GCS buckets from outside the cloud by managing membership
in a Google Group!

In this commit, we set up the persistent buckets of the LEAP hubs
to have this functionality. Access is managed via a Google Group -
I have temporarily created this under the 2i2c org and invited
Julius (the community champion) as an administrator. But perhaps
it should be just created as a regular google group. Using groups
here allows management of this access to not require any 2i2c
engineering work.

Future work would probably fold the separate variable we have
for determining if a bucket is accessible publicly as an attribute
as well.

Ref https://github.com/2i2c-org/infrastructure/issues/2096
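
With group-based access as described in the commit message above, a group member could presumably authenticate with their own Google account instead of a copied short-lived token. A minimal sketch, assuming the account has been added to the Google Group and `gcloud auth application-default login` has been run on the local or HPC machine; `open_bucket_filesystem` and its `factory` parameter are hypothetical names used for illustration.

```python
def open_bucket_filesystem(factory=None):
    """Return a GCS filesystem authenticated with the local user's own Google credentials."""
    if factory is None:
        # Imported lazily so the helper can be exercised without gcsfs installed.
        import gcsfs

        factory = gcsfs.GCSFileSystem
    # "google_default" asks gcsfs to use Application Default Credentials.
    return factory(token="google_default")
```

The returned filesystem could then be used with `fs.get_mapper(...)` and `ds.to_zarr(...)` in the same way as in the earlier upload example.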
yuvipanda referenced this issue in yuvipanda/pilot-hubs Apr 3, 2023
yuvipanda referenced this issue in yuvipanda/pilot-hubs May 18, 2023
@yuvipanda yuvipanda transferred this issue from 2i2c-org/infrastructure Apr 1, 2024
@consideRatio consideRatio removed their assignment Jun 26, 2024