
Implement S3 storage class for Streaming and External basic transfer adapters #3

Closed
shevron opened this issue Apr 4, 2020 · 8 comments
Labels: enhancement (New feature or request), good first issue (Good for newcomers)

Comments

shevron commented Apr 4, 2020

We can relatively easily implement S3 support for both streamed and direct-to-cloud storage modeled after our current Azure backend.

Can use https://github.com/datopian/node-git-lfs/tree/master/lib/store as a reference (although our architecture is different, the usage of S3 APIs should be similar).

shevron added the enhancement and good first issue labels on Apr 4, 2020
tomeksabala commented

This is a requirement for my team's project, and we're in the middle of working on an AWS S3 storage implementation.
I'll gladly submit a PR once we have a working draft, for review and possible merge.

shevron commented Mar 8, 2021

@tomeksabala that would be awesome!

tomeksabala commented

Thanks @shevron. I hope you don't mind if I ask you a few questions about the feature.

While trying to implement the ExternalStorage interface I stumbled on a problem with the get_upload_action method. I couldn't find any clear documentation on the response dict that should be returned, or on how it is used afterwards.
The Azure and Google Cloud implementations return a POST URL with all fields URL-encoded.

My draft implementation looks like the following:

    def get_upload_action(self, prefix: str, oid: str, size: int, expires_in: int,
                          extra: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
        # Generate a presigned POST, then fold the returned form fields into
        # the URL as a query string, mirroring the Azure and Google Cloud adapters
        response = self.s3_client.generate_presigned_post(
            self.aws_s3_bucket_name,
            self._get_blob_path(prefix, oid),
            ExpiresIn=expires_in,
        )
        params = urllib.parse.urlencode(response['fields'])
        href = f"{response['url']}?{params}"
        return {
            "actions": {
                "upload": {
                    "href": href,
                    "header": {},
                    "expires_in": expires_in,
                }
            }
        }

The AWS S3 documentation has good examples of how to generate presigned POST requests. These, however, return a set of fields which should be added to the request alongside the file to be uploaded: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-presigned-urls.html#generating-a-presigned-url-to-upload-a-file
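
For reference, the example in those docs consumes the presigned POST roughly like this (an adapted sketch; s3_client, bucket_name, object_key and file_path are placeholders):

    import requests

    response = s3_client.generate_presigned_post(bucket_name, object_key,
                                                 ExpiresIn=3600)
    with open(file_path, 'rb') as f:
        # The signed fields travel in the multipart form body, next to the file
        files = {'file': (object_key, f)}
        upload = requests.post(response['url'], data=response['fields'], files=files)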

This just seems odd. Could you explain why the other storage implementations URL-encode the fields data instead of passing it as the usual POST form-data content type?

shevron commented Mar 9, 2021

@tomeksabala of course. The dict returned in basic transfer mode should follow the Git LFS protocol for upload actions: it should have href, header and expires_in keys, as you see in the other adapters, and href should be a URL to send a PUT request to (not POST!), with no form-data encoding, just the file contents itself as the request body.

If you can get AWS to sign a PUT rather than a POST / form-data request, that should be the right way to do it. IIRC S3 has such an API; I think this one can work.

Git LFS clients expect (at least in basic transfer mode) to get a URL they can send a PUT to, with just the file contents and no special encoding like form-data. You can specify additional headers for that request in the header dict, and of course any query-string parameters in the URL itself, but the body is not something you can specify; it should just be the file itself.
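
Something along these lines should do it with boto3 (an untested sketch on my side, reusing the client, bucket name and blob path helpers from your draft):

    def get_upload_action(self, prefix: str, oid: str, size: int, expires_in: int,
                          extra: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
        # Sign a plain PUT; the Git LFS client sends the raw file contents
        # as the request body, with no form-data encoding
        href = self.s3_client.generate_presigned_url(
            'put_object',
            Params={'Bucket': self.aws_s3_bucket_name,
                    'Key': self._get_blob_path(prefix, oid)},
            ExpiresIn=expires_in,
        )
        return {
            "actions": {
                "upload": {
                    "href": href,
                    "header": {},
                    "expires_in": expires_in,
                }
            }
        }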

Note that this no longer applies if you're implementing MultipartStorage, but that is a different case (and I think it should be a separate ticket if ever implemented for S3, once we see ExternalStorage working properly).

tomeksabala commented

@shevron This is very helpful, thanks! I've found a piece of the docs explaining the basic transfer mode, which sheds some light on the issue.

I was planning to implement MultipartStorage too, but if I understood correctly you'd prefer to have it done in a separate PR?

shevron commented Mar 10, 2021

@tomeksabala If the PR doesn't become too big, I'm totally fine with implementing it all in one go; up to you 👍

tomeksabala commented

@shevron Thank you once again for all the help. I did manage to finish my work this week and I've submitted a PR.

shevron commented Mar 15, 2021

Implemented by @tomeksabala in #81

shevron closed this as completed Mar 15, 2021