Provide prefix sizes #81
(This is what's in cellpainting-gallery/cpg0016-jump) |
FYI when you're ready to add this to a new data release, these can now be quickly and easily calculated with https://github.com/broadinstitute/cpg |
I'm working on this (as per the minigrant #8). On a first run using the CPG index I got this table (all sizes are in TB). I can tell there are several errors, especially with workspace and workspace_dl, but it seems to be an issue with the index, as (for example) only cpg0016-jump sources 1, 2 and 7 seem to have their workspace_dl indexed. I'll look into this and ask Ank, but let me know if this is looking like you were envisioning! @ErinWeisbart :)
|
Thanks @emiglietta! |
Here's the complete table of sizes for cpg0016-jump. The size of workspace_dl was calculated using the AWS CLI, since most of those directories were not parsed as expected when creating the index (see this issue).
|
Thanks Esteban! Did you write any code for generating these numbers? If so, can you add it as a comment here? (I imagine it's not gigantic?). I ask because I know we have uploads still happening to at least @shntnu I was thinking this table might fit in this repo in |
Sure! This is the code I used to get the images and workspace sizes:
Since there was an issue with the indexing of workspace_dl, I used good old aws s3 ls to get those numbers:
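A minimal sketch of that approach (not the exact commands used above), assuming the prefix is publicly readable and the AWS CLI is installed. It runs `aws s3 ls` with `--recursive --summarize` and reads the byte count from the `Total Size:` line. The function names and the decimal-TB conversion are illustrative assumptions.

```python
import re
import subprocess

# Assumption: the table reports decimal terabytes (10**12 bytes), not TiB.
BYTES_PER_TB = 10**12


def parse_total_size(summary_text: str) -> int:
    """Return the byte count from the 'Total Size:' line of `aws s3 ls --summarize` output."""
    match = re.search(r"Total Size:\s*(\d+)", summary_text)
    if match is None:
        raise ValueError("no 'Total Size:' line found in the listing output")
    return int(match.group(1))


def prefix_size_tb(s3_uri: str) -> float:
    """List a prefix recursively with the AWS CLI and return its total size in TB."""
    result = subprocess.run(
        ["aws", "s3", "ls", s3_uri, "--recursive", "--summarize", "--no-sign-request"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_total_size(result.stdout) / BYTES_PER_TB
```

For example, `prefix_size_tb("s3://cellpainting-gallery/cpg0016-jump/source_1/workspace_dl/")` would return the size of one source's workspace_dl. For very large prefixes the recursive listing itself is long, so this can take a while and buffers the whole listing in memory.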
|
https://github.com/jump-cellpainting/datasets/tree/main/stats would be more appropriate |
Even better, thanks @shntnu! @emiglietta, would you file a PR for adding data_size.csv in the folder Shantanu linked above? |
I think it would be helpful to provide a breakdown of data sizes by source/numerical data/image data so people have an idea of what they're getting into before downloading without having to list the bucket themselves.
I'm not sure how much is still in flux, but our dashboard auto-calculated these prefix sizes, current as of right now. I'm happy to flesh out/update.
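As a rough illustration of the kind of breakdown meant here, the sketch below groups object sizes by source and top-level folder. It assumes keys follow a `<project>/<source>/<category>/...` layout (for example `cpg0016-jump/source_1/images/...`) and that sizes are reported in decimal TB. The function name and the input format (pairs of key and size in bytes, e.g. from a bucket listing or index) are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Assumption: sizes are reported in decimal terabytes (10**12 bytes).
BYTES_PER_TB = 10**12


def size_breakdown(objects: Iterable[Tuple[str, int]]) -> Dict[Tuple[str, str], float]:
    """Sum object sizes per (source, category) and return the totals in TB.

    `objects` yields (key, size_in_bytes) pairs. Keys are assumed to look like
    '<project>/<source>/<category>/...'; shallower keys are skipped.
    """
    totals = defaultdict(int)
    for key, size in objects:
        parts = key.split("/")
        if len(parts) < 4:  # need project/source/category plus at least a file name
            continue
        _project, source, category = parts[:3]
        totals[(source, category)] += size
    return {group: total / BYTES_PER_TB for group, total in totals.items()}
```

The resulting dictionary maps pairs such as `("source_1", "images")` to a size in TB, which can be written out as one row per source and category.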