Provide prefix sizes #81

Open

ErinWeisbart opened this issue Sep 7, 2023 · 9 comments
@ErinWeisbart (Contributor)

I think it would be helpful to provide a breakdown of data sizes by source and by data type (numerical vs. image), so people have an idea of what they're getting into before downloading, without having to list the bucket themselves.

I'm not sure how much is still in flux, but our dashboard auto-calculated these prefix sizes (current as of right now). I'm happy to flesh them out or update them.

| source | images size (TB) | workspace size (TB) | workspace_dl size (TB) | total size (TB) |
|---|---|---|---|---|
| 1 | | | | 13.2 |
| 2 | 7.6 | 10.8 | | 21.6 |
| 3 | 16.6 | 20.6 | | 42.5 |
| 4 | 17.6 | 17.3 | | 39.1 |
| 5 | 13.1 | 32 | 7.4 | |
| 6 | 11.7 | 25.8 | | 43.7 |
| 7 | | | | 14.9 |
| 8 | 7.2 | 12.1 | | 24.4 |
| 9 | 9.2 | 17.8 | 7.1 | |
| 10 | 7.5 | 11.3 | | 21.6 |
| 11 | | 10.3 | | 21.6 |
| 13 | | 15.8 | 6.8 | |
@ErinWeisbart (Contributor, Author)

(This is what's in cellpainting-gallery/cpg00016-jump)
I'm planning on providing the total size in the cellpainting-gallery README but I think a by-source breakdown belongs in this repo.

@shntnu added the cpg0016 label Dec 8, 2023
@ErinWeisbart (Contributor, Author)

FYI when you're ready to add this to a new data release, these can now be quickly and easily calculated with https://github.com/broadinstitute/cpg

@emiglietta

I'm working on this (as per minigrant #8).

On a first run using the CPG index I got this table (all sizes are in TB). I can tell there are several errors, especially with workspace and workspace_dl, but it seems to be an issue with the index itself: for example, only cpg0016-jump sources 1, 2, and 7 appear to have their workspace_dl indexed.

I'll look into this and ask Ank, but let me know if this is looking like you were envisioning! @ErinWeisbart :)
Thanks!

| dataset_id | source_id | images_tb | workspace_tb | workspace_dl_tb | total_size_tb |
|---|---|---|---|---|---|
| cpg0000-jump-pilot | source_4 | 6 | 6.1 | 0 | 12.2 |
| cpg0001-cellpainting-protocol | source_4 | 18 | 21.5 | 0 | 39.4 |
| cpg0002-jump-scope | source_4 | 0.9 | 4.2 | 0 | 5.1 |
| cpg0004-lincs | broad | 22 | 1.6 | 0 | 23.5 |
| cpg0005-gerry-bioactivity | broad | 0.3 | 0 | 0 | 0.3 |
| cpg0006-miami | broad | 0.6 | 0.8 | 0 | 1.4 |
| cpg0009-molglue | broad | 0.3 | 0 | 0 | 0.3 |
| cpg0010-caie-drugresponse | broad-az | 0.1 | 0 | 0 | 0.1 |
| cpg0011-lipocyteprofiler | broad | 1.2 | 0 | 0 | 1.2 |
| cpg0012-wawer-bioactivecompoundprofiling | broad | 3 | 6.9 | 0 | 9.9 |
| cpg0014-jump-adipocyte | broad | 7.7 | 6.9 | 0 | 14.5 |
| cpg0016-jump | source_1 | 4.9 | 6.2 | 2.1 | 13.2 |
| cpg0016-jump | source_2 | 7.6 | 10.8 | 3.1 | 21.6 |
| cpg0016-jump | source_3 | 16.6 | 20.6 | 0 | 37.2 |
| cpg0016-jump | source_4 | 17.6 | 17.3 | 0 | 34.9 |
| cpg0016-jump | source_5 | 13.1 | 32 | 0 | 45.1 |
| cpg0016-jump | source_6 | 11.7 | 25.8 | 0 | 37.5 |
| cpg0016-jump | source_7 | 5.6 | 5.4 | 3.9 | 14.9 |
| cpg0016-jump | source_8 | 7.2 | 12.1 | 0 | 19.3 |
| cpg0016-jump | source_9 | 9.2 | 17.8 | 0 | 27 |
| cpg0016-jump | source_10 | 7.5 | 11.3 | 0 | 18.8 |
| cpg0016-jump | source_11 | 7.6 | 10.3 | 0 | 17.9 |
| cpg0016-jump | source_13 | 6.7 | 15.6 | 0 | 22.3 |
| cpg0017-rohban-pathways | broad | 0.2 | 0.1 | 0 | 0.3 |
| cpg0018-singh-seedseq | broad | 0.2 | 0 | 0 | 0.2 |
| cpg0018-singh-seedseq | x | 0 | 0 | 0 | 0 |
| cpg0019-moshkov-deepprofiler | broad | 0 | 0 | 0 | 0 |
| cpg0020-varchamp | broad | 3 | 0.8 | 0 | 3.8 |
| cpg0021-periscope | broad | 15.6 | 0.6 | 0 | 16.2 |
| cpg0022-cmqtl | broad | 2 | 0.8 | 0 | 2.8 |
| cpg0024-bortezomib | source_4 | 0.2 | 0.1 | 0 | 0.3 |
| cpg0025-dactyloscopy | broad | 3.1 | 0 | 0 | 3.1 |
| cpg0026-lacoste_haghighi-rare-diseases | broad | 2.8 | 0.6 | 0 | 3.4 |
| cpg0028-kelley-resistance | broad | 1.9 | 2.2 | 0 | 4.1 |
| cpg0029-chroma-pilot | broad | 0.4 | 0 | 0 | 0.4 |
| cpg0030-gustafsdottir-cellpainting | broad | 0.2 | 0 | 0 | 0.2 |
| cpg0031-caicedo-cmvip | broad | 0.6 | 1.4 | 0 | 2 |
| cpg0032-pooled-rare | broad | 0.1 | 0 | 0 | 0.1 |
| cpg0033-oasis-pilot | broad | 0.1 | 0.1 | 0 | 0.2 |
| cpg0036-EU-OS-bioactives | USC | 0.7 | 0 | 0 | 0.7 |
| cpg0036-EU-OS-bioactives | FMP | 1.3 | 0 | 0 | 1.3 |
| cpg0036-EU-OS-bioactives | IMTM | 0.7 | 0 | 0 | 0.7 |
| cpg0036-EU-OS-bioactives | MEDINA | 0.7 | 0 | 0 | 0.7 |
| cpg0037-oasis | axiom | 16.3 | 0 | 0 | 16.3 |
| jump | source_15 | 13.6 | 18.7 | 0 | 32.3 |
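A quick way to see the indexing gap described above is to flag the cpg0016-jump sources whose workspace_dl size came out as 0. This is a sketch, not from the thread; the `workspace_dl_tb` dict is copied from the table above:

```python
# Flag cpg0016-jump sources whose workspace_dl size is 0 in the index,
# which likely means the workspace_dl prefix was never indexed.
# Values copied from the table above (TB).
workspace_dl_tb = {
    "source_1": 2.1, "source_2": 3.1, "source_3": 0, "source_4": 0,
    "source_5": 0, "source_6": 0, "source_7": 3.9, "source_8": 0,
    "source_9": 0, "source_10": 0, "source_11": 0, "source_13": 0,
}

missing = sorted(s for s, tb in workspace_dl_tb.items() if tb == 0)
print(missing)
```

Only sources 1, 2, and 7 have nonzero workspace_dl sizes, consistent with the observation above.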

@ErinWeisbart (Contributor, Author)

Thanks @emiglietta!
For this issue/repo, we just need the cpg0016-jump sizes.
For other prefixes, if you see discrepancies with what's listed in the Cell Painting Gallery README, please file a PR in that repo.
For "missing" data, if it's a parsing issue, please file an issue/PR in https://github.com/broadinstitute/cpg and/or work with @leoank to fix it. If it's because of a naming/organizing/prefix issue within the Gallery, please file an issue in the CPG repo.

@emiglietta

Here's the complete table of sizes for cpg0016-jump. The workspace_dl sizes were calculated using the AWS CLI, since most of those directories were not parsed as expected when creating the index (see this issue).

| source_id | images (TB) | workspace (TB) | workspace_dl (TB) | total_size (TB) |
|---|---|---|---|---|
| source_1 | 4.9 | 6.22 | 2.4 | 13.52 |
| source_2 | 7.6 | 10.84 | 3.4 | 21.84 |
| source_3 | 16.6 | 20.62 | 3.8 | 41.02 |
| source_4 | 17.6 | 17.28 | 2.9 | 37.78 |
| source_5 | 13.1 | 32.05 | 5.5 | 50.65 |
| source_6 | 11.7 | 25.81 | 4.5 | 42.01 |
| source_7 | 5.6 | 5.39 | 4.3 | 15.29 |
| source_8 | 7.2 | 12.08 | 5.2 | 24.48 |
| source_9 | 9.2 | 17.84 | 7.9 | 34.94 |
| source_10 | 7.5 | 11.3 | 2 | 20.8 |
| source_11 | 7.6 | 10.33 | 4 | 21.93 |
| source_13 | 6.7 | 15.62 | 7.5 | 29.82 |
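As a quick arithmetic check (not part of the original comment), each total should equal images + workspace + workspace_dl; a few rows copied from the table above:

```python
# Sanity-check a few rows from the table above:
# total (TB) should equal images + workspace + workspace_dl.
rows = {
    # source_id: (images_tb, workspace_tb, workspace_dl_tb, total_tb)
    "source_1": (4.9, 6.22, 2.4, 13.52),
    "source_5": (13.1, 32.05, 5.5, 50.65),
    "source_13": (6.7, 15.62, 7.5, 29.82),
}
for src, (img, ws, dl, total) in rows.items():
    assert abs((img + ws + dl) - total) < 0.01, src
print("totals consistent")
```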

@ErinWeisbart (Contributor, Author)

Thanks Esteban! Did you write any code for generating these numbers? If so, can you add it as a comment here? (I imagine it's not gigantic?) I ask because I know we still have uploads happening to at least /workspace/segmentation, so it would be good to be able to quickly regenerate these numbers in the future once things are relatively static.

@shntnu I was thinking this table might fit in this repo in metadata as data_size.csv. What do you think?

@emiglietta

Sure! This is the code I used to get the images and workspace sizes:

```python
from pathlib import Path

import polars as pl

index_dir = Path("PATH_TO_INDEX/cpg_index")

# Download the index (Jupyter shell magic)
!aws s3 sync s3://cellpainting-gallery-inventory/cellpainting-gallery/index {index_dir} --exclude "*" --include "*.parquet"

# Lazily load the index using polars (pl)
index_files = list(index_dir.glob("*.parquet"))
index = pl.scan_parquet(index_files)

# (OPTIONAL) Print index column names and their respective data types
# print(index.schema)


### Get all sizes and totals for cpg0016-jump

# Filter out directories and null dataset/source rows; select relevant columns
filtered_df = (
    index
    .filter(pl.col("is_dir") == False)
    .filter(pl.col("dataset_id").is_not_null() & pl.col("source_id").is_not_null())
    # .filter(pl.col("dataset_id").eq("cpg0016-jump"))
    .select(["dataset_id", "source_id", "images", "workspace", "workspace_dl", "size"])
)

# Sum sizes per dataset/source, split by prefix type
summarized_df = (
    filtered_df
    .group_by(["dataset_id", "source_id"])
    .agg([
        pl.col("size").filter(pl.col("images").str.contains("images")).sum().alias("images_bytes"),
        pl.col("size").filter(pl.col("workspace").str.contains("workspace")).sum().alias("workspace_bytes"),
        pl.col("size").filter(pl.col("workspace_dl").str.contains("workspace_dl")).sum().alias("workspace_dl_bytes"),
        pl.col("size").sum().alias("total_size_bytes"),
    ])
)

# Convert bytes to TB
summarized_df = summarized_df.with_columns([
    (pl.col("images_bytes") / pl.lit(1024**4)).round(2).alias("images_tb"),
    (pl.col("workspace_bytes") / pl.lit(1024**4)).round(2).alias("workspace_tb"),
    (pl.col("workspace_dl_bytes") / pl.lit(1024**4)).round(2).alias("workspace_dl_tb"),
    (pl.col("total_size_bytes") / pl.lit(1024**4)).round(2).alias("total_size_tb"),
])

# Drop the intermediate byte columns
summarized_df = summarized_df.drop(["images_bytes", "workspace_bytes", "workspace_dl_bytes", "total_size_bytes"])

# Sort by dataset, then numerically by source (so source_10 sorts after source_2, not before)
summarized_df = summarized_df.sort(by=[
    "dataset_id",
    pl.col("source_id").str.extract(r"\d+", 0).cast(pl.Int64),
])

(
    summarized_df
    .collect()
    # .write_csv(OUTPUT_DIR)  # OPTIONAL: write output to CSV
)
```

Since there was an issue with the indexing of workspace_dl, I used good old `aws s3 ls` to get those numbers:

```shell
!parallel "echo source_{} ; aws s3 ls s3://cellpainting-gallery/cpg0016-jump/source_{}/workspace_dl/ --summarize --human-readable --recursive | grep Total" ::: 1 2 3 4 5 6 7 8 9 10 11 13
```
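The one-liner above greps the CLI summary lines; if you want the numbers back in Python, a small parser over that output works too. This is a sketch: `parse_s3_summary` is hypothetical, and the `Total Objects:` / `Total Size:` line format is an assumption based on typical `aws s3 ls --summarize` output, so adjust the patterns if your CLI version prints differently.

```python
import re


def parse_s3_summary(text: str) -> dict:
    """Extract object count and size from `aws s3 ls --summarize` output.

    Assumes summary lines like 'Total Objects: N' and 'Total Size: X UNIT'
    (the latter when --human-readable is used).
    """
    objects = re.search(r"Total Objects:\s*(\d+)", text)
    size = re.search(r"Total Size:\s*([\d.]+)\s*(\S+)", text)
    return {
        "objects": int(objects.group(1)) if objects else None,
        "size": float(size.group(1)) if size else None,
        "unit": size.group(2) if size else None,
    }


# Example with a made-up summary snippet:
sample = "Total Objects: 123456\n   Total Size: 2.4 TiB"
print(parse_s3_summary(sample))
```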

@shntnu (Contributor) commented Jan 7, 2025

> was thinking this table might fit in this repo in metadata as data_size.csv. What do you think?

https://github.com/jump-cellpainting/datasets/tree/main/stats would be more appropriate

@ErinWeisbart (Contributor, Author)

even better! thanks @shntnu

@emiglietta would you file a PR adding data_size.csv in the folder Shantanu linked above?
