Provide prefix sizes #81

Open

ErinWeisbart opened this issue Sep 7, 2023 · 9 comments
@ErinWeisbart (Contributor)

I think it would be helpful to provide a breakdown of data sizes by source and by data type (numerical vs. image), so people have an idea of what they're getting into before downloading, without having to list the bucket themselves.

I'm not sure how much is still in flux, but our dashboard auto-calculated these prefix sizes (current as of right now). I'm happy to flesh them out or update them.

| source | images size (TB) | workspace size (TB) | workspace_dl size (TB) | total size (TB) |
|---|---|---|---|---|
| 1 | | | | 13.2 |
| 2 | 7.6 | 10.8 | | 21.6 |
| 3 | 16.6 | 20.6 | | 42.5 |
| 4 | 17.6 | 17.3 | | 39.1 |
| 5 | 13.1 | 32 | 7.4 | |
| 6 | 11.7 | 25.8 | | 43.7 |
| 7 | | | | 14.9 |
| 8 | 7.2 | 12.1 | | 24.4 |
| 9 | 9.2 | 17.8 | 7.1 | |
| 10 | 7.5 | 11.3 | | 21.6 |
| 11 | | 10.3 | | 21.6 |
| 13 | | 15.8 | 6.8 | |
@ErinWeisbart (Contributor, Author)

(This is what's in cellpainting-gallery/cpg00016-jump)
I'm planning on providing the total size in the cellpainting-gallery README but I think a by-source breakdown belongs in this repo.

@shntnu added the cpg0016 label Dec 8, 2023
@ErinWeisbart (Contributor, Author)

FYI when you're ready to add this to a new data release, these can now be quickly and easily calculated with https://github.com/broadinstitute/cpg

@emiglietta

I'm working on this (as per minigrant #8).

On a first run using the CPG index I got this table (all sizes are in TB). I can tell there are several errors, especially with workspace and workspace_dl, but it seems to be an issue with the index itself: for example, only cpg0016-jump sources 1, 2, and 7 appear to have their workspace_dl indexed.

I'll look into this and ask Ank, but let me know if this is looking like you were envisioning! @ErinWeisbart :)
Thanks!

| dataset_id | source_id | images_tb | workspace_tb | workspace_dl_tb | total_size_tb |
|---|---|---|---|---|---|
| cpg0000-jump-pilot | source_4 | 6 | 6.1 | 0 | 12.2 |
| cpg0001-cellpainting-protocol | source_4 | 18 | 21.5 | 0 | 39.4 |
| cpg0002-jump-scope | source_4 | 0.9 | 4.2 | 0 | 5.1 |
| cpg0004-lincs | broad | 22 | 1.6 | 0 | 23.5 |
| cpg0005-gerry-bioactivity | broad | 0.3 | 0 | 0 | 0.3 |
| cpg0006-miami | broad | 0.6 | 0.8 | 0 | 1.4 |
| cpg0009-molglue | broad | 0.3 | 0 | 0 | 0.3 |
| cpg0010-caie-drugresponse | broad-az | 0.1 | 0 | 0 | 0.1 |
| cpg0011-lipocyteprofiler | broad | 1.2 | 0 | 0 | 1.2 |
| cpg0012-wawer-bioactivecompoundprofiling | broad | 3 | 6.9 | 0 | 9.9 |
| cpg0014-jump-adipocyte | broad | 7.7 | 6.9 | 0 | 14.5 |
| cpg0016-jump | source_1 | 4.9 | 6.2 | 2.1 | 13.2 |
| cpg0016-jump | source_2 | 7.6 | 10.8 | 3.1 | 21.6 |
| cpg0016-jump | source_3 | 16.6 | 20.6 | 0 | 37.2 |
| cpg0016-jump | source_4 | 17.6 | 17.3 | 0 | 34.9 |
| cpg0016-jump | source_5 | 13.1 | 32 | 0 | 45.1 |
| cpg0016-jump | source_6 | 11.7 | 25.8 | 0 | 37.5 |
| cpg0016-jump | source_7 | 5.6 | 5.4 | 3.9 | 14.9 |
| cpg0016-jump | source_8 | 7.2 | 12.1 | 0 | 19.3 |
| cpg0016-jump | source_9 | 9.2 | 17.8 | 0 | 27 |
| cpg0016-jump | source_10 | 7.5 | 11.3 | 0 | 18.8 |
| cpg0016-jump | source_11 | 7.6 | 10.3 | 0 | 17.9 |
| cpg0016-jump | source_13 | 6.7 | 15.6 | 0 | 22.3 |
| cpg0017-rohban-pathways | broad | 0.2 | 0.1 | 0 | 0.3 |
| cpg0018-singh-seedseq | broad | 0.2 | 0 | 0 | 0.2 |
| cpg0018-singh-seedseq | x | 0 | 0 | 0 | 0 |
| cpg0019-moshkov-deepprofiler | broad | 0 | 0 | 0 | 0 |
| cpg0020-varchamp | broad | 3 | 0.8 | 0 | 3.8 |
| cpg0021-periscope | broad | 15.6 | 0.6 | 0 | 16.2 |
| cpg0022-cmqtl | broad | 2 | 0.8 | 0 | 2.8 |
| cpg0024-bortezomib | source_4 | 0.2 | 0.1 | 0 | 0.3 |
| cpg0025-dactyloscopy | broad | 3.1 | 0 | 0 | 3.1 |
| cpg0026-lacoste_haghighi-rare-diseases | broad | 2.8 | 0.6 | 0 | 3.4 |
| cpg0028-kelley-resistance | broad | 1.9 | 2.2 | 0 | 4.1 |
| cpg0029-chroma-pilot | broad | 0.4 | 0 | 0 | 0.4 |
| cpg0030-gustafsdottir-cellpainting | broad | 0.2 | 0 | 0 | 0.2 |
| cpg0031-caicedo-cmvip | broad | 0.6 | 1.4 | 0 | 2 |
| cpg0032-pooled-rare | broad | 0.1 | 0 | 0 | 0.1 |
| cpg0033-oasis-pilot | broad | 0.1 | 0.1 | 0 | 0.2 |
| cpg0036-EU-OS-bioactives | USC | 0.7 | 0 | 0 | 0.7 |
| cpg0036-EU-OS-bioactives | FMP | 1.3 | 0 | 0 | 1.3 |
| cpg0036-EU-OS-bioactives | IMTM | 0.7 | 0 | 0 | 0.7 |
| cpg0036-EU-OS-bioactives | MEDINA | 0.7 | 0 | 0 | 0.7 |
| cpg0037-oasis | axiom | 16.3 | 0 | 0 | 16.3 |
| jump | source_15 | 13.6 | 18.7 | 0 | 32.3 |
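A quick way to see the indexing gap described above is to flag the cpg0016-jump sources whose workspace_dl size came out as 0. This is a sketch, not from the thread; the `workspace_dl_tb` dict is copied from the table above:

```python
# Flag cpg0016-jump sources whose workspace_dl size is 0 in the index,
# which likely means the workspace_dl prefix was never indexed.
# Values copied from the table above (TB).
workspace_dl_tb = {
    "source_1": 2.1, "source_2": 3.1, "source_3": 0, "source_4": 0,
    "source_5": 0, "source_6": 0, "source_7": 3.9, "source_8": 0,
    "source_9": 0, "source_10": 0, "source_11": 0, "source_13": 0,
}

missing = sorted(s for s, tb in workspace_dl_tb.items() if tb == 0)
print(missing)
```

Only sources 1, 2, and 7 have nonzero workspace_dl sizes, consistent with the observation above.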

@ErinWeisbart (Contributor, Author)

Thanks @emiglietta!
For this issue/repo, we just need the cpg0016-jump sizes.
For other prefixes, if you see discrepancies with what's listed in the Cell Painting Gallery README, please file a PR in that repo.
For "missing" data, if it's a parsing issue, please file an issue/PR in https://github.com/broadinstitute/cpg and/or work with @leoank to fix it. If it's because of a naming/organizing/prefix issue within the Gallery, please file an issue in the CPG repo.

@emiglietta

Here's the complete table of sizes for cpg0016-jump. The workspace_dl sizes were calculated using the AWS CLI, since most of those directories were not parsed as expected when creating the index (see this issue).

| source_id | images (TB) | workspace (TB) | workspace_dl (TB) | total_size (TB) |
|---|---|---|---|---|
| source_1 | 4.9 | 6.22 | 2.4 | 13.52 |
| source_2 | 7.6 | 10.84 | 3.4 | 21.84 |
| source_3 | 16.6 | 20.62 | 3.8 | 41.02 |
| source_4 | 17.6 | 17.28 | 2.9 | 37.78 |
| source_5 | 13.1 | 32.05 | 5.5 | 50.65 |
| source_6 | 11.7 | 25.81 | 4.5 | 42.01 |
| source_7 | 5.6 | 5.39 | 4.3 | 15.29 |
| source_8 | 7.2 | 12.08 | 5.2 | 24.48 |
| source_9 | 9.2 | 17.84 | 7.9 | 34.94 |
| source_10 | 7.5 | 11.3 | 2 | 20.8 |
| source_11 | 7.6 | 10.33 | 4 | 21.93 |
| source_13 | 6.7 | 15.62 | 7.5 | 29.82 |
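As a quick arithmetic check (not part of the original comment), each total should equal images + workspace + workspace_dl; a few rows copied from the table above:

```python
# Sanity-check a few rows from the table above:
# total (TB) should equal images + workspace + workspace_dl.
rows = {
    # source_id: (images_tb, workspace_tb, workspace_dl_tb, total_tb)
    "source_1": (4.9, 6.22, 2.4, 13.52),
    "source_5": (13.1, 32.05, 5.5, 50.65),
    "source_13": (6.7, 15.62, 7.5, 29.82),
}
for src, (img, ws, dl, total) in rows.items():
    assert abs((img + ws + dl) - total) < 0.01, src
print("totals consistent")
```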

@ErinWeisbart (Contributor, Author)

Thanks Esteban! Did you write any code for generating these numbers? If so, can you add it as a comment here? (I imagine it's not gigantic?) I ask because I know we still have uploads happening to at least /workspace/segmentation, so it would be good to be able to quickly regenerate these numbers in the future once things are relatively static.

@shntnu I was thinking this table might fit in this repo in metadata as data_size.csv. What do you think?

@emiglietta

Sure! This is the code I used to get the images and workspace sizes:

```python
from pathlib import Path

import polars as pl

index_dir = Path("PATH_TO_INDEX/cpg_index")

# Download the index (Jupyter shell magic)
!aws s3 sync s3://cellpainting-gallery-inventory/cellpainting-gallery/index {index_dir} --exclude "*" --include "*.parquet"

# Lazily load the index using polars (pl)
index_files = list(index_dir.glob("*.parquet"))
index = pl.scan_parquet(index_files)

# (OPTIONAL) Print index column names and their respective data types
# print(index.schema)


### Get all sizes and totals for cpg0016-jump

# Filter out directories and null dataset/source rows; select relevant columns
filtered_df = (
    index
    .filter(pl.col("is_dir") == False)
    .filter(pl.col("dataset_id").is_not_null() & pl.col("source_id").is_not_null())
    # .filter(pl.col("dataset_id").eq("cpg0016-jump"))
    .select(["dataset_id", "source_id", "images", "workspace", "workspace_dl", "size"])
)

# Sum sizes per dataset/source, split by prefix type
summarized_df = (
    filtered_df
    .group_by(["dataset_id", "source_id"])
    .agg([
        pl.col("size").filter(pl.col("images").str.contains("images")).sum().alias("images_bytes"),
        pl.col("size").filter(pl.col("workspace").str.contains("workspace")).sum().alias("workspace_bytes"),
        pl.col("size").filter(pl.col("workspace_dl").str.contains("workspace_dl")).sum().alias("workspace_dl_bytes"),
        pl.col("size").sum().alias("total_size_bytes"),
    ])
)

# Convert bytes to TB
summarized_df = summarized_df.with_columns([
    (pl.col("images_bytes") / pl.lit(1024**4)).round(2).alias("images_tb"),
    (pl.col("workspace_bytes") / pl.lit(1024**4)).round(2).alias("workspace_tb"),
    (pl.col("workspace_dl_bytes") / pl.lit(1024**4)).round(2).alias("workspace_dl_tb"),
    (pl.col("total_size_bytes") / pl.lit(1024**4)).round(2).alias("total_size_tb"),
])

# Drop the intermediate byte columns
summarized_df = summarized_df.drop(["images_bytes", "workspace_bytes", "workspace_dl_bytes", "total_size_bytes"])

# Sort by dataset, then numerically by source (so source_10 sorts after source_2, not before)
summarized_df = summarized_df.sort(by=[
    "dataset_id",
    pl.col("source_id").str.extract(r"\d+", 0).cast(pl.Int64),
])

(
    summarized_df
    .collect()
    # .write_csv(OUTPUT_DIR)  # OPTIONAL: write output to CSV
)
```

Since there was an issue with the indexing of workspace_dl, I used good old `aws s3 ls` to get those numbers:

```shell
!parallel "echo source_{} ; aws s3 ls s3://cellpainting-gallery/cpg0016-jump/source_{}/workspace_dl/ --summarize --human-readable --recursive | grep Total" ::: 1 2 3 4 5 6 7 8 9 10 11 13
```
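The one-liner above greps the CLI summary lines; if you want the numbers back in Python, a small parser over that output works too. This is a sketch: `parse_s3_summary` is hypothetical, and the `Total Objects:` / `Total Size:` line format is an assumption based on typical `aws s3 ls --summarize` output, so adjust the patterns if your CLI version prints differently.

```python
import re


def parse_s3_summary(text: str) -> dict:
    """Extract object count and size from `aws s3 ls --summarize` output.

    Assumes summary lines like 'Total Objects: N' and 'Total Size: X UNIT'
    (the latter when --human-readable is used).
    """
    objects = re.search(r"Total Objects:\s*(\d+)", text)
    size = re.search(r"Total Size:\s*([\d.]+)\s*(\S+)", text)
    return {
        "objects": int(objects.group(1)) if objects else None,
        "size": float(size.group(1)) if size else None,
        "unit": size.group(2) if size else None,
    }


# Example with a made-up summary snippet:
sample = "Total Objects: 123456\n   Total Size: 2.4 TiB"
print(parse_s3_summary(sample))
```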

@shntnu (Contributor) commented Jan 7, 2025

> was thinking this table might fit in this repo in metadata as data_size.csv. What do you think?

https://github.com/jump-cellpainting/datasets/tree/main/stats would be more appropriate

@ErinWeisbart (Contributor, Author)

even better! thanks @shntnu

@emiglietta would you file a PR adding data_size.csv in the folder Shantanu linked above?
