CEPH Round 02 #31

Merged: 5 commits, Oct 31, 2024
Conversation

@paulocv (Contributor) commented Oct 28, 2024

Here is our submission for the SMH round 2, including minor changes to the metadata.
Feel free to use the comments to let us know about any issues with the files.

@LucieContamin (Contributor)

Hi @paulocv ,

Thank you for the submission. It looks like there is an issue with your submission file: when I try to load it, the arrow package returns an error:
Error: Invalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

Would it be possible to verify it, please?

Please let me know if you have any questions or issues.
Best, Lucie

@paulocv (Contributor, Author) commented Oct 28, 2024

Hi @LucieContamin

Thank you for checking. Have you tried to decompress the file first? Since the flat .parquet file exceeds the 25 MB limit imposed by GitHub, and as suggested in the Data Submission Instructions, I submitted a gzip-compressed file.

I was able to reproduce your error message in pyarrow by reading the file directly without decompressing it, but the decompressed file produced by running (on a Unix OS):

gzip -dk 2024-07-28-CEPH-MetaRSV.parquet.gz

opens without problems.
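
For reference, a rough Python equivalent of that workflow (decompressing the archive first, then reading the result with pyarrow; the intermediate file name is just illustrative) would be:

import gzip
import shutil
import pyarrow.parquet as pq

# decompress the gzip archive (same effect as `gzip -dk`)
with gzip.open("2024-07-28-CEPH-MetaRSV.parquet.gz", "rb") as f_in:
    with open("2024-07-28-CEPH-MetaRSV.parquet", "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)

# the decompressed file then opens without errors
table = pq.read_table("2024-07-28-CEPH-MetaRSV.parquet")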

@LucieContamin (Contributor)

Hi @paulocv ,

Thank you for the answer and the information.
To add to your comment: you should be able to bypass the 25 MB limit by using git/GitHub from the command line; in that case the hard limit is 100 MB. I am happy to help if necessary.

For the compressed file, would it be possible to compress it with pyarrow so I can open it with arrow, please? Arrow can normally open compressed files (https://arrow.apache.org/docs/python/parquet.html#compression-encoding-and-file-compatibility).
For example, in Python:

import pyarrow.parquet as pq

# `table` is a pyarrow.Table and `where` is the output file path
pq.write_table(table, where, compression='gzip')

The loading process for all the data uses the arrow packages, so it's important that the file can be read with arrow. Does that make sense?
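
As a quick sanity check (a minimal sketch, with an illustrative file name): a parquet file written with the internal gzip codec should then open directly with pyarrow, without any external decompression step.

import pyarrow.parquet as pq

# parquet files using the internal gzip codec open directly
table = pq.read_table("2024-07-28-CEPH-MetaRSV.gz.parquet")
print(table.num_rows, table.schema.names)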

Please let me know if you have any questions or issues.
Best, Lucie

@paulocv (Contributor, Author) commented Oct 29, 2024

Hi @LucieContamin

Thank you for the instructions. I was able to create the .gz.parquet file directly from pyarrow, as you instructed.

I will clone the repo and submit using git, which I hadn't done so far to avoid cloning the entire repo to my computer.

@LucieContamin (Contributor) commented Oct 29, 2024

Thank you for the update,
I totally understand the problem with cloning the full repository; maybe a sparse checkout and/or cloning without the history might help reduce the size of the clone?
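
For example, something along these lines might work (the repository URL and folder name here are placeholders, not the actual hub paths):

# shallow, blob-less clone that only checks out the folders you ask for
git clone --depth 1 --filter=blob:none --sparse https://github.com/<org>/<hub-repo>.git
cd <hub-repo>
git sparse-checkout set <your-model-folder>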

Best, Lucie


Run validation on files: 2024-07-28-CEPH-MetaRSV.gz.parquet

Required values:

No missing required value found

Columns:

❌ Error 102: The data frame should contains 9 columns, not 10. Please verify if one or multiple columns have been added.

Scenarios:

No errors or warnings found on scenario name and scenario id columns

Origin Date Column:

No errors or warnings found on the column 'origin_date'

Value and Type Columns:

🟡 Warning 5043: All values associated with output type 'sample' should have a maximum of 1 decimal place

Target Columns:

🟡 Warning 602: No value found associated with the targets: inc hosp (optional), cum hosp (optional), inc inf (optional), cum inf (optional); output_type: quantile.
🟡 Warning 602: No value found associated with the targets: peak size hosp (optional); output_type: quantile.
🟡 Warning 602: No value found associated with the targets: peak time hosp (optional); output_type: cdf.

Locations:

No errors or warnings found on Location

Sample:

Column Pairing information:

Run grouping pairing:
No run grouping
Stochastic run pairing:
c("scenario_id", "horizon", "location", "age_group", "index_level_0")
Number of Samples: 300

Quantiles:

No errors or warnings found on quantiles values and format

Age Group:

No errors or warnings found on Age_group


Run validation on files: 2024-07-28-CEPH-MetaRSV.gz.parquet

Required values:

No missing required value found

Columns:

No errors or warnings found on the column names and numbers

Scenarios:

No errors or warnings found on scenario name and scenario id columns

Origin Date Column:

No errors or warnings found on the column 'origin_date'

Value and Type Columns:

🟡 Warning 5043: All values associated with output type 'sample' should have a maximum of 1 decimal place

Target Columns:

🟡 Warning 602: No value found associated with the targets: inc hosp (optional), cum hosp (optional), inc inf (optional), cum inf (optional); output_type: quantile.
🟡 Warning 602: No value found associated with the targets: peak size hosp (optional); output_type: quantile.
🟡 Warning 602: No value found associated with the targets: peak time hosp (optional); output_type: cdf.

Locations:

No errors or warnings found on Location

Sample:

Column Pairing information:

Run grouping pairing:
No run grouping
Stochastic run pairing:
c("scenario_id", "horizon", "location", "age_group")
Number of Samples: 300

Quantiles:

No errors or warnings found on quantiles values and format

Age Group:

No errors or warnings found on Age_group

@@ -11,7 +11,7 @@ model_contributors: [
  {
    "name": "Shreeya Mhade",
    "affiliation": "Indiana University Bloomington",
-   "email": "[email protected]"
+   "email": "[email protected]",

Suggested change
-   "email": "[email protected]",
+   "email": "[email protected]"

The additional comma is not necessary here; please feel free to remove it.

@LucieContamin (Contributor) left a comment

Thanks for the submission. I just made a small comment on the metadata file, but it does not block the PR.

Please feel free to let me know if you have any questions or issues.
Best, Lucie

@paulocv (Contributor, Author) commented Oct 31, 2024

Thank you @LucieContamin. I removed the trailing commas from the YAML elements.

@LucieContamin LucieContamin merged commit 0226a59 into midas-network:main Oct 31, 2024
1 check passed