CEPH Round 02 #31

Merged: 5 commits, Oct 31, 2024
Conversation

@paulocv (Contributor) commented Oct 28, 2024

Here is our submission for the SMH round 2, including minor changes to the metadata.
Feel free to use the comments to let us know about any issues with the files.

@LucieContamin (Contributor)

Hi @paulocv ,

Thank you for the submission. It looks like there is an issue with your submission file: when I try to load it, the arrow package returns an error:
Error: Invalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

Would it be possible to verify it, please?

Please let me know if you have any questions or issues.
Best, Lucie

@paulocv (Contributor, Author) commented Oct 28, 2024

Hi @LucieContamin

Thank you for checking. Have you tried to decompress the file first? Since the flat .parquet file exceeds the 25 MB limit imposed by GitHub, and as suggested in the Data Submission Instructions, I submitted a gzip-compressed file.

I was able to reproduce your error message in pyarrow by reading the file directly without decompressing it, but the decompressed file produced by running (on a Unix OS):

gzip -dk 2024-07-28-CEPH-MetaRSV.parquet.gz

opens without problems.
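
For reference, a rough Python equivalent of that workflow (decompressing the archive first, then reading the result with pyarrow; the intermediate file name is just illustrative) would be:

import gzip
import shutil
import pyarrow.parquet as pq

# decompress the gzip archive (same effect as `gzip -dk`)
with gzip.open("2024-07-28-CEPH-MetaRSV.parquet.gz", "rb") as f_in:
    with open("2024-07-28-CEPH-MetaRSV.parquet", "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)

# the decompressed file then opens without errors
table = pq.read_table("2024-07-28-CEPH-MetaRSV.parquet")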

@LucieContamin (Contributor)

Hi @paulocv ,

Thank you for the answer and the information.
To add to your comment: you should be able to bypass the 25 MB limit by using git/GitHub from the command line; in that case the hard limit is 100 MB. I am happy to help if necessary.

For the compressed file, would it be possible to compress it with pyarrow so I can open it with arrow, please? Arrow can normally open compressed files (https://arrow.apache.org/docs/python/parquet.html#compression-encoding-and-file-compatibility).
For example, in Python:

import pyarrow.parquet as pq

# `table` is a pyarrow.Table and `where` is the output file path
pq.write_table(table, where, compression='gzip')

The loading process for all the data uses the arrow packages, so it's important that the file can be read with arrow. Does that make sense?
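
As a quick sanity check (a minimal sketch, with an illustrative file name): a parquet file written with the internal gzip codec should then open directly with pyarrow, without any external decompression step.

import pyarrow.parquet as pq

# parquet files using the internal gzip codec open directly
table = pq.read_table("2024-07-28-CEPH-MetaRSV.gz.parquet")
print(table.num_rows, table.schema.names)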

Please let me know if you have any questions or issues.
Best, Lucie

@paulocv (Contributor, Author) commented Oct 29, 2024

Hi @LucieContamin

Thank you for the instructions. I was able to create the .gz.parquet file directly from pyarrow, as you instructed.

I will clone the repo and submit using git, which I hadn't done so far to avoid cloning the entire repo to my computer.

@LucieContamin (Contributor) commented Oct 29, 2024

Thank you for the update,
I totally understand the problem with cloning the full repository; maybe a sparse checkout and/or cloning without the history might help reduce the size of the clone?
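
For example, something along these lines might work (the repository URL and folder name here are placeholders, not the actual hub paths):

# shallow, blob-less clone that only checks out the folders you ask for
git clone --depth 1 --filter=blob:none --sparse https://github.com/<org>/<hub-repo>.git
cd <hub-repo>
git sparse-checkout set <your-model-folder>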

Best, Lucie


Run validation on files: 2024-07-28-CEPH-MetaRSV.gz.parquet

Required values:

No missing required value found

Columns:

❌ Error 102: The data frame should contains 9 columns, not 10. Please verify if one or multiple columns have been added.

Scenarios:

No errors or warnings found on scenario name and scenario id columns

Origin Date Column:

No errors or warnings found on the column 'origin_date'

Value and Type Columns:

🟡 Warning 5043: All values associated with output type 'sample' should have a maximum of 1 decimal place

Target Columns:

🟡 Warning 602: No value found associated with the targets: inc hosp (optional), cum hosp (optional), inc inf (optional), cum inf (optional); output_type: quantile.
🟡 Warning 602: No value found associated with the targets: peak size hosp (optional); output_type: quantile.
🟡 Warning 602: No value found associated with the targets: peak time hosp (optional); output_type: cdf.

Locations:

No errors or warnings found on Location

Sample:

Column Pairing information:

Run grouping pairing:
No run grouping
Stochastic run pairing:
c("scenario_id", "horizon", "location", "age_group", "index_level_0")
Number of Samples: 300

Quantiles:

No errors or warnings found on quantiles values and format

Age Group:

No errors or warnings found on Age_group


Run validation on files: 2024-07-28-CEPH-MetaRSV.gz.parquet

Required values:

No missing required value found

Columns:

No errors or warnings found on the column names and numbers

Scenarios:

No errors or warnings found on scenario name and scenario id columns

Origin Date Column:

No errors or warnings found on the column 'origin_date'

Value and Type Columns:

🟡 Warning 5043: All values associated with output type 'sample' should have a maximum of 1 decimal place

Target Columns:

🟡 Warning 602: No value found associated with the targets: inc hosp (optional), cum hosp (optional), inc inf (optional), cum inf (optional); output_type: quantile.
🟡 Warning 602: No value found associated with the targets: peak size hosp (optional); output_type: quantile.
🟡 Warning 602: No value found associated with the targets: peak time hosp (optional); output_type: cdf.

Locations:

No errors or warnings found on Location

Sample:

Column Pairing information:

Run grouping pairing:
No run grouping
Stochastic run pairing:
c("scenario_id", "horizon", "location", "age_group")
Number of Samples: 300

Quantiles:

No errors or warnings found on quantiles values and format

Age Group:

No errors or warnings found on Age_group

@@ -11,7 +11,7 @@ model_contributors: [
  {
    "name": "Shreeya Mhade",
    "affiliation": "Indiana University Bloomington",
-   "email": "[email protected]"
+   "email": "[email protected]",

Suggested change
-   "email": "[email protected]",
+   "email": "[email protected]"

The additional comma is not necessary here; please feel free to remove it.

@LucieContamin (Contributor) left a comment

Thanks for the submission. I just made a small comment on the metadata file, but it does not block the PR.

Please feel free to let me know if you have any questions or issues.
Best, Lucie

@paulocv (Contributor, Author) commented Oct 31, 2024

Thank you @LucieContamin. I removed the trailing commas from the YAML elements.

@LucieContamin LucieContamin merged commit 0226a59 into midas-network:main Oct 31, 2024
1 check passed