Surface Nextclade version + Nextclade dataset version in final metadata output #458
Comments
One specific example of the data we want to communicate would be:
Which would allow users to install the specific Nextclade software (e.g.,
One specific implementation of that could be a JSON file named like
{
"nextclade_version": "3.8.0",
"nextclade_dataset_name": "sars-cov-2",
"nextclade_dataset_version": "2024-07-03--08-29-55Z"
}
Another implementation could be storing that information in the file name of the metadata like
Another approach would be to nest the metadata in a directory structure with the information like
One nice aspect of using an additional details file is that the metadata URI and the details URI would be stable. However, decoupling the metadata from the details could make the two files inconsistent for some period of time during updates.
Encoding the information in the filename nicely couples the metadata contents with its version details. The format is more ambiguous to parse, but it isn't complicated once you know what the underscore-delimited fields are.
Storing version details of the cache in a separate file makes sense; it makes it easy to determine quickly whether the cache should automatically be invalidated. Regarding putting version info in metadata:
Embedding in the path makes sense as well if we want to avoid adding 2-3 columns that are essentially always identical, though the overall overhead is small given the size of our existing rows and their compressibility.
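The cache invalidation check mentioned above could look roughly like the following. This is only a sketch: it assumes the details file uses the three keys from the JSON example earlier in this thread, and the function name `cache_is_stale` is hypothetical, not something that exists in the workflow.

```python
import json

# Keys taken from the JSON example earlier in this thread.
VERSION_KEYS = (
    "nextclade_version",
    "nextclade_dataset_name",
    "nextclade_dataset_version",
)


def cache_is_stale(cached_details: dict, current_details: dict) -> bool:
    """Return True if any version field differs between the cached and current details."""
    return any(cached_details.get(key) != current_details.get(key) for key in VERSION_KEYS)


cached = json.loads(
    '{"nextclade_version": "3.8.0",'
    ' "nextclade_dataset_name": "sars-cov-2",'
    ' "nextclade_dataset_version": "2024-07-03--08-29-55Z"}'
)
# Hypothetical newer run: same dataset, different Nextclade version string.
current = dict(cached, nextclade_version="3.8.1")

print(cache_is_stale(cached, cached))   # identical details
print(cache_is_stale(cached, current))  # Nextclade version changed
```

A missing key on either side is treated as a mismatch unless it is missing on both, which errs toward re-running Nextclade rather than trusting an incomplete details file.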
I shared the three main options discussed in this issue with Evan Ray and folks from the forecasting hub group as follows:
Evan replied that:
Re: option 2, can we include a hash of the metadata file in the meta-metadata JSON, so one can ensure the files are matched up?
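A minimal sketch of what that check could look like on the consumer side, assuming the details JSON carries the digest under a key named `metadata_sha256` (a hypothetical name, nothing in this thread fixes it) and that the hash is a SHA-256 of the metadata file bytes as downloaded:

```python
import hashlib
import json
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20) -> str:
    """Hash a file in chunks so large metadata files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_match(metadata_path, details_path) -> bool:
    """Compare the digest recorded in the details JSON with the metadata file's digest."""
    details = json.loads(Path(details_path).read_text())
    # "metadata_sha256" is an assumed key name for illustration only.
    return details.get("metadata_sha256") == sha256_of_file(metadata_path)
```

If the two files were fetched mid-update and do not match, `files_match` returns False and the caller can retry the download.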
If we don't want to track it in a separate file, the other option is to add it to the AWS S3 user defined object metadata. We already use this to store the sha256sum in upload-to-s3, which can be accessed with
To slightly complicate things, we are technically running Nextclade twice with different Nextclade datasets. We use the clade + QC metrics from the general SARS-CoV-2 dataset, but we spike in
This is probably not a huge issue since we only care about the clade assignments.
{
"nextclade_version": "3.8.0",
"nextclade_datasets": [
{
"name": "sars-cov-2",
"version": "2024-07-03--08-29-55Z",
"columns": [
"clade_nextstrain",
"clade_who",
"Nextclade_pango",
"missing_data",
"divergence",
"nonACGTN",
"coverage",
"rare_mutations",
"reversion_mutations",
"potential_contaminants",
"QC_missing_data",
"QC_mixed_sites",
"QC_rare_mutations",
"QC_snp_clusters",
"QC_frame_shifts",
"QC_stop_codons",
"QC_overall_score",
"QC_overall_status",
"frame_shifts",
"deletions",
"insertions",
"substitutions",
"aaSubstitutions"
]
},
{
"name": "sars-cov-2-21L",
"version": "2024-07-03--08-29-55Z",
"columns": [
"immune_escape",
"ace2_binding"
]
}
]
}
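With a nested layout like the one above, a consumer could work out which dataset produced a given column. The sketch below assumes exactly the key names shown in that JSON (`nextclade_datasets`, `name`, `version`, `columns`); the helper name `columns_to_dataset` is hypothetical, and the column lists are shortened for brevity.

```python
def columns_to_dataset(details: dict) -> dict:
    """Map each metadata column to the (name, version) of the dataset that produced it."""
    lookup = {}
    for dataset in details["nextclade_datasets"]:
        for column in dataset["columns"]:
            lookup[column] = (dataset["name"], dataset["version"])
    return lookup


# Abbreviated copy of the example above.
details = {
    "nextclade_version": "3.8.0",
    "nextclade_datasets": [
        {
            "name": "sars-cov-2",
            "version": "2024-07-03--08-29-55Z",
            "columns": ["clade_nextstrain", "QC_overall_score"],
        },
        {
            "name": "sars-cov-2-21L",
            "version": "2024-07-03--08-29-55Z",
            "columns": ["immune_escape", "ace2_binding"],
        },
    ],
}

lookup = columns_to_dataset(details)
print(lookup["immune_escape"])
```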
@joverlee521 Good call. It would be slick and probably helpful in the long term to know which columns came from which Nextclade dataset. Like you mentioned, we only care about the clade assignments in the short term. If we stick with the simpler JSON format example above, your example here shows the need for a way to migrate the JSON schema over time. So maybe we at least need a schema version in the simple JSON like this?
{
"json_schema_version": "v1",
"nextclade_version": "3.8.0",
"nextclade_dataset_name": "sars-cov-2",
"nextclade_dataset_version": "2024-07-03--08-29-55Z"
}
You technically only need a schema for version 2, as version 1 can be defined implicitly as the schema with no explicit version.
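To illustrate the implicit-version idea, here is a rough sketch of a reader that treats a missing version field as version 1. The key `json_schema_version` comes from the example earlier in this thread; the "v2" branch assumes the nested `nextclade_datasets` layout proposed above would become version 2, which nobody has decided, and `primary_dataset` is a hypothetical helper name.

```python
def schema_version(details: dict) -> str:
    """Files without an explicit version are treated as version 1."""
    return details.get("json_schema_version", "v1")


def primary_dataset(details: dict) -> tuple:
    """Return (name, version) of the main Nextclade dataset for either layout."""
    version = schema_version(details)
    if version == "v1":
        # Flat layout from the simple JSON example.
        return details["nextclade_dataset_name"], details["nextclade_dataset_version"]
    if version == "v2":
        # Assumed nested layout; the first entry is taken as the primary dataset.
        first = details["nextclade_datasets"][0]
        return first["name"], first["version"]
    raise ValueError(f"Unsupported schema version: {version!r}")
```

Raising on unknown versions means an old client fails loudly on a future format instead of silently misreading it.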
+1 for explicit version numbers 🙌 maybe consider calling it
The public metadata version file is available at https://data.nextstrain.org/files/ncov/open/metadata_version.json
Prompted by discussion in blab/forecasting project
Naively, we could include the Nextclade version and Nextclade dataset version in the join-metadata-and-clades.
However, if #457 is implemented, then the metadata comes from a single Nextclade version/dataset version. Then these versions should be surfaced through the file name or file metadata.