Malformed/wrong nextclade.tsv #392
Oh, and there are entries with the same index.
Thanks for reporting this issue @chaoran-chen! This is most likely due to the release of Nextclade 2.12.0 yesterday, which added new columns to the CSV/TSV output. Our current cache system blindly concatenates new Nextclade runs onto the existing cache, so the mismatch in the number of columns leads to a malformed nextclade.tsv. I'm going to re-run our ingest without the cache so that the nextclade.tsv has the same number of columns throughout.
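For illustration, here's a minimal sketch of the kind of header check that would have refused the mismatched append. The function name and file paths are hypothetical, not ncov-ingest's actual cache code:

```python
import csv

def append_nextclade_run(cache_path: str, new_run_path: str) -> None:
    """Append a new Nextclade TSV run to the cached TSV, refusing if the
    headers differ (e.g. Nextclade 2.12.0 adding new columns).

    A sketch only -- not ncov-ingest's actual cache logic.
    """
    with open(cache_path, newline="") as f:
        cache_header = next(csv.reader(f, delimiter="\t"))

    with open(new_run_path, newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        new_header = next(reader)
        if new_header != cache_header:
            # Bail out instead of silently producing rows whose fields no
            # longer line up with the cached header.
            raise ValueError(
                f"refusing to append: cache has {len(cache_header)} columns, "
                f"new run has {len(new_header)}"
            )
        with open(cache_path, "a", newline="") as out:
            csv.writer(out, delimiter="\t").writerows(reader)
```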
To prevent this issue from happening in the future, I see a couple of options:
We can start with the easier option [1] and then take our time to design/implement option [2].
I'd probably agree that the tidiest thing to do here is to implement both. Re 2: even though the error may not be so obvious if columns aren't being added or removed, we could still adjust how something else is calculated in Nextclade, which may lead to discordant data when we're appending files. So I think we should try hard to set up some kind of system whereby whenever there's a Nextclade release we do a full Nextstrain metadata re-run. However, to avoid column addition/removal messing up downstream operations, we could also implement 1. I would naively imagine that not many people are using the somewhat-intermediate nextclade.tsv file rather than the metadata file, but it's possible. I imagine the number is small, and hopefully they could contact us if they ran into trouble?
PS - Really sorry about this (again) @chaoran-chen! Thanks for flagging it and hope it's sorted now!
A few more ideas to complement Jover's list:
Another possibility, although a downgrade, that is much simpler to implement:
Thanks everyone for fixing this so quickly, and don't worry @emmahodcroft! |
This should be doable for us internally, but I'm not sure how much this would affect costs for external users such as Africa CDC on Terra. @j23414 do you happen to have the latest costs of ncov-ingest runs with/without caches on Terra? I know we're pausing the development of ncov-ingest for GISAID on Terra for now, but I just wanted to keep that in mind when making a decision here.
Problem is fixed in the current nextclade.tsvs (both public and GISAID). Sorry for the trouble and thanks for raising this so quickly @chaoran-chen! I should have thought about this when we added new columns to the nextclade.tsv 🤦 I think the most straightforward way to solve cache inconsistencies is to add extra columns to the cache recording which Nextclade version (and dataset version) produced each row.
Cache invalidation is then quite easy to automate, since everything we need is in the cache itself; most of our trouble right now is that we don't include that info in the cache. Will make a PR. The point you raise about the index is a separate one. The index is not meant to be meaningful in our caching case; it's only meaningful if you run Nextclade on a single file, mapping the position of the input sequence to the output result. You can just ignore it.
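As a rough sketch of that idea (the provenance column names and versions below are assumptions, not the ones ncov-ingest actually uses), invalidation could compare each cached row's recorded versions against the current run:

```python
import csv

# Hypothetical provenance columns stored with each cached row; the real
# column names chosen for ncov-ingest may differ.
PROVENANCE = ("nextclade_version", "dataset_name", "dataset_version")

def cache_is_current(cache_path: str, current: dict) -> bool:
    """Return True if every cached row was produced by the current
    Nextclade/dataset versions; otherwise the cache should be discarded
    and a full re-run triggered. A sketch, not the actual implementation.
    """
    with open(cache_path, newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            if any(row.get(col) != current[col] for col in PROVENANCE):
                return False
    return True

# Usage sketch (versions illustrative only):
# cache_is_current("nextclade_cache.tsv", {
#     "nextclade_version": "2.12.0",
#     "dataset_name": "sars-cov-2",
#     "dataset_version": "2023-01-09",
# })
```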
That sounds like a really good solution Cornelius, to put some armour between our foot and our gun 😉 |
While we're at it, we could add these as potential output columns to Nextclade to make it more straightforward to include this optional provenance information for everyone.
I think that the nextclade.tsv of the open dataset is malformed (or wrong in a different way). The `deletions` column contains values for mutations. One way to reproduce the error is to use tsv-utils to inspect the `deletions` column; the expected output would be empty.
The same issue is also reported by the CSV/TSV parser that we use for LAPIS (org.apache.commons.csv.CSVParser), causing the preprocessing pipeline for the open LAPIS instance to crash: GenSpectrum/LAPIS#159