Malformed/wrong nextclade.tsv #392
Oh, and there are entries with the same index.
Thanks for reporting this issue @chaoran-chen! This is most likely due to the release of Nextclade 2.12.0 yesterday, which added new columns to the CSV/TSV output. Our current cache system blindly concatenates new Nextclade runs onto the existing cache, so the mismatch in the number of columns leads to a malformed nextclade.tsv. I'm going to re-run our ingest without the cache so that the nextclade.tsv has the same number of columns throughout.
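For illustration, here's a minimal sketch of the kind of header check that would have refused the mismatched append. The function name and file paths are hypothetical, not ncov-ingest's actual cache code:

```python
import csv

def append_nextclade_run(cache_path: str, new_run_path: str) -> None:
    """Append a new Nextclade TSV run to the cached TSV, refusing if the
    headers differ (e.g. Nextclade 2.12.0 adding new columns).

    A sketch only -- not ncov-ingest's actual cache logic.
    """
    with open(cache_path, newline="") as f:
        cache_header = next(csv.reader(f, delimiter="\t"))

    with open(new_run_path, newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        new_header = next(reader)
        if new_header != cache_header:
            # Bail out instead of silently producing rows whose fields no
            # longer line up with the cached header.
            raise ValueError(
                f"refusing to append: cache has {len(cache_header)} columns, "
                f"new run has {len(new_header)}"
            )
        with open(cache_path, "a", newline="") as out:
            csv.writer(out, delimiter="\t").writerows(reader)
```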
To prevent this issue from happening in the future, I see a couple of options:
We can start with the easier option [1] and then take our time to design/implement option [2].
I'd probably agree that the tidiest thing to do here is to implement both. Re 2: even though the error may not be so obvious if columns aren't being added or removed, we could still adjust how something else is calculated in Nextclade, which may lead to discordant data when we're appending files. So I think we should try hard to set up some kind of system whereby whenever there's a Nextclade release we do a full Nextstrain metadata re-run. However, to avoid column addition/removal messing up downstream operations, we could also implement 1. I would naively imagine that not many people are using the somewhat-intermediate nextclade.tsv file rather than the metadata file, but it's possible. I imagine the number is small, and hopefully they could contact us if they ran into trouble?
PS - Really sorry about this (again) @chaoran-chen! Thanks for flagging it and hope it's sorted now!
A few more ideas to complement Jover's list:
Another possibility, although a downgrade, that is much simpler to implement:
Thanks everyone for fixing this so quickly, and don't worry @emmahodcroft! |
This should be doable for us internally, but I'm not sure how much this would affect costs for external users such as Africa CDC on Terra. @j23414 do you happen to have the latest costs of ncov-ingest runs with/without caches on Terra? I know we're pausing the development of ncov-ingest for GISAID on Terra for now, but I just wanted to keep that in mind when making a decision here.
Problem is fixed in the current nextclade.tsvs (both public and GISAID). Sorry for the trouble and thanks for raising this so quickly @chaoran-chen! I should have thought about this when we added new columns to the nextclade.tsv 🤦 I think the most straightforward way to solve cache inconsistencies is to add extra columns to the cache recording which Nextclade version (and dataset version) produced each row.
Cache invalidation is then quite easy to automate, since everything we need is in the cache itself; most of our trouble right now is that we don't include that info in the cache. Will make a PR. The point you raise about the index is a separate one. The index is not meant to be meaningful in our caching case; it's only meaningful if you run Nextclade on a single file, mapping the position of the input sequence to the output result. You can just ignore it.
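As a rough sketch of that idea (the provenance column names and versions below are assumptions, not the ones ncov-ingest actually uses), invalidation could compare each cached row's recorded versions against the current run:

```python
import csv

# Hypothetical provenance columns stored with each cached row; the real
# column names chosen for ncov-ingest may differ.
PROVENANCE = ("nextclade_version", "dataset_name", "dataset_version")

def cache_is_current(cache_path: str, current: dict) -> bool:
    """Return True if every cached row was produced by the current
    Nextclade/dataset versions; otherwise the cache should be discarded
    and a full re-run triggered. A sketch, not the actual implementation.
    """
    with open(cache_path, newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            if any(row.get(col) != current[col] for col in PROVENANCE):
                return False
    return True

# Usage sketch (versions illustrative only):
# cache_is_current("nextclade_cache.tsv", {
#     "nextclade_version": "2.12.0",
#     "dataset_name": "sars-cov-2",
#     "dataset_version": "2023-01-09",
# })
```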
That sounds like a really good solution Cornelius, to put some armour between our foot and our gun 😉 |
While we're at it, we could add these as potential output columns to Nextclade to make it more straightforward to include this optional provenance information for everyone.
I think that the nextclade.tsv of the open dataset is malformed (or wrong in a different way). The `deletions` column contains values for mutations. One way to reproduce the error is to use tsv-utils to inspect the `deletions` column; the expected output would be empty.
The same issue is also reported by the CSV/TSV parser that we use for LAPIS (org.apache.commons.csv.CSVParser), causing the preprocessing pipeline for the open LAPIS instance to crash: GenSpectrum/LAPIS#159