There are major errors (words missing, punctuation in the wrong order) in sentences derived from the TIGER treebank. Something seems to have gone terribly wrong in pre-processing steps or in the dependency conversion process.
Looking for nearly identical sentences in UD and TIGER, I can find around 1450 sentences that appear to come from TIGER and about 740 of these contain errors in the source text. Approximately 310 only concern punctuation, while the remaining ~430 additionally involve missing sentence-initial words.
The errors are not equally distributed across the UD subcorpora. They affect 2% of the training corpus, 17% of the development corpus, and 29% (!) of the test corpus. A full half of the sentences with missing initial words are in the test corpus, which means that 22% of sentences in the test corpus are missing the first word in the sentence.
The problems are described in detail below, but here is a quick summary of the distribution of errors:
| Subcorpus | Punctuation Only | Missing Words (and maybe also Punct.) |
|---|---|---|
| Train | 162 | 131 |
| Dev | 80 | 85 |
| Test | 63 | 220 |
Because the errors involve missing and misordered tokens, fixing them would require a fair amount of reannotation. I don't know what is reasonable to do or expect within the constraints of the UD project, and obviously some annotation errors and noise are expected in any corpus. Still, this seems egregious, especially to this degree in the dev/test corpora.
These kinds of artificially ill-formed sentences do not really seem to be representative of German, which is concerning when the UD corpora are being used more and more for development and evaluation. I would at a minimum propose marking the problematic sentences somehow, especially the ones with missing words, so that developers can exclude them as desired.
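As a rough illustration of what excluding marked sentences could look like on the developer side, here is a minimal sketch. It assumes the problematic sentences are identified by their `# sent_id = ...` comment in the CoNLL-U file and that sentences are separated by blank lines; the function name `filter_conllu` and the way the ID list is supplied are hypothetical, not part of any UD tooling.

```python
def filter_conllu(conllu_text, excluded_ids):
    """Return the CoNLL-U text without the sentences whose sent_id is excluded.

    Assumption: each sentence block carries a '# sent_id = <id>' comment line.
    Blocks without such a comment are always kept.
    """
    kept = []
    for block in conllu_text.strip().split("\n\n"):
        sent_id = None
        for line in block.splitlines():
            if line.startswith("# sent_id"):
                sent_id = line.split("=", 1)[1].strip()
                break
        if sent_id not in excluded_ids:
            kept.append(block)
    # CoNLL-U sentences are each terminated by a blank line.
    return "".join(block + "\n\n" for block in kept)


# Toy input with abbreviated token lines (real files have ten columns).
sample = (
    "# sent_id = train-s1\n1\tA\n\n"
    "# sent_id = test-s374\n1\tdeutscher\n\n"
    "# sent_id = test-s2\n1\tB\n\n"
)
filtered = filter_conllu(sample, {"test-s374"})
```

The same ID set could just as well be used to report scores with and without the affected sentences instead of dropping them.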
Problems
The problems I've found:
1. The first token is missing in many sentences
2. The order of adjacent sentence-internal punctuation tokens is reversed
3. Hyphens from compounds have been converted to --
4. ASCII double quote " (a character that does not appear in TIGER*) is added at the beginning and/or end of a full sentence (where often the first word is missing, too) or appears as a normalization of `` sentence-initially (almost exclusively in the test corpus)
5. Ordinal numbers are split incorrectly (or at the very least inconsistently) into two tokens (e.g., 22 . Oktober)
Examples
Here is a sentence that shows problems 1-3 (train-s2181):
Chef Andy Grove sieht die größte Herausforderung darin `` , alles zu
tun , um die Zahl der Nutzer in der PC -- Welt zu steigern '' .
The original sentence from TIGER is:
Ihr Chef Andy Grove sieht die größte Herausforderung darin , `` alles zu
tun , um die Zahl der Nutzer in der PC-Welt zu steigern '' .
- Ihr is missing
- , `` is reversed to `` ,
- PC-Welt becomes PC -- Welt
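These three differences can also be surfaced mechanically. The following is a minimal sketch using Python's standard `difflib` on whitespace-separated tokens; the helper name `describe_differences` is hypothetical, and this is not necessarily how the comparison for this report was done.

```python
from difflib import SequenceMatcher


def describe_differences(tiger_tokens, ud_tokens):
    """List the non-matching spans between a TIGER and a UD token sequence."""
    matcher = SequenceMatcher(None, tiger_tokens, ud_tokens, autojunk=False)
    return [
        (tag, tiger_tokens[i1:i2], ud_tokens[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


tiger = (
    "Ihr Chef Andy Grove sieht die größte Herausforderung darin , `` alles zu "
    "tun , um die Zahl der Nutzer in der PC-Welt zu steigern '' ."
).split()
ud = (
    "Chef Andy Grove sieht die größte Herausforderung darin `` , alles zu "
    "tun , um die Zahl der Nutzer in der PC -- Welt zu steigern '' ."
).split()

diffs = describe_differences(tiger, ud)
for tag, tiger_span, ud_span in diffs:
    print(tag, tiger_span, ud_span)
```

The first reported edit is the deletion of Ihr and the last is the replacement of PC-Welt by PC, --, Welt; the reversed punctuation shows up as small edits between those two.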
Here is another sentence (test-s544) with problems 3-4:
" verletzt wurde eine Korrespondentin des deutschen ARD -- Fernsehens .
And the original TIGER sentence:
Leicht verletzt wurde eine Korrespondentin des deutschen ARD-Fernsehens .
As you would expect, the missing words often lead to sentences that are not well-formed (test-s545):
" Verletzungen können Zahlungen und Handelserleichterungen künftig
ausgesetzt werden .
Instead of:
Bei Verletzungen können Zahlungen und Handelserleichterungen künftig
ausgesetzt werden .
(Verletzungen is annotated as dep and has no morphological features.)
And the even more entertaining (test-s374):
deutscher Touristin muß lebenslang in Haft
Instead of:
Mörder deutscher Touristin muß lebenslang in Haft
(Touristin is indeed annotated as nsubj, deutscher and Touristin are Case=Nom, and deutscher is somehow Degree=cmp,pos!)
Detailed Results
After normalizing the double hyphen "--" to "-", ignoring cases that result in matched rather than mismatched quotes, and skipping full sentences that were merely embedded in longer sentences, I have found:
- 302 sentences that differ only in punctuation presence, appearance, and/or order
- 432 sentences that differ in initial words (and maybe also punctuation)
- 10 sentences that differ in final or both initial and final tokens (and maybe also punctuation)
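To make the grouping above concrete, here is a minimal sketch of how a matched sentence pair might be sorted into such groups. The labels, the helper names, and the rule that a token counts as punctuation when it contains no letter or digit are all assumptions made for illustration; the heuristics actually used for the counts above may differ.

```python
def content_tokens(tokens):
    """Undo the ' -- ' hyphen split and drop tokens without letters or digits."""
    text = " ".join(tokens).replace(" -- ", "-")
    return [tok for tok in text.split() if any(ch.isalnum() for ch in tok)]


def categorize(tiger_tokens, ud_tokens):
    """Assign a coarse, illustrative label to a TIGER/UD sentence pair."""
    tiger = content_tokens(tiger_tokens)
    ud = content_tokens(ud_tokens)
    if tiger == ud:
        return "identical" if tiger_tokens == ud_tokens else "punctuation only"
    if ud and len(ud) < len(tiger):
        if tiger[-len(ud):] == ud:
            return "missing initial words"
        if tiger[:len(ud)] == ud:
            return "missing final words"
    return "other"


tiger = "Mörder deutscher Touristin muß lebenslang in Haft".split()
ud = "deutscher Touristin muß lebenslang in Haft".split()
label = categorize(tiger, ud)
```

For the test-s374 pair shown earlier, this sketch yields the "missing initial words" label.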
I've attached a summary of the mismatches with the following columns:
I've removed a number of cases by hand that were accidentally caught by my simple heuristics or that didn't seem problematic (typically a full sentence from a quote within a longer sentence, with an initial list numbering or dash, or with an intro like Auch: or FR: or Richter: or Klartext:). I've left a few cases in categories 1-3 where there are differences in punctuation within an embedded sentence (so they are more like category 0 in effect, which is reflected in the counts in the table in the introduction). I would not be surprised if there are still some errors in this list, either cases that are not problematic or cases from TIGER that I didn't detect.
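The search for nearly identical sentences is not spelled out in detail here. One simple way to approximate it is a token-level similarity ratio with a cutoff, as in the minimal sketch below. The function name `best_match`, the threshold of 0.8, and the brute-force loop over all TIGER sentences are assumptions for illustration only, not the heuristics used for this report.

```python
from difflib import SequenceMatcher


def similarity(ud_tokens, tiger_tokens):
    """Token-level similarity in [0, 1] between two sentences."""
    return SequenceMatcher(None, ud_tokens, tiger_tokens, autojunk=False).ratio()


def best_match(ud_tokens, tiger_sentences, threshold=0.8):
    """Return the index of the most similar TIGER sentence, or None.

    A sentence only counts as a match if its similarity reaches the threshold.
    """
    best_index, best_score = None, threshold
    for index, tiger_tokens in enumerate(tiger_sentences):
        score = similarity(ud_tokens, tiger_tokens)
        if score >= best_score:
            best_index, best_score = index, score
    return best_index


tiger_sentences = [
    "Leicht verletzt wurde eine Korrespondentin des deutschen ARD-Fernsehens .".split(),
    "Mörder deutscher Touristin muß lebenslang in Haft".split(),
]
ud_sentence = "deutscher Touristin muß lebenslang in Haft".split()
match = best_match(ud_sentence, tiger_sentences)
```

A pairwise loop like this is quadratic in corpus size, so on the full treebanks one would likely want to narrow the candidates first, for example by indexing on shared token sequences.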
*To be accurate: ASCII double quotes do appear a few times in TIGER, but they look like mistakes.
Reinserting FixTigerDep comments for APPRART insertions that still need
to be reannotated. Additionally a few internal marks on inserted tokens
and lemmas have been removed as intended after manual
inspection/updates.
ud-tiger-misalignments.csv.txt