EtymDB 2.0 Data Statement

A. Curation rationale

To create an etymological relation database, the authors parsed the 2019/10/20 Wiktionary data dump. Using regular expressions, they extracted both in-text information based on relation phrases ("Borrowed from", "Cognate with", "From", ...) and structured information based on the wiki format. Lexical units were grouped by id/lexeme/gloss/language, and duplicate relations or units were merged.
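As a rough illustration of the in-text extraction step, the sketch below matches the three relation phrases quoted above in plain etymology text and merges duplicate relations by collecting them in a set. It is not the authors' code: the relation labels, the `extract_relations` helper, and the assumption that a phrase is followed by a capitalized language name and then a single word are all hypothetical simplifications.

```python
import re

# Hypothetical labels for the phrases quoted in the statement.
PHRASE_TO_RELATION = {
    "Borrowed from": "borrowing",
    "Cognate with": "cognate",
    "From": "inheritance",
}

# Assumed text shape: <phrase> <Language Name> <word>.
PATTERN = re.compile(
    r"(?P<phrase>Borrowed from|Cognate with|From)\s+"
    r"(?P<lang>[A-Z][a-z]+(?: [A-Z][a-z]+)*)\s+"
    r"(?P<word>[^\W\d_]+)"
)


def extract_relations(lexeme, language, etymology_text):
    """Return a set of (source unit, relation label, target unit) triples."""
    relations = set()  # a set merges duplicate relations automatically
    for match in PATTERN.finditer(etymology_text):
        relations.add((
            (lexeme, language),
            PHRASE_TO_RELATION[match.group("phrase")],
            (match.group("word"), match.group("lang")),
        ))
    return relations


# Made-up etymology text; the repeated sentence yields a single relation.
text = (
    "Borrowed from French café. Cognate with Italian caffè. "
    "Borrowed from French café."
)
rels = extract_relations("cafe", "English", text)
```

Keying each lexical unit on a tuple such as (lexeme, language) is one simple way to make grouping and duplicate merging fall out of ordinary set semantics; the real database also uses ids and glosses, which this sketch omits.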

B. Language variety

The extracted data covers 2536 languages, the most represented being English (which constitutes 48% of the database with 911,086 lexemes), Latin (69,224 lexemes), French (34,488 lexemes), Italian (31,295 lexemes) and German (27,009 lexemes). 414 languages are well represented, with more than 100 lexemes each, whereas 769 languages have only one lexeme.

C. "Speaker demographic"

The "speakers" of the Wiktionary are its contributors, and little metadata about them is available. About 5500 people contributed to the data dump used, two thirds of them anonymously. Since they are editing the English Wiktionary, it seems reasonable to assume that they speak English as a first or second language.

D. "Curator demographic"

The data was curated by the authors of the paper, both native French speakers bilingual in English, with basic to professional knowledge of German, Spanish, Italian, Slovak, Polish, and Czech, scholarly knowledge of Latin and Ancient Greek, and limited expertise in Indo-European historical linguistics. Since the authors are more familiar with European languages, there was likely a small bias in data correction, as errors in languages of Europe were probably more likely to be detected during the analysis phases.

E. Speech situation N/A

All the information in the Wiktionary dump used was published between 2003 and 2019 by about 5500 different contributors.

F. Text Characteristics

The Wiktionary is a collaborative dictionary. As such, the text is highly structured, and goes through multiple edits by various Wiktionary contributors before reaching its final form. All text is public.

G. Recording Quality N/A