Our project started as an idea to address low-resource languages, focusing on Mauritian Creole and the Kurdish dialect Kobani. We aim to collect and curate language data to support natural language processing, especially the development of robust translation systems for low-resource languages.
Our guiding questions are:
- (Q1) How can we create comprehensive, high-quality language datasets from diverse data sources of varying quality?
- (Q2) How can we ensure correct, useful, and high-quality translations and linguistic annotations while accounting for variation and dialectal nuances?
The project targets native speakers, language experts, and language technology practitioners. We follow a data-driven approach that includes data acquisition, evaluation, and risk mitigation. Our project can contribute to the UN Sustainable Development Goals of Quality Education and Reduced Inequalities by preserving languages, promoting inclusivity, and fostering data literacy.
Starting with an ambitious plan of four phases (see figure below), we sometimes felt like we were only scratching the surface during our one-semester student project.
Nonetheless, we learned a lot working on this project and believe we have built a strong foundation for future work to expand on with new translations and annotations.
Kobani (which has no ISO code, so we use "kob") is a dialect of Kurmanji, also known as Northern Kurdish (kmr).
Distributed under the Apache License. See LICENSE.txt for more information.
Christian Schuler - @christianschuler8989 & Homepage - christianschuler8989(4T)gmail.com
Deepesha Saurty
Tramy Thi Tran - @TranyMyy
Raman Ahmad - @RamanAhmad & Homepage
Ānrán Wáng - @AnranW - anran.wang (thesymbolforemail)tum.de
- Digital and Data Literacy in Teaching Lab, which funded this student project with 10,000€.
A list of helpful resources we would like to give credit to:
- Best-README-Template
- Potato: the POrtable Text Annotation TOol
- Language Identification with Support for More Than 2000 Labels
- NLLB as part of Fairseq
- NLLB Seed Data
- (Young et al., 2014) Flickr30k
- (Saichyshyna et al., 2023) Extension Multi30K: Multimodal Dataset for Integrated Vision and Language Research in Ukrainian
- (Xie et al., 2023) CCMB: A Large-scale Chinese Cross-modal Benchmark
- (Elliott et al., 2017) Findings of the Second Shared Task on Multimodal Machine Translation and Multilingual Image Description