Text As Corpus Repository for Multilingual Machine Translation of Low-Resource Languages

Low-ResourceDialectology/TextAsCorpusRep

TextAsCorpusRep

Multilingual Text As Corpus Repository for Machine Translation of Low-Resource Languages

About The Project

Our project started as an idea to address low-resource languages, focusing on Mauritian Creole and the Kurdish dialect Kobani. We aim to collect and curate language data to support natural language processing, especially the development of robust translation systems for low-resource languages.

Guiding questions are:

  • (Q1) How can we create comprehensive, high-quality language datasets from diverse data sources of varying quality?
  • (Q2) How can we ensure correct, useful, high-quality translations and linguistic annotations, given variation and dialectal nuances?

The project targets native speakers, language experts, and language technology practitioners. We follow a data-driven approach that includes data acquisition, evaluation, and risk mitigation. Our project can contribute to the UN Sustainable Development Goals of Quality Education and Reduced Inequalities by preserving languages, promoting inclusivity, and fostering data literacy.

Initial Approach

We started with an ambitious plan consisting of four phases (see figure below), and at times felt we were only scratching the surface during our one-semester student project.

[Figure: the four phases of the project plan]

Nonetheless, we learned a lot working on this project and believe we have built a strong foundation that future work can expand with new translations and annotations.

(back to top)

Collected Languages

Morisien, or Mauritian Creole (mfe)

Kobani (no ISO code; we use "kob"), a dialect of Kurmanji, which is also known as Northern Kurdish (kmr)

Vietnamese (vie)

Chinese (zho)

(back to top)

Additionally Included Languages

English (eng)

German (deu)

French (fra)

Ukrainian (ukr)

Czech (ces)

(back to top)

License

Distributed under the Apache License. See LICENSE.txt for more information.

(back to top)

Contact

Christian Schuler - @christianschuler8989 & Homepage - christianschuler8989(4T)gmail.com

Deepesha Saurty - [email protected]

Tramy Thi Tran - @TranyMyy - [email protected]

Raman Ahmad - @RamanAhmad & Homepage

Ānrán Wáng - @AnranW - anran.wang (thesymbolforemail)tum.de

(back to top)

Acknowledgments

A list of helpful resources we would like to give credit to:

(back to top)
