GitHub - Low-ResourceDialectology/TextAsCorpusRep: Text As Corpus Repository for Multilingual Machine Translation of Low-Resource Languages

TextAsCorpusRep

Multilingual Text As Corpus Repository for Machine Translation of Low-Resource Languages

About The Project

Our project started as an idea to address low-resource languages, focusing on Mauritian Creole and the Kurdish dialect Kobani. We aim to collect and curate language data to support natural language processing, especially the development of robust translation systems for low-resource languages.

Our guiding questions are:

(Q1) How can we create comprehensive, high-quality language datasets from diverse data sources of varying quality?
(Q2) How can we ensure correct, useful, high-quality translations and linguistic annotations, given dialectal variation and nuance?

The project targets native speakers, language experts, and language technology practitioners. We follow a data-driven approach, including data acquisition, evaluation, and risk mitigation. Our project can contribute to the UN's Sustainable Development Goals of Quality Education and Reduced Inequalities by preserving languages, promoting inclusivity, and fostering data literacy.

Initial Approach

Starting with an ambitious four-phase plan (see figure below), we sometimes felt like we were only scratching the surface during our one-semester student project.

Nonetheless, we learned a lot working on this project and believe we have built a strong foundation for future work to expand on with new translations and annotations.

Collected Languages

Morisien, or Mauritian Creole (mfe)

Kobani (no ISO code; we use "kob"), a dialect of Kurmanji, which is also known as Northern Kurdish (kmr)

Vietnamese (vie)

Chinese (zho)

Additionally Included Languages

English (eng)

German (deu)

French (fra)

Ukrainian (ukr)

Czech (ces)
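
For scripts that need to refer to these languages, the codes listed above could be kept in one small lookup table. The following is only a minimal sketch, not code from this repository: the names `COLLECTED`, `ADDITIONAL`, `PROJECT_SPECIFIC_CODES`, `language_name`, and `is_project_specific` are hypothetical, and the only data used are the codes and names from the two lists above.

```python
# Hypothetical lookup table built from the language lists in this README.
# None of these names are taken from the repository's scripts.

# Languages collected by the project.
COLLECTED = {
    "mfe": "Morisien (Mauritian Creole)",
    "kob": "Kobani",  # project-specific code, Kobani has no ISO code
    "vie": "Vietnamese",
    "zho": "Chinese",
}

# Additionally included languages.
ADDITIONAL = {
    "eng": "English",
    "deu": "German",
    "fra": "French",
    "ukr": "Ukrainian",
    "ces": "Czech",
}

# Codes that this project made up because no ISO code exists.
PROJECT_SPECIFIC_CODES = {"kob"}


def language_name(code: str) -> str:
    """Return the language name for a code, ignoring letter case."""
    key = code.lower()
    if key in COLLECTED:
        return COLLECTED[key]
    if key in ADDITIONAL:
        return ADDITIONAL[key]
    raise KeyError(f"unknown language code: {code!r}")


def is_project_specific(code: str) -> bool:
    """Return True if the code is a project convention rather than an ISO code."""
    return code.lower() in PROJECT_SPECIFIC_CODES
```

Keeping the project-specific codes in a separate set makes it easy to map "kob" to another identifier (for example the parent code kmr) when exchanging data with tools that only accept ISO codes.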

License

Distributed under the Apache License. See LICENSE.txt for more information.

Contact

Christian Schuler - @christianschuler8989 & Homepage - christianschuler8989(4T)gmail.com

Deepesha Saurty - [email protected]

Tramy Thi Tran - @TranyMyy - [email protected]

Raman Ahmad - @RamanAhmad & Homepage

Ānrán Wáng - @AnranW - anran.wang (thesymbolforemail)tum.de

Acknowledgments

The Digital and Data Literacy in Teaching Lab, which funded this student project with 10,000€.

A list of helpful resources we would like to give credit to:

Folders and files

corpus
corpus_information
docs
images
scripts
.gitignore
LICENSE
README.md
requirements.txt