Issue encountered
While working on several Japanese benchmark tasks, I observed that the standard BLEU, CHRF, and TER metrics are poorly suited to Japanese and other Asian languages, since their default tokenizers assume space-separated words.
To address this, I propose adding a parameter to CorpusLevelTranslationMetric that allows integration with tokenizers tailored for Asian languages.
Solution/Feature
SacreBLEU already includes tokenizers designed for Asian languages, which lack space-separated words. By modifying the implementation slightly, we can extend CorpusLevelTranslationMetric to better handle these languages. See the tokenizer handling in SacreBLEU's BLEU implementation:
https://github.com/mjpost/sacrebleu/blob/0f351010b8b641aaa59fe75b98d7cc522bf221eb/sacrebleu/metrics/bleu.py#L110-L208