
[FT] Enhancing CorpusLevelTranslationMetric with Asian Language Support #478

Open

ryan-minato opened this issue Dec 27, 2024 · 0 comments
Labels: feature request (New feature/request)

@ryan-minato (Contributor)
Issue encountered

While working on several Japanese benchmark tasks, I observed that the standard BLEU, CHRF, and TER metrics give suboptimal results for Asian languages, whose text is not space-delimited and is therefore poorly served by the default tokenization.
To address this, I propose adding a parameter to CorpusLevelTranslationMetric that allows it to use tokenizers tailored for Asian languages.
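
For illustration, here is a minimal sketch using sacrebleu directly (outside of lighteval) that shows the effect of tokenization on a Japanese sentence pair. The example sentences are made up for demonstration, and `ja-mecab` requires the optional dependency (`pip install sacrebleu[ja]`).

```python
# Minimal sketch, using sacrebleu directly rather than lighteval's metric,
# to illustrate the tokenization problem for unsegmented Japanese text.
from sacrebleu.metrics import BLEU

hypotheses = ["猫がマットの上に座った"]
references = [["猫はマットの上に座っていた"]]

# With the default 13a tokenizer, the unsegmented sentence is treated as
# (almost) a single token, so n-gram overlap is essentially zero.
print(BLEU(tokenize="13a").corpus_score(hypotheses, references))

# With the MeCab-based Japanese tokenizer, the sentences are segmented into
# words and the score reflects the actual overlap.
print(BLEU(tokenize="ja-mecab").corpus_score(hypotheses, references))
```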

Solution/Feature

SacreBLEU already ships tokenizers designed for languages without space-separated words, such as Japanese and Chinese. With a small change to the implementation, CorpusLevelTranslationMetric could be extended to handle these languages properly.

https://github.com/mjpost/sacrebleu/blob/0f351010b8b641aaa59fe75b98d7cc522bf221eb/sacrebleu/metrics/bleu.py#L110-L208
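
As a rough sketch of what the change could look like, the metric would simply forward a language hint to sacrebleu. The class name and the `lang` parameter below are illustrative, not lighteval's actual API; only the sacrebleu calls (`BLEU(tokenize=...)`, `TER(asian_support=...)`, `CHRF()`) are real.

```python
# Illustrative sketch only: the class name and the `lang` parameter are
# hypothetical; the sacrebleu constructors are the real API being wrapped.
from sacrebleu.metrics import BLEU, CHRF, TER


class CorpusLevelTranslationMetricSketch:
    def __init__(self, metric_type: str, lang: str | None = None):
        if metric_type == "bleu":
            # Map the language hint to one of sacrebleu's built-in tokenizers.
            tokenize = {"ja": "ja-mecab", "ko": "ko-mecab", "zh": "zh"}.get(lang)
            self.metric = BLEU(tokenize=tokenize) if tokenize else BLEU()
        elif metric_type == "ter":
            # TER exposes a dedicated flag instead of a tokenizer choice.
            self.metric = TER(asian_support=lang in {"ja", "ko", "zh"})
        elif metric_type == "chrf":
            # chrF operates on characters, so no language-specific handling.
            self.metric = CHRF()
        else:
            raise ValueError(f"Unknown metric type: {metric_type}")

    def compute(self, hypotheses: list[str], references: list[list[str]]) -> float:
        return self.metric.corpus_score(hypotheses, references).score
```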

