
# Cross-lingual analysis

This repository contains the translation code used in the paper A cost-benefit analysis of cross-lingual transfer methods. In this work, we analyze cross-lingual methods on three tasks in terms of their effectiveness (e.g., accuracy), development and deployment costs, and their latencies at inference time. We experiment with the following transfer learning techniques: 1) fine-tuning a bilingual model on a source language and evaluating it on the target language without translation, i.e., in a zero-shot manner; 2) automatically translating the training dataset into the target language; 3) automatically translating the test set into the source language at inference time and evaluating a model fine-tuned in English. Finally, by combining zero-shot and translation methods, we achieve state-of-the-art results on two of the three datasets used in this work. The study is the result of an ongoing Master's program.
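For illustration, below is a minimal sketch of the translate-test approach (method 3) for NLI. It assumes off-the-shelf Hugging Face checkpoints, `Helsinki-NLP/opus-mt-ROMANCE-en` for translation and `textattack/bert-base-uncased-MNLI` as the English model; these are hypothetical choices for the sketch, not necessarily the models used in our experiments:

```python
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          pipeline)

# Hypothetical checkpoints for illustration only; the paper's experiments
# use the authors' own translation setup and fine-tuned models.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-ROMANCE-en")
nli_tokenizer = AutoTokenizer.from_pretrained("textattack/bert-base-uncased-MNLI")
nli_model = AutoModelForSequenceClassification.from_pretrained(
    "textattack/bert-base-uncased-MNLI")

def translate_test_nli(premise_pt: str, hypothesis_pt: str) -> int:
    # Translate both sentences of the Portuguese test pair into English.
    premise_en = translator(premise_pt)[0]["translation_text"]
    hypothesis_en = translator(hypothesis_pt)[0]["translation_text"]
    # Classify the translated pair with a model fine-tuned only on English.
    inputs = nli_tokenizer(premise_en, hypothesis_en,
                           return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = nli_model(**inputs).logits
    return int(logits.argmax(dim=-1))  # predicted label id
```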

## Evaluation benchmarks

The models were benchmarked on three tasks (Question Answering, Natural Language Inference, and Passage Text Ranking) and compared to previously published results. The metrics are Exact Match and F1-score for Q&A, Accuracy and F1-score for NLI, and MRR@10 for Text Ranking. The table below shows our NLI results on ASSIN2:

| Model | Pre-train | Fine-tune | F1 | Accuracy |
|---|---|---|---|---|
| mBERT (Souza et al.) | 100 languages | ASSIN2 | 0.8680 | 0.8680 |
| PTT5 (Carmo et al.) | EN & PT | ASSIN2 | 0.8850 | 0.8860 |
| BERTimbau Large (Souza et al.) | EN & PT | ASSIN2 | 0.9000 | 0.9000 |
| BERT-pt (ours) | EN & PT | MNLI + ASSIN2 | 0.9207 | 0.9207 |
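For reference, MRR@10, the Text Ranking metric reported in the paper, can be computed as in the following generic sketch (not code from this repository):

```python
def mrr_at_10(first_relevant_ranks):
    """Mean Reciprocal Rank at cutoff 10.

    `first_relevant_ranks` holds, for each query, the 1-based rank of the
    first relevant passage in the ranked list, or None if no relevant
    passage appears in the top 10.
    """
    total = 0.0
    for rank in first_relevant_ranks:
        if rank is not None and rank <= 10:
            total += 1.0 / rank
    return total / len(first_relevant_ranks)

# Three queries with first relevant hits at ranks 1, 3, and none in top 10:
print(mrr_at_10([1, 3, None]))  # (1 + 1/3 + 0) / 3 ≈ 0.4444
```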

## How to Translate

We make available the translated data and the respective notebooks with the translation code. The SQuAD and MNLI datasets are downloaded directly from the notebooks in this repository; we also provide the FaQuAD and ASSIN2 datasets.

|  | SQuAD | FaQuAD | MNLI | ASSIN2 |
|---|---|---|---|---|
| Training examples | 86,288 | 837 | 392,702 | 6,500 |
| Test examples | 21,557 | 63 | 20,000 | 2,448 |
| Translate Train (batch size = 1) | 34h | - | 36h | - |
| Translate Test (batch size = 1) | - | 1m 30s | - | 31m |
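The times above were measured with batch size 1; batching is the main lever for reducing translation wall-clock time. Below is a minimal sketch of batched translation with MarianMT, again assuming the hypothetical `Helsinki-NLP/opus-mt-ROMANCE-en` checkpoint (Portuguese to English, the translate-test direction), not necessarily the model used in our notebooks:

```python
import torch
from transformers import MarianMTModel, MarianTokenizer

# Hypothetical checkpoint for illustration (Romance languages -> English);
# the notebooks in this repository may use a different translation model.
NAME = "Helsinki-NLP/opus-mt-ROMANCE-en"
tokenizer = MarianTokenizer.from_pretrained(NAME)
model = MarianMTModel.from_pretrained(NAME)

def translate_batched(sentences, batch_size=32):
    """Translate a list of sentences in batches instead of one by one."""
    translations = []
    for i in range(0, len(sentences), batch_size):
        batch = tokenizer(sentences[i:i + batch_size], return_tensors="pt",
                          padding=True, truncation=True)
        with torch.no_grad():
            generated = model.generate(**batch)
        translations.extend(
            tokenizer.batch_decode(generated, skip_special_tokens=True))
    return translations
```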

## References

[1] BERTimbau: Pretrained BERT Models for Brazilian Portuguese

[2] PTT5: Pretraining and validating the T5 model on Brazilian Portuguese data

## How do I cite this work?

```bibtex
@article{cross-lingual2021,
  title={A cost-benefit analysis of cross-lingual transfer methods},
  author={Moraes, Guilherme and Bonifácio, Luiz Henrique and Rodrigues de Souza, Leandro and Nogueira, Rodrigo and Lotufo, Roberto},
  journal={arXiv preprint arXiv:2105.06813},
  url={https://arxiv.org/abs/2105.06813},
  year={2021}
}
```
