MT2023@UdS, guc2esp Translation

Organized by cristinae

MT2023@UdS

Wayúunaiki to Spanish translation

Wayúunaiki is the native language spoken in the Wayúu community, located in the Caribbean region connecting Colombia and Venezuela, where the language coexists with Spanish. In this challenge we want to create translation resources for the Wayúu community.

The shared task

Phase I, development: We provide training data extracted from the Tatoeba challenge [1,2]. It mostly belongs to the religious domain. We held out a set from the original Tatoeba corpus: the in-domain test set that will be used to evaluate your MT engine in this phase. In addition to the parallel corpus we provide, you can use any other data you can find to train your MT engine, but please don't use the original Tatoeba corpus, which also contains the test set. Upload the Spanish translation of the test set to Codalab. The translation will be evaluated using BLEU, TER, chrF and COMET. chrF is the official evaluation metric of the challenge.

Phase II, evaluation: We provide an out-of-domain test set (general domain) in Wayúunaiki and you will have a week to upload its Spanish translation to Codalab. The translation will be evaluated using BLEU, TER, chrF and COMET, but only chrF is the official evaluation metric of the challenge. Note that the best system on an in-domain test set is not necessarily the best system on a general-domain one.
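Phase I asks you not to train on the original Tatoeba corpus because it contains the test set. If you collect extra data from other sources, a quick overlap check against the test sources can guard against accidental leakage. The following is a minimal sketch, not part of the official tooling: the function names and the toy sentences are hypothetical, and the normalization (Unicode NFC, lowercasing, whitespace collapsing) is only one reasonable choice.

```python
import unicodedata


def normalize(sentence: str) -> str:
    """Normalize Unicode form, case and whitespace so near-identical lines compare equal."""
    text = unicodedata.normalize("NFC", sentence)
    return " ".join(text.lower().split())


def drop_test_overlap(pairs, test_sources):
    """Return the (source, target) pairs whose source side does not occur in the test set."""
    held_out = {normalize(s) for s in test_sources}
    return [(src, tgt) for src, tgt in pairs if normalize(src) not in held_out]


# Hypothetical placeholder data; real input would be read from your own corpus files.
extra_pairs = [
    ("sentence a", "frase a"),
    ("Sentence  B", "frase b"),  # same as a test sentence up to case and spacing
    ("sentence c", "frase c"),
]
test_sources = ["sentence b"]

clean_pairs = drop_test_overlap(extra_pairs, test_sources)
print(len(clean_pairs))
```

Exact matching after normalization only catches verbatim copies; near-duplicates would need a fuzzier comparison.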

What can you do?

Baseline: You can reproduce the SMT and/or the NMT system developed in class. Tuning the NMT hyperparameters is worthwhile.

Data: The training data is not clean at all and it is insufficient. Can you improve on that?

Languages: The Spanish in this dataset comes from Latin America. Wayúunaiki is an agglutinative language while Spanish is a fusional language. Can you ease training in such a difficult context?

Advanced: Multilingualism. Is there any other language pair with available data that can help with joint training? Any closely related language, even with only monolingual data?
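As a starting point for the Data item above, a few cheap heuristics often remove the worst noise from a parallel corpus. The sketch below is an illustration under stated assumptions, not the organizers' pipeline: the function name, the thresholds (`max_ratio`, `max_chars`) and the toy pairs are all hypothetical. Lengths are compared in characters rather than words, since word counts can differ a lot between an agglutinative and a fusional language.

```python
def clean_parallel(pairs, max_ratio=3.0, max_chars=1000):
    """Filter noisy (source, target) pairs with a few cheap heuristics."""
    seen = set()
    kept = []
    for src, tgt in pairs:
        # Collapse runs of whitespace and strip the ends.
        src, tgt = " ".join(src.split()), " ".join(tgt.split())
        if not src or not tgt:
            continue  # one side is empty
        if len(src) > max_chars or len(tgt) > max_chars:
            continue  # overly long line, often a misaligned block
        if max(len(src), len(tgt)) / min(len(src), len(tgt)) > max_ratio:
            continue  # implausible length mismatch
        if src == tgt:
            continue  # untranslated copy
        if (src, tgt) in seen:
            continue  # exact duplicate
        seen.add((src, tgt))
        kept.append((src, tgt))
    return kept


# Hypothetical placeholder pairs standing in for real corpus lines.
raw = [
    ("aaa bbb", "ccc ddd"),
    ("aaa bbb", "ccc ddd"),      # duplicate
    ("", "ccc"),                 # empty source
    ("aaa", "ccc " * 10),        # length mismatch
    ("same text", "same text"),  # copy
    ("  eee   fff ", "ggg hhh"), # extra whitespace
]
cleaned = clean_parallel(raw)
print(cleaned)
```

The thresholds should be tuned by inspecting what gets discarded; with so little data, an overly strict filter can hurt more than it helps.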

What are other people doing?

  • Survey of Low-Resource Machine Translation [3]
  • Shared Task: General Machine Translation at WMT (not low-resourced) [4]
  • AmericasNLP 2023 Shared Task on Machine Translation into Indigenous Languages (and previous years) [5]

Timeline

10/07/2023: Competition set up

18/07/2023: Competition starts

11/09/2023: Test phase starts

17/09/2023: Test phase ends

20/09/2023: Deadline for filling the form

References

[1] Jörg Tiedemann. 2020. The Tatoeba Translation Challenge – Realistic Data Sets for Low Resource and Multilingual MT. In Proceedings of the Fifth Conference on Machine Translation, pages 1174–1182, Online. Association for Computational Linguistics.

[2] https://github.com/Helsinki-NLP/Tatoeba-Challenge

[3] Barry Haddow, Rachel Bawden, Antonio Valerio Miceli Barone, Jindřich Helcl, and Alexandra Birch. 2022. Survey of Low-Resource Machine Translation. Computational Linguistics, 48(3):673–732.

[4] https://www.statmt.org/wmt22/translation-task.html

[5] https://turing.iimas.unam.mx/americasnlp/2023_st.html

Submission and Evaluation

The challenge is evaluated with automatic metrics only. We will use BLEU, TER, chrF (sacreBLEU versions) and COMET. chrF is the official evaluation metric. COMET might be misleading, as its supported languages do not include Wayúunaiki (but do include Spanish). When submitting a run, be patient: COMET runs on CPU, so it might take 20 minutes before your results appear on the leaderboard.
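To build intuition for the official metric, here is a simplified, sentence-level sketch of a chrF-style score: character n-gram precision and recall are averaged over orders 1 to 6 and combined into an F-score with beta = 2, so recall weighs more than precision. This is not the sacreBLEU implementation used on the leaderboard, so its numbers will not match official scores exactly; the function names are hypothetical and it is meant only for quick sanity checks.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after removing all whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF on a 0-100 scale for a single hypothesis/reference pair."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # a text shorter than n characters has no n-grams of this order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


score_same = simple_chrf("hola mundo", "hola mundo")
score_close = simple_chrf("hola mundo", "hola mundos")
score_none = simple_chrf("abc", "xyz")
print(score_same, score_close, score_none)
```

Because the score works on characters rather than words, a hypothesis that gets most of a long word right still earns partial credit.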

Phase I, development: Name your file tatoeba.test.guc2esp.esp and zip it. Upload your zipped translation. Notice that only zip compression works, and non-zipped files will result in an error.

Phase II, evaluation: Name your file general.test.guc2esp.esp and zip it. Upload your zipped translation. Notice that only zip compression works, and non-zipped files will result in an error.
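Since a wrong file name or an unzipped upload results in an error, it can help to script the packaging step. The sketch below uses Python's standard `zipfile` module. The helper name, the archive name and the phase keys are hypothetical, and it assumes one translation per line with the file placed at the root of the archive, which the description above does not state explicitly.

```python
import tempfile
import zipfile
from pathlib import Path

# File names required by the two phases, as given in the text above.
EXPECTED_NAMES = {
    "development": "tatoeba.test.guc2esp.esp",
    "evaluation": "general.test.guc2esp.esp",
}


def package_submission(translations, phase, out_dir="."):
    """Write one translation per line into the expected file name inside a zip archive."""
    name = EXPECTED_NAMES[phase]
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    zip_path = out_dir / f"{phase}_submission.zip"  # archive name is an arbitrary choice
    # Collapse internal whitespace so every translation stays on a single line.
    body = "\n".join(" ".join(line.split()) for line in translations) + "\n"
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(name, body)
    return zip_path


# Usage with placeholder translations, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = package_submission(["hola mundo", "buenos días"], "development", tmp)
    with zipfile.ZipFile(path) as zf:
        names = zf.namelist()
        content = zf.read(names[0]).decode("utf-8")
print(names)
```

Checking that the number of lines in the file equals the number of test sentences before uploading avoids spending one of the limited submissions on a misaligned file.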

MT23@UdS: Rules

Submissions for Phase I are optional but recommended. You can tune your system and compare it with those of other participants here. You may make up to 20 submissions per day and 100 in total per team.

Submissions for Phase II are mandatory. You may make up to 5 submissions on any day of the submission week, but no more than 5 in total per team. If more are submitted, only the first 5 (not the top-performing 5) will count.

IMPORTANT! Follow the naming in the evaluation section for submission.

At the end of the competition you have to fill in a form with the characteristics of your best system according to chrF. Visit the form.

The winner of the competition is the team which achieves the highest chrF score on the general domain test data. The full team gets the 10% bonus this semester.

Development

Start: July 10, 2023, midnight

Description: Development phase: create MT models, translate the in-domain test set and submit the translation.

Evaluation

Start: Sept. 10, 2023, midnight

Description: Final phase: Evaluation on the general domain test.

Improvements

Start: Sept. 18, 2023, midnight

Description: Post-evaluation phase: Anyone can submit new systems (non-graded)

Competition Ends

Never

# Username Score
1 androuv 21.79