This shared task focuses on automatic methods for estimating the quality of neural machine translation output at run-time, without relying on reference translations. It will cover estimation at the sentence and word levels, as well as critical error detection. This year we place particular emphasis on:
This year we will release test sets for the following language pairs:
In addition to generally advancing the state of the art in quality estimation, our specific goals are:
For all tasks, the datasets and NMT models that generated the translations will be made publicly available.
Participants are also allowed to explore any additional data and resources deemed relevant. Below are the three QE tasks addressing these goals.
Here is some open-source software for QE that might be useful to participants:
For questions regarding the organisation of the task and/or issues with submission to this CodaLab, please use the Forum.
The primary evaluation metric will be the F1-score, and we also plan to report Recall.
Submission Format
The competition will take place on CodaLab.
This year we will also require participants to fill in a form describing their model and data choices for each submission.
For each submission you wish to make (under “Participate > Submit” on CodaLab), please upload a single zip file with the predictions and the system metadata.
For the metadata, we expect a ‘metadata.txt’ file with exactly two non-empty lines, containing the team name and a short system description, respectively. The first line of metadata.txt must contain your team name; you can use your CodaLab username as your team name. The second line of metadata.txt must contain a short description (2-3 sentences) of the system you used to generate the results. This description will not be shown to other participants. Note that submissions without a description will be invalid. It is fine to reuse the same file across multiple submissions/phases if you use the same model (e.g. a multilingual or multitask model).
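For illustration, a metadata.txt could contain exactly the following two lines; the team name and system description below are placeholders, not an actual submission:
my_codalab_username
We fine-tune a multilingual pretrained encoder on the released QE data to predict error spans and their severities. A single multitask model is used for all language pairs.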
For the predictions, we describe the exact expected format below:
Line 1: <DISK FOOTPRINT (in bytes, without compression)>
Line 2: <NUMBER OF PARAMETERS>
Line 3: <NUMBER OF ENSEMBLED MODELS> (set to 1 if there is no ensemble)
Lines 4 to n (one line per test sample): <LANGUAGE PAIR> <METHOD NAME> <SEGMENT NUMBER> <TARGET SENTENCE> <ERROR START INDICES> <ERROR END INDICES> <ERROR TYPES>
Where:
Each field should be delimited by a single tab (<\t>) character.
Output example
2409244995
2280000000
3
he-en <\t> example-ensemble <\t> 0 <\t> This is a sample translation without errors. <\t> -1 <\t> -1 <\t> no-error
he-en <\t> example-ensemble <\t> 1 <\t> This is a sample translation with a span that is considered major error and another span that is considered minor error. <\t> 49 97 <\t> 70 118 <\t> major minor …
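For reference, here is a minimal sketch (in Python) of how such a submission could be assembled; the file name predictions.txt, the example numbers, and the placeholder rows are illustrative assumptions, not requirements of the task.

import zipfile

# Placeholder header values: uncompressed model size in bytes, parameter count, ensemble size.
disk_footprint_bytes = 2409244995
num_parameters = 2280000000
num_ensembled_models = 3

# One tuple per test segment:
# (language pair, method name, segment number, target sentence,
#  error start indices, error end indices, error types)
rows = [
    ("he-en", "example-ensemble", 0,
     "This is a sample translation without errors.", "-1", "-1", "no-error"),
    ("he-en", "example-ensemble", 1,
     "This is a sample translation with a span that is considered major error "
     "and another span that is considered minor error.",
     "49 97", "70 118", "major minor"),
]

# Write the predictions file: three header lines, then one tab-delimited line per segment.
with open("predictions.txt", "w", encoding="utf-8") as f:
    f.write(f"{disk_footprint_bytes}\n{num_parameters}\n{num_ensembled_models}\n")
    for row in rows:
        f.write("\t".join(str(field) for field in row) + "\n")

# Write metadata.txt: line 1 is the team name, line 2 is the short system description.
with open("metadata.txt", "w", encoding="utf-8") as f:
    f.write("my_codalab_username\n")
    f.write("Short 2-3 sentence description of the submitted system.\n")

# Bundle both files into a single zip for upload on CodaLab.
with zipfile.ZipFile("submission.zip", "w") as zf:
    zf.write("predictions.txt")
    zf.write("metadata.txt")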
Phase start dates: Aug. 1, 2023 (midnight) and Aug. 23, 2023 (noon).