QE shared task 2023 Task 1 - Word-level

Organized by fblain

Competition phases

  • Competition: starts Aug. 1, 2023, midnight UTC
  • Post-Competition: starts Aug. 18, 2023, noon UTC
  • The competition never ends.

This shared task focuses on automatic methods for estimating the quality of neural machine translation output at run-time, without relying on reference translations. It will cover estimation at the sentence and word levels, as well as critical error detection. This year we put emphasis on:

  • Medium- and low-resource language pairs
  • New fine-grained quality estimation shared task
  • Increased alignment with the Metrics shared task to facilitate cross-submissions and multi-task approaches
    • Shared translation sources
    • Shared annotation schema [MQM]
    • Shared submission platform [CODALAB]
  • Increased alignment with the APE shared task to facilitate cross-submissions and multi-task approaches
    • Shared sources - translations - post-edits (English-Marathi)
  • Incorporated critical error detection tasks
  • Incorporated zero-shot tasks

Language pairs covered

This year we will release test-sets on the following language-pairs:

  • English-German (En-De) [MQM]
  • Chinese-English (Zh-En) [MQM]
  • Hebrew-English (He-En) [MQM]
  • English-Marathi (En-Mr) [DA + post-edits]
  • English-Hindi (En-Hi) [DA + post-edits]
  • English-Tamil (En-Ta) [DA + post-edits]
  • English-Telugu (En-Te) [DA]
  • English-Gujarati (En-Gu) [DA]
  • English-Farsi (En-Fa) [post-edits]

Goals

In addition to generally advancing the state of the art in quality estimation, our specific goals are:

  • to extend the available public benchmark datasets with medium- and low-resource languages;
  • to investigate the potential of fine-grained quality estimation;
  • to investigate new multilingual and language-independent approaches, especially for zero-shot settings; and
  • to study the robustness of QE approaches.

For all tasks, the datasets and NMT models that generated the translations will be made publicly available.

Participants are also allowed to explore any additional data and resources deemed relevant. Below are the QE subtasks addressing these goals.

Subtasks:

  1. Quality estimation
  2. Fine-grained error span detection

Useful Software

Here is some open-source software for QE that might be useful to participants:

Organization

For questions regarding the organisation of the task and/or issues with submission to this CodaLab, please use the Forum.

  • Chrysoula Zerva (Instituto de Telecomunicações)
  • André Martins (Instituto de Telecomunicações, Unbabel)
  • Frédéric Blain (Tilburg University)
  • Ricardo Rei (INESC-ID, Unbabel)
  • José Souza (Unbabel)
  • Diptesh Kanojia (Surrey Institute for People-Centred AI, University of Surrey)
  • Constantin Orasan (University of Surrey)
  • Nuno Guerreiro (Instituto de Telecomunicações)
  • Fatemeh Azadi (University Of Tehran)

Evaluation

We will use the Matthews Correlation Coefficient (MCC) as the primary metric, and also compute the F1 score as a secondary metric.
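As an illustration, the sketch below computes these two metrics over word-level OK/BAD tags with scikit-learn; the gold and predicted tags are hypothetical, and this is not the official evaluation script.

```python
# Minimal sketch: computing MCC and F1 over word-level OK/BAD tags.
# The gold/pred lists are hypothetical examples, not task data.
from sklearn.metrics import matthews_corrcoef, f1_score

gold = ["OK", "OK", "BAD", "OK", "BAD"]   # hypothetical gold tags
pred = ["OK", "BAD", "BAD", "OK", "OK"]   # hypothetical system predictions

mcc = matthews_corrcoef(gold, pred)
f1_bad = f1_score(gold, pred, pos_label="BAD")  # F1 for the BAD class

print(f"MCC: {mcc:.3f}  F1-BAD: {f1_bad:.3f}")
```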

Submission Format

The competition will take place on CodaLab.

This year we will also require participants to fill in a form describing their model and data choices for each submission.

For each submission you wish to make (under “Participate>Submit” on CodaLab), please upload a single zip file with the predictions and the system metadata.

For the metadata, we expect a ‘metadata.txt’ file with exactly two non-empty lines: the team name and a short system description, respectively. The first line of metadata.txt must contain your team name; you can use your CodaLab username as your team name. The second line of metadata.txt must contain a short description (2-3 sentences) of the system you used to generate the results. This description will not be shown to other participants, but submissions without a description will be considered invalid. It is fine to use the same submission for multiple subtasks/phases if you use the same model (e.g. a multilingual or multitask model).
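For instance, a submission archive could be assembled as in the sketch below; the team name, description, and output file name ‘submission.zip’ are placeholders, not official values.

```python
# Minimal sketch: bundling metadata.txt and predictions.txt into a single
# zip file for upload. Team name and description are placeholder values.
import zipfile

team_name = "my_team"  # e.g. your CodaLab username
description = ("Transformer-based word-level QE system fine-tuned on the "
               "provided training data. Single model, no ensembling.")

with open("metadata.txt", "w", encoding="utf-8") as f:
    f.write(team_name + "\n" + description + "\n")

with zipfile.ZipFile("submission.zip", "w") as zf:
    zf.write("metadata.txt")
    zf.write("predictions.txt")  # written as described in the next section
```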

For the predictions we describe the exact format expected separately for each subtask:

Word-level

For the predictions, we expect a single TSV file named ‘predictions.txt’ for each submitted QE system output (submitted online in the respective CodaLab competition).

You can submit different systems for any of the MQM or post-edited language pairs independently. The output of your system should be the predicted word-level tags, formatted in the following way:

Line 1: <DISK FOOTPRINT (in bytes, without compression)>

Line 2: <NUMBER OF PARAMETERS>

Line 3: <NUMBER OF ENSEMBLED MODELS> (set to 1 if there is no ensemble)

Lines 4 to n, where n-3 is the total number of tokens (words) in the test samples: <LANGUAGE PAIR> <METHOD NAME> <TYPE> <SEGMENT NUMBER> <WORD INDEX> <WORD> <BINARY SCORE>

Where:

  • LANGUAGE PAIR is the ID (e.g., en-de) of the language pair.
  • METHOD NAME is the name of your quality estimation method.
  • TYPE should contain ‘MT’ for all segments.
  • SEGMENT NUMBER is the line number of the plain text translation file you are scoring (starting at 0).
  • WORD INDEX is the index of the word in the tokenised sentence, as given in the training/test sets (starting at 0). This will be the word index within the MT sentence.
  • WORD is the actual word or the <EOS> token.
  • BINARY SCORE is either ‘OK’ for no issue or ‘BAD’ for any issue.

Each field should be delimited by a single tab (<\t>) character.
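To make the layout concrete, the sketch below writes a ‘predictions.txt’ file in this format; the method name, model statistics, and example segment are hypothetical placeholders.

```python
# Minimal sketch: writing predictions.txt in the format described above.
# The method name, model statistics and example segment are hypothetical.

def write_predictions(path, lang_pair, method_name, segments,
                      disk_bytes, n_params, n_models=1):
    """segments: one list of (word, tag) pairs per MT segment, in test order."""
    with open(path, "w", encoding="utf-8") as f:
        # Three header lines: disk footprint, parameter count, ensemble size.
        f.write(f"{disk_bytes}\n{n_params}\n{n_models}\n")
        for seg_idx, seg in enumerate(segments):          # segment numbers start at 0
            for word_idx, (word, tag) in enumerate(seg):  # word indices start at 0
                fields = [lang_pair, method_name, "MT",
                          str(seg_idx), str(word_idx), word, tag]
                f.write("\t".join(fields) + "\n")

# Example with dummy data (one segment of three tokens):
write_predictions(
    "predictions.txt", "en-de", "my_qe_method",
    segments=[[("Das", "OK"), ("Haus", "OK"), ("<EOS>", "BAD")]],
    disk_bytes=1_200_000_000, n_params=300_000_000,
)
```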
