ClinSpEn is part of the Biomedical WMT 2022 shared task. It aims to promote the development and evaluation of machine translation systems adapted to the medical domain through three highly relevant sub-tracks: clinical cases, medical controlled vocabularies/ontologies, and clinical terms and entities extracted from medical content.
- ClinSpEn sub-track website: https://temu.bsc.es/clinspen/
- Biomedical WMT 2022 website: https://statmt.org/wmt22/biomedical-translation-task.html
Machine translation applied to the clinical domain is an especially challenging task due to the complexity of medical language and the heavy use of health-related technical terms and medical expressions. For this reason, there is a large community of specialized medical translators who are able to deal with medical narratives, terminologies, and ambiguous abbreviations and acronyms.
Given the relevance, impact and diversity of health-related content, as well as the rapidly growing number of publications, EHRs, clinical trials, informed consent documents and medical terminologies, there is a pressing need for more robust medical machine translation resources together with independent quality evaluation scenarios.
Recent advances in machine translation technologies, together with the use of other NLP components, are showing promising results, so domain adaptation of MT approaches can have a significant impact in unlocking key information from medical content.
The ClinSpEn data covers three types of data that are highly relevant to the biomedical domain: clinical cases, clinical terminology and ontology concepts.
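ClinSpEn comprises three sub-tracks:
- Sub-track 1 (Clinical Cases): EN -> ES translation of COVID-19 clinical case reports.
- Sub-track 2 (Clinical Terms): ES -> EN translation of clinical terms obtained from biomedical literature and electronic health records.
- Sub-track 3 (Ontology Concepts): EN -> ES translation of concepts obtained from different biomedical ontologies.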
The sample, test and background data for each sub-track can be found under the Participate tab.
Participant systems are evaluated on each sub-track individually using five metrics: COMET, METEOR, SacreBLEU, BLEU and ROUGE. SacreBLEU is the main metric. Participants may upload up to 7 predictions per sub-track.
The evaluation script for these metrics was kindly shared by the organizers of MedMTEval, a competition focused on the automatic translation of medical texts from Russian to English that is part of the AINL 2022 conference. For more information, please see their article: E. Ezhergina, M. Fedorova, V. Malykh, and D. Petrova. "Findings of Biomedical Russian-English MT Competition." To appear in AINL 2022 Proceedings. Special thanks to Tom Kocmi, one of the developers of the OCELOT evaluation tool, for his support.
All submissions must be uploaded as a ZIP file containing a TSV file. Depending on the sub-track, the TSV file must include the following columns:
- Sub-track 1 (Clinical Cases): document number, line number, predicted translation.
- Sub-track 2 (Clinical Terms): term number, predicted translation.
- Sub-track 3 (Ontology Concepts): concept number, predicted translation.
Column headers are optional.
Please check the submission instructions document.
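As an illustration of the submission format described above, the following is a minimal sketch that writes predictions to a TSV file and wraps it in a ZIP archive. The function name `build_submission`, the file name `predictions.tsv` and the example predictions are assumptions made for this sketch, not names required by the organizers. The submission instructions document remains the authoritative reference.

```python
import io
import zipfile

# Number and order of columns per sub-track, as listed above.
COLUMNS = {
    1: ("document number", "line number", "predicted translation"),
    2: ("term number", "predicted translation"),
    3: ("concept number", "predicted translation"),
}


def build_submission(subtrack, rows, zip_target, tsv_name="predictions.tsv"):
    """Write rows as TSV into a ZIP archive and return the TSV text.

    zip_target may be a file path or a writable binary file object.
    tsv_name is a hypothetical file name chosen for this sketch.
    """
    expected = len(COLUMNS[subtrack])
    lines = []
    for row in rows:
        if len(row) != expected:
            raise ValueError(
                f"sub-track {subtrack} expects {expected} columns, got {len(row)}"
            )
        fields = [str(field) for field in row]
        # A tab or newline inside a field would break the column layout.
        if any("\t" in field or "\n" in field for field in fields):
            raise ValueError("fields must not contain tabs or newlines")
        lines.append("\t".join(fields))
    tsv_text = "\n".join(lines) + "\n"
    with zipfile.ZipFile(zip_target, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(tsv_name, tsv_text.encode("utf-8"))
    return tsv_text


# Example: two made-up Sub-track 2 predictions, archived in memory.
archive_bytes = io.BytesIO()
tsv = build_submission(
    2, [(1, "heart failure"), (2, "chest pain")], archive_bytes
)
```

To produce a file for upload, a path such as `"submission.zip"` can be passed in place of the in-memory buffer. No header row is written here, since headers are optional.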
Start: Aug. 1, 2022, midnight
Description: EN -> ES translation of clinical cases, using a collection of COVID-19 clinical case reports, plus a Background Set. The leaderboard will be shared with the participants on the day of the conference.
Start: Aug. 1, 2022, midnight
Description: ES -> EN translation of clinical terminology, using a collection of parallel terms obtained from biomedical literature and electronic health records, plus a Background Set. The leaderboard will be shared with the participants on the day of the conference.
Start: Aug. 1, 2022, midnight
Description: EN -> ES translation of a collection of parallel concepts obtained from different biomedical ontologies, plus a Background Set. The leaderboard will be shared with the participants on the day of the conference.
Sept. 1, 2023, 11 p.m.