SemEval 2024 Task 2: Safe Biomedical Natural Language Inference for Clinical Trials
Part of the 18th International Workshop on Semantic Evaluation.
Large Language Models (LLMs) achieve state-of-the-art performance on many NLP tasks (Brown et al., 2020; Chowdhery et al., 2022). However, they are heavily susceptible to shortcut learning (Geirhos et al., 2020; Poliak et al., 2018; Tsuchiya, 2018), factual inconsistency (Elazar et al., 2021), and performance degradation when exposed to word distribution shifts (Miller et al., 2020; Lee et al., 2020), data transformations (Xing et al., 2020; Stolfo et al., 2022), and adversarial examples (Li et al., 2020). Crucially, these limitations can lead to an overestimation of real-world performance (Patel et al., 2008; Recht et al., 2019) and are therefore of particular concern in the context of medical applications.
Given the increasing deployment of LLMs in real-world scenarios, we propose a textual entailment task to advance our understanding of models' behaviour and to improve existing evaluation methodologies for clinical Natural Language Inference (NLI). Through the systematic application of controlled interventions, each engineered to investigate a specific semantic phenomenon involved in natural language and numerical inference, we investigate the robustness, consistency, and faithfulness of the reasoning performed by LLMs in the clinical setting.
Clinical trials are conducted to assess the effectiveness and safety of new treatments and are crucial for the advancement of experimental medicine (Avis et al., 2006). Clinical Trial Reports (CTRs) outline the methodology and findings of a clinical trial. Healthcare professionals use CTRs to design and prescribe experimental treatments. However, with over 400,000 CTRs available, and more being published at an increasing pace (Bastian et al., 2010), it is impractical to conduct a thorough assessment of all relevant literature when designing treatments (DeYoung et al., 2020). NLI (Bowman et al., 2015) presents a possible solution, enabling the large-scale interpretation and retrieval of medical evidence and connecting the most recent findings to facilitate personalized care (Sutton et al., 2020). To advance research at the intersection of NLI and Clinical Trials (NLI4CT), we organised "SemEval-2023 Task 7: Multi-Evidence Natural Language Inference for Clinical Trial Data" (Jullien et al., 2023); see SemEval 2023. Task 7 comprised an entailment subtask and an evidence selection subtask, which received 643 submissions from 40 participants and 364 submissions from 23 participants, respectively. While the previous iteration of NLI4CT led to the development of LLM-based models (Zhou et al., 2023; Kanakarajan and Sankarasubbu, 2023; Vladika and Matthes, 2023) achieving high performance (i.e., F1 score ≈ 85%), the application of LLMs in critical domains, such as real-world clinical trials, requires further investigation accompanied by the development of novel evaluation methodologies grounded in more systematic behavioural and causal analyses (Wang et al., 2021).
This second iteration is intended to ground NLI4CT in interventional and causal analyses of NLI models (Yu et al., 2022), enriching the original NLI4CT dataset with a novel contrast set, developed by applying a set of interventions to the statements in the NLI4CT test set.
Thanks to the explicit causal relation between the designed interventions and the expected labels, the proposed methodology allows us to explore the following research aims through a causal lens:
• RA1: "To investigate the consistency of NLI models in their representation of semantic phenomena necessary for complex inference in clinical NLI settings"
• RA2: "To investigate the ability of clinical NLI models to perform faithful reasoning, i.e., to make correct predictions for the correct reasons."
This task is based on a collection of breast cancer CTRs (extracted from https://clinicaltrials.gov/ct2/home), statements, explanations, and labels annotated by domain expert annotators.
Task: Textual Entailment
For the purpose of the task, we have summarised the collected CTRs into 4 sections:
Eligibility criteria - A set of conditions for patients to be allowed to take part in the clinical trial.
Intervention - Information concerning the type, dosage, frequency, and duration of treatments being studied.
Results - Number of participants in the trial, outcome measures, units, and the results.
Adverse events - Signs and symptoms observed in patients during the clinical trial.
The annotated statements are sentences with an average length of 19.5 tokens that make some type of claim about the information contained in one of the sections of the CTR premise. A statement may make claims about a single CTR or compare two CTRs. The task is to determine the inference relation (entailment vs contradiction) between CTR - statement pairs. The training set we provide is identical to the training set used in our previous task; however, we have performed a variety of interventions on the test set and development set statements, either preserving or inverting the entailment relations. We will not disclose the technical details of the interventions, to guarantee fair competition and to encourage approaches that are genuinely robust rather than designed to tackle these specific interventions. The technical details will be made publicly available after the evaluation phase and in our task description paper.
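As an illustration of how a premise can be assembled from the released files, the sketch below pairs each statement with the text of the relevant CTR section(s). It assumes the file layout and field names used by the NLI4CT release and Starter Kit (e.g. "Type", "Section_id", "Primary_id", "Secondary_id", "Statement", "Label", and a "CT json" folder of one JSON file per CTR); treat these names as assumptions and consult the Starter Kit below for the authoritative format.

import json
from pathlib import Path

DATA_DIR = Path("training_data")   # assumed location of the unpacked training_data.zip
CTR_DIR = DATA_DIR / "CT json"     # assumed folder containing one JSON file per CTR

def load_ctr_section(ctr_id: str, section: str) -> list[str]:
    # Each CTR file is assumed to map section names to lists of sentences.
    with open(CTR_DIR / f"{ctr_id}.json") as f:
        return json.load(f)[section]

def build_premise(instance: dict) -> str:
    # Concatenate the relevant section of the primary CTR and, for comparison
    # statements, the same section of the secondary CTR.
    section = instance["Section_id"]
    premise = " ".join(load_ctr_section(instance["Primary_id"], section))
    if instance["Type"] == "Comparison":
        premise += " " + " ".join(load_ctr_section(instance["Secondary_id"], section))
    return premise

with open(DATA_DIR / "train.json") as f:
    train = json.load(f)   # assumed mapping: statement UUID -> instance dictionary

pairs = [(build_premise(inst), inst["Statement"], inst.get("Label"))
         for inst in train.values()]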
Intervention targets
Numerical - LLMs still struggle to consistently apply numerical and quantitative reasoning. As NLI4CT requires this type of inference, we will specifically target the models' numerical and quantitative reasoning abilities.
Vocabulary and syntax - Acronyms and aliases are significantly more prevalent in clinical texts than general domain texts, and disrupt the performance of clinical NLI models. Additionally, models may experience shortcut learning, relying on syntactic patterns for inference. We target these concepts and patterns with an intervention.
Semantics - LLMs struggle with complex reasoning tasks when applied to longer premise-hypothesis pairs. We also intervene on the statements to exploit this.
Notes - The specific type of intervention performed on a statement will not be available at test or training time.
Evaluation
The evaluation of performance on this task will involve several steps. First, we will assess performance on the original NLI4CT statements without any interventions. This assessment will be based on Macro F1-score.
Next, we will evaluate performance on the contrast set, which includes all statements with interventions. For this evaluation, we will use two new metrics: faithfulness and consistency, which are defined below. The overall ranking of a system will be determined by calculating the average faithfulness and consistency scores across all intervention types.
Faithfulness is a measure of the extent to which a given system arrives at the correct prediction for the correct reason. Intuitively, this is estimated by measuring the ability of a model to correctly change its predictions when exposed to a semantic-altering intervention. Given the N statements x_i in the contrast set (C) whose intervention inverts the gold label, their respective original statements y_i, and model predictions f() (encoded as Entailment = 1, Contradiction = 0), faithfulness is computed over the statements whose original was predicted correctly, i.e., f(y_i) = Label(y_i):

Faithfulness = (1/N) Σ_i |f(x_i) − f(y_i)|

Consistency is a measure of the ability of a system to arrive at the same conclusion for semantically equivalent problems. It does not require a correct prediction, only an output for the contrast statement that matches the output for its original. Given the M statements x_i in the contrast set whose intervention preserves the gold label, and their respective original statements y_i:

Consistency = (1/M) Σ_i (1 − |f(x_i) − f(y_i)|)
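For illustration only, the snippet below sketches how these metrics could be computed, assuming the binary encoding above and a mapping between each contrast statement and its original statement (a mapping that is not released to participants). The field names "uuid", "orig_uuid", and "orig_label" are purely illustrative; the official scoring script may differ.

from sklearn.metrics import f1_score

def encode(label: str) -> int:
    # Binary encoding assumed throughout: Entailment = 1, Contradiction = 0.
    return 1 if label == "Entailment" else 0

def macro_f1(gold: list[str], pred: list[str]) -> float:
    # Macro F1-score on the original, non-intervened statements.
    return f1_score([encode(g) for g in gold], [encode(p) for p in pred], average="macro")

def faithfulness(altering: list[dict], f: dict) -> float:
    # 'altering' holds contrast statements whose intervention inverts the label;
    # only cases where the original statement was predicted correctly are scored.
    scores = [abs(encode(f[c["uuid"]]) - encode(f[c["orig_uuid"]]))
              for c in altering if f[c["orig_uuid"]] == c["orig_label"]]
    return sum(scores) / len(scores) if scores else 0.0

def consistency(preserving: list[dict], f: dict) -> float:
    # 'preserving' holds contrast statements whose intervention keeps the label;
    # the prediction should match the prediction for the original statement.
    scores = [1 - abs(encode(f[c["uuid"]]) - encode(f[c["orig_uuid"]]))
              for c in preserving]
    return sum(scores) / len(scores) if scores else 0.0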
During the 'practice' phase, the prediction files submitted by participants to the task will be evaluated against the gold practice test set.
During the 'evaluation' phase, the prediction files submitted by participants to the task will be evaluated against the gold test set.
By submitting results to this competition, you consent to the public release of your scores at the SemEval 2024 workshop and in the associated proceedings, at the task organizers' discretion. Scores may include, but are not limited to, automatic and manual quantitative judgements, qualitative judgements, and such other metrics as the task organizers see fit. You accept that the ultimate decision of metric choice and score value is that of the task organizers.
You further agree that the task organizers are under no obligation to release scores and that scores may be withheld if it is the task organizers' judgement that the submission was incomplete, erroneous, deceptive, or violated the letter or spirit of the competition's rules. Inclusion of a submission's scores is not an endorsement of a team or individual's submission, system, or science.
You further agree that your system may be named according to the team name provided at the time of submission, or to a suitable shorthand as determined by the task organizers.
You agree not to use or redistribute the shared task data except in the manner prescribed by its licence.
To encourage transparency and replicability, all teams must publish their code, tuning procedures, and instructions for running their models with their submission of shared task papers.
training_data.zip contains the training set, development set, and CTR files (including test set CTRs).
practice_test.json contains the practice test set; predictions for this set should be submitted during the practice phase.
test.json contains the test set; predictions for this set should be submitted during the evaluation phase of SemEval 2024 Task 2. Coming on 10 January 2024!
We provide a Starter Kit on our GitHub Repo. The Starter Kit can be used to create a baseline system for the task and output the results in the required submission format. Note that this is the same starter kit as for our previous task, SemEval-2023 Task 7.
Task: starter_script.ipynb
Mael Jullien - University of Manchester
Marco Valentino - University of Manchester, Idiap Research Institute
Andre Freitas - University of Manchester, Idiap Research Institute
A sample submission file is available on GitHub.
Submissions can be made by going to the "Participate" tab, and clicking the "Submit/View Results" tab, then selecting the task you want to submit to.
The scoring script takes a single prediction file as input, which MUST be a .json file named "results.json", structured as follows:
{{"UUID": {
"Prediction": "Predicted label"
]
}
}
Note: The name of the zip file can be in any format, but the name of the actual 'json' file must be in the format above.
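As a minimal sketch, the snippet below writes a predictions dictionary to a correctly named "results.json" and packages it into a zip archive; the predictions and the archive name ("submission.zip") are placeholders.

import json
import zipfile

# Placeholder predictions produced by your system: statement UUID -> predicted label.
predictions = {"some-statement-uuid": "Entailment"}

# The inner file must be named exactly "results.json"; the zip file name is free.
with open("results.json", "w") as f:
    json.dump({uuid: {"Prediction": label} for uuid, label in predictions.items()}, f, indent=2)

with zipfile.ZipFile("submission.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write("results.json")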
Start: Oct. 20, 2023, midnight
Description: The practice task is to determine the inference relation (entailment vs contradiction) between CTR - statement pairs. Training data is released for this phase. Train and evaluate your model using train.json, and submit your results on the practice_test.json set.
Start: Jan. 10, 2024, midnight
Description: The task is to determine the inference relation (entailment vs contradiction) between CTR - statement pairs. Train and evaluate your model using the train.json and dev.json. Submit your results on the test.json set.
Start: Feb. 1, 2024, midnight
Description: The task is to determine the inference relation (entailment vs contradiction) between CTR - statement pairs. Train and evaluate your model using the train.json and dev.json. Submit your results on the test.json set.
End: Oct. 31, 2024, 7 p.m.