The official competition is now concluded. This page is no longer maintained. For any updates, visit our web page.
The overview paper describing the LeWiDi task is now available here!
The complete datasets, including the test set labels, have now been released and are available in the download section of Codalab or on our GitHub.
The official leaderboard is available here.
The video of the task presentation is now online.
Overview
In recent years, the assumption that natural language (NL) expressions have a single, clearly identifiable interpretation in a given context has increasingly been recognized as just a convenient idealization. The objective of the Learning with Disagreement shared task is to provide a unified testing framework for learning from disagreements, using datasets containing information about disagreements in interpreting language.
The focus of this task is entirely on subjective tasks, where training with aggregated labels makes much less sense. To this end, we collected a benchmark of four (textual) datasets with different characteristics in terms of genre (social media and conversations), language (English and Arabic), task (misogyny, hate speech, and offensiveness detection) and annotation methodology (experts, specific demographic groups, AMT crowd), but all providing a multiplicity of labels for each instance.
The Le-Wi-Di benchmark consists of four datasets that have been harmonized in their form of presentation through a common json format that emphasises their commonalities. The four datasets are:
- The "HS-Brexit" dataset (Ahktar et al., 2021): an entirely new dataset of tweets on Abusive Language on Brexit and annotated for hate speech (HS), aggressiveness and offensiveness by six annotators belonging to two distinct groups: a target group of three Muslim immigrants in the UK, and a control group of three other individuals.- The"ArMIS" dataset (Almanea et al., 2022): a dataset of Arabic tweets annotated for misogyny and sexism detection by annotators with different demographics characteristics ("Moderate Female", "Liberal Female" and "Conservative Male"). This dataset is new.
- The "ConvAbuse" dataset (Cercas Curry et al., 2021): a dataset of around 4,000 English dialogues conducted between users and two conversational agents. The user utterances have been annotated by at least three experts in gender studies using a hierarchical labelling scheme (following categories: Abuse binary, Abuse severity; Directedness; Target; Type).
- The "MultiDomain Agreement" dataset (Leonardelli et al. 2021): a dataset of around 10k English tweets from three domains (BLM, Election, Covid-19). Each tweet is annotated for offensiveness by 5 annotators via AMT. Particular focus was put on pre-selecting tweets to be annotated that are potentially leading to disagreement. Indeed, almost 1/3 of the dataset has then been annotated with a 2 vs 3 annotators disagreement, and another third of the dataset has an agreement of 1 vs 4. 🚨NEW You can find a video explaining this dataset here: video 🚨
We encourage participants to develop methods able to capture agreement and disagreement, rather than to focus on building the best model for a single dataset. This is why we distribute the data in a harmonized json format used across all datasets, which emphasizes their shared features while maintaining their diversity. In this way, the information common to all datasets is released and presented in a homogeneous format, making it easier for participants to test their methods across all the datasets.
Among the information common to all datasets, and of particular relevance for the task, are the disaggregated crowd annotations and the annotator references. Dataset-specific information is also released and varies for each dataset, ranging from annotators' demographic information (ArMIS and HS-Brexit), to other annotations made by the same annotators within the same dataset (all datasets), to additional annotations given for the same item by the same annotator (HS-Brexit and ConvAbuse). Participants can leverage this dataset-specific information to improve performance on a specific dataset.
For more information on the format of the data, see the section "Data format" and download a sample of the data from section "Get data sample".
Particular attention is paid to evaluation: since the focus is on subjective disagreement, the existence of a single ‘truth’ cannot be assumed. Submissions are evaluated using two metrics. 'Soft' evaluation is our primary form of evaluating performance, i.e., an evaluation that considers how well the model's probabilities reflect the level of agreement among annotators. In addition, to maintain a link with more traditional forms of evaluation, a 'hard' evaluation (F1) is also considered.
For more detailed information, see the "Evaluation" section and the material provided in the starting_kit.
+++ Please subscribe to our Google group to stay updated with news about the task +++
Submissions are evaluated using two metrics. Since the focus is on subjective disagreement, the existence of a single ‘truth’ cannot be assumed. For this reason, a 'soft' evaluation using cross-entropy is taken as the primary metric; a 'hard' evaluation using F1 is also performed.
1. 'Soft' evaluation: measures how well the model's probabilities reflect the level of agreement among annotators. It uses cross-entropy: a model that correctly predicts the distribution of labels produced by the crowd for each item will have a low cross-entropy.
2. 'Hard' evaluation: measures how well the submitted results align with the preferred (gold) interpretation. An item is considered correct if the model assigns the maximum probability to the preferred interpretation (if available). It is measured using micro-F1.
An ideal model would have a high micro-F1 and a low cross-entropy. However, for the scope of this competition, the primary evaluation is the soft one (cross-entropy).
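As an illustration only (the official scoring code ships with the starting_kit), a minimal sketch of the two metrics could look like the following; the function names, the class ordering ["0", "1"] and the use of scikit-learn are assumptions made for this example.

```python
# Minimal sketch of the two metrics, NOT the official scorer (see starting_kit).
# Assumes binary classes ordered as ["0", "1"].
import numpy as np
from sklearn.metrics import f1_score

def soft_score(pred_probs, soft_labels, eps=1e-12):
    """Mean cross-entropy between predicted probabilities and crowd soft labels."""
    pred = np.clip(np.asarray(pred_probs, dtype=float), eps, 1.0)
    true = np.asarray(soft_labels, dtype=float)
    return float(-(true * np.log(pred)).sum(axis=1).mean())

def hard_score(pred_probs, hard_labels):
    """Micro-F1 of the highest-probability class against the gold hard labels."""
    pred_classes = np.argmax(np.asarray(pred_probs, dtype=float), axis=1)
    return f1_score(hard_labels, pred_classes, average="micro")

# Toy usage with two items.
preds = [[0.9, 0.1], [0.3, 0.7]]   # model probabilities
soft  = [[1.0, 0.0], [0.4, 0.6]]   # annotators' label distributions
hard  = [0, 1]                     # gold (majority) hard labels
print(soft_score(preds, soft), hard_score(preds, hard))
```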
Notes:
In the starting_kit package, you can find the code used to calculate the scores and an example of the data format for submission [see more details in the "Data format and submission data format" section].
Baseline: our baseline (organizers_baseline in the Results tables) is a majority baseline, i.e., the scores obtained by assigning each item to the majority class (soft label always {"0": 1.0, "1": 0.0}, hard label always "0").
After submitting, you can view your scores (see the scores_correct.txt file) and, if you wish, display them in the "Results table" (note that there are two separate results tables, one for the practice phase and one for the evaluation phase).
For each submission, you can include as many datasets as you wish (one file per dataset). For the dataset(s) you don't submit, the scores of the majority baseline are assigned.
We will produce a ranking based on the average performance across datasets (if submissions for all datasets are not present, we assign our baseline score for the missing ones). We will also produce separate rankings for each dataset. You can re-order the rankings for each dataset in the Results table by clicking the arrows in the header.
1. Submissions valid for the competition must be made during the evaluation phase. You may submit a total of 5 submissions.
2. For each submission, you may submit results for any number of datasets (at least one dataset). Each dataset will have a separate leaderboard. A general leaderboard will also be created, showing the average results across datasets. If not all tasks are submitted, the baseline performance will be used for the average.
3. Use a single crowd-learning methodology or framework across all tasks and datasets. For example, if your framework is to aggregate the crowd labels using majority voting and train on the aggregated labels, use this same methodology for all of the tasks you participate in.
4. You must incorporate the crowd labels into your training framework.
5. You may not create multiple accounts or belong to several teams.
If you want to stay updated on the task or share information with the other participants, join our participants group.
For direct inquiries, please contact us at this address
Follow us on Twitter for any news related to learning with disagreements!
See also our webpage
All deadlines are 23:59 anywhere on earth (UTC-12).
Elisa Leonardelli, Fondazione Bruno Kessler (FBK), Italy
Gavin Abercrombie, Heriot-Watt University, United Kingdom
Dina Almanea, Queen Mary University of London, United Kingdom
Valerio Basile, Torino University, Italy
Tommaso Fornaciari, Bocconi University, Italy
Barbara Plank, IT University of Copenhagen, Denmark
Verena Rieser, Heriot-Watt University, United Kingdom
Massimo Poesio, Queen Mary University of London, United Kingdom
Alexandra Uma, Queen Mary University of London, United Kingdom
Lora Aroyo, Google
Jon Chamberlain, University of Essex, United Kingdom
Marie-Catherine de Marneffe, Ohio State University, USA
Dirk Hovy, Università Bocconi, Italy
Tristan Miller, Austrian Research Institute for Artificial Intelligence, Austria
Nafise Moosavi, University of Sheffield, United Kingdom
Silviu Paun, Amazon
Edwin Simpson, University of Bristol, United Kingdom
Sara Tonelli, Fondazione Bruno Kessler (FBK), Italy
Yannick Versley, Amazon.
We encourage participants to develop methods able to capture agreement and disagreement, rather than to focus on developing the best models. To facilitate testing methods across all the datasets, we developed a harmonized format that emphasizes the common features of the datasets while maintaining their diversity. Datasets are released in json format; an example for each dataset is shown below.
Each item of each dataset presents several fields that are common across datasets: unique_id, text, annotators, annotations, number of annotations, hard label, soft label, lang, split, and other info.
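As a rough illustration of how these fields might be accessed, here is a minimal Python sketch; the file name is a placeholder and the exact structure and key spellings should be checked against the released samples in "Get data sample".

```python
# Illustrative sketch only: load one harmonized json file and inspect the
# common fields listed above. The file name is a placeholder; check the
# released samples for the exact structure and key spellings.
import json

with open("dataset_train.json", encoding="utf-8") as f:   # placeholder path
    data = json.load(f)

item = next(iter(data.values()))     # assuming a dict of items keyed by unique_id
for key, value in item.items():      # text, annotators, annotations, soft/hard label, ...
    print(key, "->", value)
```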
In the field "other info", the information that is dataset-specific is contained and differs from dataset to dataset. Participants can leverage on this dataset-specific information to improve performance for a specific dataset. It contains:
Note that, although the fields are the same, they contain substantially different information across datasets:
Note that within each dataset, annotators have been assigned an identifier starting from one (Ann1, Ann2, etc.). While within a dataset the same annotator ID refers to the same annotator, no annotators are shared between datasets.
Note that the HS-Brexit, ArMIS and ConvAbuse datasets have been annotated by a small crowd of annotators (at most 8 different annotators), whereas the MD-Agreement dataset has been annotated by a crowd of more than 800 annotators (via AMT).
Note that for some datasets (ArMIS and ConvAbuse), the number of annotations collected for each item is not always odd, so a majority is not always possible (e.g., both soft label "0" and soft label "1" are 0.5). In these (few) cases, the hard label has been assigned randomly, as sketched below.
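A minimal sketch of this tie-breaking rule (illustrative only; binary labels and the helper name are assumptions, not the organizers' actual code):

```python
# Illustrative sketch: derive a hard label from the disaggregated annotations,
# breaking ties at random when no majority exists (possible in ArMIS and
# ConvAbuse, where the number of annotations per item can be even).
import random
from collections import Counter

def hard_label_from_annotations(annotations):
    ranked = Counter(annotations).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return random.choice([ranked[0][0], ranked[1][0]])   # tie: assign at random
    return ranked[0][0]

print(hard_label_from_annotations(["0", "1", "1", "0"]))  # tie -> "0" or "1" at random
print(hard_label_from_annotations(["0", "1", "1"]))       # majority -> "1"
```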
Download | Size (MB) | Phase |
---|---|---|
Public Data | 0.007 | #0 Pre-practice Phase |
Starting Kit | 0.003 | #1 Practice Phase |
Public Data | 1.113 | #1 Practice Phase |
Public Data | 0.340 | #2 Evaluation Phase |
Public Data | 1.483 | #3 Post-competition |
The datasets:
Akhtar, S., Basile, V., & Patti, V. (2021). Whose opinions matter? perspective-aware models to identify opinions of hate speech victims in abusive language detection. arXiv preprint arXiv:2106.15896.
Almanea, D., & Poesio, M. ArMIS-The Arabic Misogyny and Sexism Corpus with Annotator Subjective Disagreements.
Leonardelli, E., Menini, S., Aprosio, A. P., Guerini, M., & Tonelli, S. (2021). Agreeing to Disagree: Annotating Offensive Language Datasets with Annotators' Disagreement. arXiv preprint arXiv:2109.13563.
Curry, A. C., Abercrombie, G., & Rieser, V. (2021). ConvAbuse: Data, analysis, and benchmarks for nuanced abuse detection in conversational AI. arXiv preprint arXiv:2109.09483.
Start: July 13, 2022, midnight
Description: In this initial phase, you are provided with a sample of the data for each dataset (see 'Get data sample' in the Overview section). Submissions in Codalab are not allowed. If you intend to participate, please also join our Google group to stay updated with any news about the task.
Start: Sept. 13, 2022, 7 a.m.
Description: This is the practice phase of the competition. You can participate via Codalab and access the "Participate" and "Results" sections. In this phase, you are provided with the training and validation parts of the datasets (though labels for validation will be provided in the next phase). You are expected to craft novel approaches for training the models using the crowd labels. Submissions are unlimited and are evaluated on the validation part (cross-entropy and micro-F1). The leaderboard is public, allowing participants to see how their model performs and compare it with others. You can find useful information about submissions in the "Data format and submission data format" section, and useful snippets of code (how to load data and evaluate data) and an example of the submission files in the starting_kit.
Start: Jan. 10, 2023, midnight
Description: In this phase of the competition, the (unlabelled) test data is released. Participants are expected to use the models trained in the practice phase to make predictions on the test data and submit them. The number of submissions for this phase is limited to 5 per participant, to prevent participants from fine-tuning their models on the test data. However, in this phase we release the labels for the development set to facilitate quick offline testing and refining of models. You can learn more about evaluation in the Evaluation section (Learn the Details tab). Please note that in the leaderboard, the "entries" number may differ from the number of valid submissions (at most 5), because it also counts failed ones.
Start: Jan. 31, 2023, 11:59 p.m.
Description: The official competition ends with the evaluation phase. However, the post-competition phase allows you to continue to refine and test your models.
End: Never
# | Username | Score |
---|---|---|
1 | guneetsk99 | 5.38 |
2 | nasheedyasin | 3.09 |