SHROOM - a Shared-task on Hallucinations and Related Observable Overgeneration Mistakes

Organized by tmickus

Phases

  • Evaluation (previous phase): starts Jan. 10, 2024, midnight UTC
  • Post-Evaluation (current phase): starts Feb. 1, 2024, 11:59 a.m. UTC
  • Competition ends: never

Welcome to the CodaLab website for SemEval-2024 Task-6 (SHROOM), a Shared-task on Hallucinations and Related Observable Overgeneration Mistakes!

Gentle reminder: The evaluation phase ends on Jan 31st UTC-12!

Task description

SHROOM participants will need to detect grammatically sound output that contains incorrect semantic information (i.e. unsupported or inconsistent with the source input), with or without having access to the model that produced the output.

Overview of the task

The modern NLG landscape is plagued by two interlinked problems: on the one hand, our current neural models have a propensity to produce inaccurate but fluent outputs; on the other hand, our metrics are most apt at describing fluency, rather than correctness. This leads neural networks to “hallucinate”, i.e., produce fluent but incorrect outputs that we currently struggle to detect automatically. For many NLG applications, however, the correctness of an output is mission-critical. For instance, producing a plausible-sounding translation that is inconsistent with the source text jeopardizes the usefulness of a machine translation pipeline. With our shared task, we hope to foster the growing interest in this topic in the community.

Participants will be asked to perform binary classification to identify cases of fluent overgeneration hallucinations in two different setups: model-aware and model-agnostic tracks.
Simply put, participants must detect grammatically sound outputs which contain incorrect or unsupported semantic information, inconsistent with the source input, with or without having access to the model that produced the output.

To that end, we will provide participants with a collection of checkpoints, inputs, references and outputs of systems covering three different NLG tasks: definition modeling (DM), machine translation (MT) and paraphrase generation (PG), trained with varying degrees of accuracy. The development and test sets will provide binary annotations from at least five different annotators and a majority vote gold label.
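
As an illustration of how per-item annotations might be aggregated, here is a minimal Python sketch that turns the binary judgments of several annotators into a majority vote label and the proportion of annotators who flagged the item. The function name, the boolean encoding of votes and the handling of ties are assumptions made for this example; they are not taken from the official data format.

```python
def aggregate_annotations(votes):
    """Return (gold_label, proportion) for a single item.

    `votes` is a list of booleans, one per annotator. True means the
    annotator marked the output as a hallucination (hypothetical encoding).
    """
    if not votes:
        raise ValueError("at least one annotation is required")
    positive = sum(1 for vote in votes if vote)
    proportion = positive / len(votes)
    # Strict majority. With an even number of annotators a tie is possible;
    # this sketch resolves ties to False, which is an assumption.
    gold_label = positive * 2 > len(votes)
    return gold_label, proportion


if __name__ == "__main__":
    # Three of five annotators flag the output.
    print(aggregate_annotations([True, True, False, True, False]))
```

The proportion returned here is the quantity that the second evaluation criterion correlates system probabilities against.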

Getting started

We maintain up-to-date information for this shared task on the task website, which also contains a link to the data (trial, validation and unlabeled train sets are already available).

Evaluation protocol

Submissions will be divided into two tracks: a model-aware track, where we provide a checkpoint to a model publicly available on HuggingFace for every datapoint considered, and a model-agnostic track where we do not. We highly encourage participants to make use of model checkpoints in creative ways.
For both tracks, all participants' submissions will be evaluated using two criteria:

  1. the accuracy that the system reached on the binary classification; and
  2. the Spearman correlation of the systems' output probabilities with the proportion of the annotators marking the item as overgenerating
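
The two criteria above can be sketched in plain Python as follows. This is not the official evaluation script: the function names are hypothetical, and Spearman correlation is computed here as the Pearson correlation of average ranks, which is the standard definition when ties are present.

```python
def accuracy(gold_labels, predicted_labels):
    """Fraction of items whose predicted binary label matches the gold label."""
    if len(gold_labels) != len(predicted_labels):
        raise ValueError("inputs must have the same length")
    correct = sum(g == p for g, p in zip(gold_labels, predicted_labels))
    return correct / len(gold_labels)


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; returns NaN if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / (var_x * var_y) ** 0.5


def spearman(system_probs, annotator_proportions):
    """Spearman correlation between system probabilities and the
    proportion of annotators marking each item as overgenerating."""
    if len(system_probs) != len(annotator_proportions):
        raise ValueError("inputs must have the same length")
    return pearson(average_ranks(system_probs),
                   average_ranks(annotator_proportions))
```

Because Spearman correlation depends only on ranks, a system is rewarded for ordering items the way annotators do, regardless of how well calibrated its probabilities are.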

Shared task timeline

Here are the key dates participants should keep in mind. These dates are subject to change.

  • September 11, 2023: Development data made available
  • September 22, 2023: Unlabelled training data made available
  • January 10, 2024: Evaluation data made available & evaluation start
  • January 31, 2024: Evaluation end
  • February 19, 2024: Paper submission due
  • March 18, 2024: Notification to authors
  • April 1, 2024: Camera-ready version due

Camera-ready due date and SemEval 2024 workshops will be announced at a later date. Further information is also available on the SemEval 2024 website.

Get in touch!

There’s a Google group for all prospective participants: check it out at semeval-2024-task-6-shroom@googlegroups.com. You can also reach out to us on Twitter @shroom2024.

Evaluation protocol

The evaluation script will be made available shortly, along with baseline systems and format checkers.

Terms and Conditions

Participants should generally adopt a spirit of good sportsmanship and avoid any unfair or otherwise unconscionable conduct. We provide the following terms and conditions to clearly delineate the guidelines to which the participants are expected to adhere. Organizers reserve the right to amend in any way the following terms, in which case modifications will be advertised through the shared task mailing list and the CodaLab forums.
Participants may contact the organizers if any of the following terms raises a concern.

Participation in the competition: Any interested person may freely participate in the competition. By participating in the competition, you agree to the terms and conditions in their entirety, without amendment or provision. By participating in the competition, you understand and agree that your scores and submissions will be made public.
Scores and submissions are understood as any direct or indirect contributions to this site or the shared task organizers, such as, but not limited to: results of automatic scoring programs; manual, qualitative and quantitative assessments of the data submitted; etc.
Participants may create teams. Participants may not be part of more than one team. Teams and participants not belonging to any team must create exactly one account on the CodaLab competition. Team composition may not be changed once the evaluation phase starts.

Scoring of submissions: Organizers are under no obligation to release scores. Official scores may be withheld, amended or removed if organizers judge the submission incomplete, erroneous, deceptive, or violating the letter or spirit of the competition's rules. Inclusion of a submission's scores is not an endorsement of a team or individual's submission, system, or science.
Submission files will be grouped according to the track. The two tracks have separate leaderboards. There will therefore be two separate rankings, one for each track. Submissions will be ranked by accuracy, with ties (if any) broken by correlation scores. Participants/teams will be ranked based on their highest performing submission.
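
The ranking rule for one track can be sketched as below. This is only an illustration: the dictionary keys (`team`, `accuracy`, `correlation`) are hypothetical, and treating "highest performing" as the best (accuracy, correlation) pair is an assumption about how the rule is applied.

```python
def rank_teams(submissions):
    """Rank teams within one track.

    `submissions` is a list of dicts with hypothetical keys "team",
    "accuracy" and "correlation". Each team is represented by its best
    submission; teams are ordered by accuracy, with ties broken by
    correlation.
    """
    def score(sub):
        return (sub["accuracy"], sub["correlation"])

    best_by_team = {}
    for sub in submissions:
        current = best_by_team.get(sub["team"])
        if current is None or score(sub) > score(current):
            best_by_team[sub["team"]] = sub
    # Highest accuracy first; correlation decides between equal accuracies.
    return sorted(best_by_team.values(), key=score, reverse=True)
```

Each track would be ranked separately by calling the function once per track with that track's submissions.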

Data usage: The provided data should be used responsibly and ethically. Do not attempt to misuse it in any way, including, but not limited to, reconstructing test sets, any non-scientific use of the data, or any other unconscionable usage of the data.

Submission of system description papers: Participants having made at least one submission during the evaluation phase will be invited to submit a paper describing their system. We highly encourage authors to include a link to the code of the systems being described, so that it is available to organizers or the public at large. Participants submitting a system description paper will also be asked to review papers submitted by their peers in a single-blind process.
We further encourage system description papers to include a manual analysis of their systems' results and productions. The presence and quality of such an analysis will be assessed during the review process. The task description paper will also devote a significant amount of space to highlighting outstanding manual evaluations conducted by participants.

# Username Score
1 ahoblitz 0.84733
2 zackchen 0.83600
3 liuwei 0.83067