31.07.2023 UPDATE: We thank all the participants for their submissions and are looking forward to learning more about the systems! Given the previous deadline extension, we have decided to move the system report submission deadline by one week. We also release detailed reporting guidelines, which can be found here.
20.07.2023 UPDATE: Several participants had trouble with their submissions due to confusion about the total vs. daily number of submissions allowed by CodaLab. We have decided to increase the total number of submissions to FIVE for all phases and all participants, effective immediately. We hope this allows you to get your best model through. Good luck!
30.06.2023 UPDATE: As the final phase of the competition begins, we release a new test input bundle that includes the previous test data plus inputs from a new, "secret" domain. The data can be found under Participate > Files (purple button). Your submission to the Final phase should consist of your system's predictions for these new test inputs.
In this shared task, we invite the community to explore cross-domain, low-resource processing of peer reviews, using the recently proposed pragmatic tagging task as the objective and the recently introduced open multi-domain corpus of peer reviews as a rich auxiliary data source. This page provides a general overview of the competition. The shared task takes place as part of the 10th Workshop on Argument Mining at EMNLP-2023. In addition to entering the competition via CodaLab, please >REGISTER< for the task by filling out an external form.
Peer review is a key element of the scientific process. It is challenging, and could greatly benefit from automation and assistance. At the core of peer review lie review reports -- short argumentative texts in which reviewers evaluate papers and make revision suggestions. Automatic analysis of argumentation in peer reviews has numerous applications, from facilitating meta-scientific analysis of reviewing practices, to aggregating information from multiple reviews, to assisting less experienced reviewers. Yet several challenges remain open. Peer reviews are scientific texts, and models pre-trained on general-domain data might suffer from domain shift. The performance of the latest generation of LLMs in this domain remains unknown. Reviewing practices and criteria vary across research communities and publication venues, posing an additional challenge to generalization. Finally, reviewing data is scarce and expensive to label. This competition aims to explore solutions to these challenges.
All deadlines are midnight UTC+0.
| Date | Milestone |
| --- | --- |
| May 17th, 2023 | Registration start. Training, test and auxiliary data release. |
| June 30th, 2023 | Registration end. Secret set release. Final evaluation starts. |
| July 21st, 2023 | Final evaluation output submission due. End of competition. |
| August 18th, 2023 (extended from August 11th) | System report submission due. |
| September 18th, 2023 | System report camera-ready due. |
| (October 18th, 2023) | (Shared task paper due) |
| December 6-7th, 2023 | Workshop |
Peer review is used across all fields of research, from scientific policy studies to software development. Topics, values and publication standards are unique to each research community. Yet the goals of reviewers' argumentation are similar: to summarize and evaluate the work, and to help the authors improve their manuscript. One way to capture this regularity and enable the study of peer review across communities is pragmatic tagging: a coarse sentence labeling task in which each peer review sentence is assigned one of a small set of pragmatic categories.
Prior work has shown that this simple sentence-level schema applies well across different research fields and communities, and yields good inter-annotator agreement (Krippendorff's alpha of roughly 0.70 and above) upon human annotation. The main sources of disagreement ("hard cases") are the coarse granularity of the schema (necessary for generalization), the sentence-level unit of analysis (necessary to avoid discrepancies due to differences in sub-sentence splitting), and the natural ambiguity of the classes (e.g. Weakness vs. Todo).
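For reference, here is a minimal sketch of how such sentence-level agreement could be computed with the `krippendorff` Python package. The annotations and label names below are illustrative only and are not taken from the task data.

```python
# Illustrative agreement check: Krippendorff's alpha over sentence-level
# pragmatic tags from two hypothetical annotators (not the official task data).
import numpy as np
import krippendorff

annotator_a = ["Weakness", "Todo", "Weakness", "Todo", "Weakness"]
annotator_b = ["Weakness", "Weakness", "Weakness", "Todo", "Weakness"]

# Map string labels to integer codes; rows = annotators, columns = sentences.
labels = sorted(set(annotator_a) | set(annotator_b))
to_id = {label: i for i, label in enumerate(labels)}
reliability_data = np.array(
    [[to_id[tag] for tag in annotator_a], [to_id[tag] for tag in annotator_b]],
    dtype=float,
)

alpha = krippendorff.alpha(reliability_data=reliability_data,
                           level_of_measurement="nominal")
print(f"Krippendorff's alpha: {alpha:.2f}")
```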
Participants of the shared task are invited to explore cross-domain pragmatic tagging in the low-resource scenarios detailed on the Evaluation page, according to the Rules. We provide two types of data to support the task: task data contains pragmatic tag annotations across multiple domains, while auxiliary data is an additional source for pre-training and fine-tuning models. The Data page outlines the details. The competition consists of five phases. Phase 1 serves as a sandbox for troubleshooting submissions, with an unlimited number of submissions per team. Phases 2-4 run simultaneously and accommodate submissions in three experimental conditions: no-data, low-data and full-data. Phase 5 is the final evaluation phase, where participants submit the outputs of models created in any of the experimental conditions; these outputs are evaluated against the test sets from Phases 2-4, as well as a previously unreleased secret test set from a new domain. Each team can make up to three submissions to each of Phases 2-5 (see Rules).
Please use our Forum for questions and announcements. For private inquiries, please contact arr-data [at] ukp.informatik.tu-darmstadt.de.
This shared task is organized by the InterText team at UKP Lab: https://intertext.ukp-lab.de/.
Please cite the following papers when using the data or participating in the shared task:
The goal of this shared task is to investigate pragmatic tagging performance across domains under data scarcity. We invite the participants to submit systems built under the following three conditions: no-data (no labeled training data available, zero-shot), low-data (a few labeled examples available per domain), and full-data (all training data available).
The data slices for each condition are provided by the organisers. For fine-tuning-based approaches, models may be fine-tuned only on the specified data slice, optionally supplemented by the accompanying F1000raw and ARR auxiliary datasets. For approaches based on prompting and in-context learning, the data slice determines the pool of examples available to supplement the prompt, as in the sketch below. Fine-tuning-based and prompting-based approaches will be evaluated jointly. Please refer to the Rules for further details.
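As an illustration of the prompting-based setup, below is a minimal sketch of restricting in-context examples to the released data slice. The file name, JSON fields and prompt wording are assumptions made for illustration and do not reflect the official data format.

```python
# Hypothetical prompting-based setup: in-context examples may only be drawn
# from the data slice released for the chosen condition.
import json
import random


def load_slice(path):
    """Load a labeled slice as a list of {"sentence": ..., "label": ...} records (assumed format)."""
    with open(path) as f:
        return json.load(f)


def build_prompt(slice_examples, target_sentence, k=4, seed=0):
    """Supplement the prompt with up to k examples drawn only from the provided slice."""
    rng = random.Random(seed)
    demos = rng.sample(slice_examples, k=min(k, len(slice_examples)))
    lines = ["Assign a pragmatic tag to each peer review sentence."]
    for example in demos:
        lines.append(f"Sentence: {example['sentence']}\nTag: {example['label']}")
    lines.append(f"Sentence: {target_sentence}\nTag:")
    return "\n\n".join(lines)


# Usage with a hypothetical low-data slice file:
# examples = load_slice("low_data_slice.json")
# print(build_prompt(examples, "The evaluation section lacks strong baselines."))
```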
To ensure a level playing field, we require that participants do not fine-tune their models on any additional data beyond the training and auxiliary data provided in the shared task package. Each submitted model output will be evaluated separately on each of the domains in the test set, plus the secret test set.
The model outputs will be scored by their average performance across domains, separately for each condition. As the evaluation metric for each domain, we use macro-F1 computed over all review sentences of that domain. We require participants to use a standard naming convention for their output files (see Rules). During the final evaluation, participants may submit models trained in any of the no-/low-/full-data settings, which are evaluated against the previous test data as well as new test data from a secret domain. We provide a starter kit containing the test script and boilerplate code for the task baselines.
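To make the scoring concrete, here is a minimal sketch of the per-domain macro-F1 averaging described above, using scikit-learn. It is illustrative only, not the official test script from the starter kit; the domain names and labels are placeholders.

```python
# Illustrative scoring sketch: macro-F1 over all review sentences of each
# domain, then averaged across domains (not the official test script).
from sklearn.metrics import f1_score


def score_submission(gold_by_domain, pred_by_domain):
    """Both arguments map a domain name to the list of its sentence-level tags."""
    per_domain = {
        domain: f1_score(gold_by_domain[domain], pred_by_domain[domain], average="macro")
        for domain in gold_by_domain
    }
    overall = sum(per_domain.values()) / len(per_domain)
    return per_domain, overall


# Placeholder domains and labels, for illustration only:
gold = {"domain_a": ["Weakness", "Todo", "Weakness"], "domain_b": ["Todo", "Weakness"]}
pred = {"domain_a": ["Weakness", "Weakness", "Weakness"], "domain_b": ["Todo", "Weakness"]}
print(score_submission(gold, pred))
```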
By participating in this Shared Task, you agree to the following terms and conditions. In case of discrepancies, the conditions and information provided on this page have priority. In case of questions, feel free to contact the organizing team.
Phase 1 (Sandbox)
Start: May 17, 2023, midnight
Description: Troubleshoot your submission on a toy data sample.
Phase 2 (No-data)
Start: May 17, 2023, midnight
Description: No training data available (zero-shot).
Phase 3 (Low-data)
Start: May 17, 2023, midnight
Description: A few labeled examples are available per domain.
Phase 4 (Full-data)
Start: May 17, 2023, midnight
Description: All training data is available.
Phase 5 (Final evaluation)
Start: June 30, 2023, midnight
End: July 21, 2023, midnight
Description: Evaluate your best model in a secret domain.