PragTag-2023


PragTag-2023: The First Shared Task on Pragmatic Tagging of Peer Reviews

31.07.2023 UPDATE: We thank all the participants for their submissions and are looking forward to learning more about the systems! Given the previous deadline extension, we have decided to move the system report submission deadline by one week. We also release detailed reporting guidelines, which can be found here.

20.07.2023 UPDATE: Several participants had trouble with submissions due to confusion about the total vs. daily number of submissions allowed by CodaLab. We have decided to increase the total number of submissions to FIVE for all phases and all participants, effective immediately. We hope this allows you to get your best model through. Good luck!

30.06.2023 UPDATE: As the final phase of the competition begins, we release a new test input bundle that includes the previous test data plus inputs from a new, "secret" domain. The data can be found under Participate > Files (purple button). Your submission to the Final phase consists of your system's predictions for these new test inputs.

 

In this shared task we invite the community to explore cross-domain, low-resource processing of peer reviews, using the recently proposed pragmatic tagging task as an objective and the recently introduced open multi-domain corpus of peer reviews as a rich auxiliary data source. This page provides a general overview of the competition. The shared task takes place as part of the 10th Workshop on Argument Mining at EMNLP-2023. In addition to entering the competition via CodaLab, please >REGISTER< for the task by filling out an external form.

Motivation

Peer review is a key element of the scientific process, yet it is challenging and could greatly benefit from automation and assistance. At the core of peer review lie review reports -- short argumentative texts where reviewers evaluate papers and make revision suggestions. Automatic analysis of argumentation in peer reviews has numerous applications, from facilitating meta-scientific analysis of reviewing practices, to aggregating information from multiple reviews and assisting less experienced reviewers. Yet several challenges remain open. Peer reviews are scientific text, and models pre-trained on general-domain data might suffer from domain shift. The performance of the latest generation of LLMs in this domain remains unknown. Reviewing practices and criteria vary across research communities and publication venues, posing an additional challenge to generalization. Finally, reviewing data is scarce and expensive to label. The competition aims to explore solutions to these challenges.

Schedule

All deadlines are midnight UTC+0.

May 17th, 2023 Registration start. Training, test and auxiliary data release.
June 30th, 2023 Registration end. Secret set release. Final evaluation starts.
July 21st, 2023 Final evaluation output submission due. End of competition.
August 18th, 2023 (extended from August 11th)   System report submission due
September 18th, 2023   System report camera-ready due
(October 18th, 2023) (Shared task paper due)
December 6-7th, 2023   Workshop

Task

[Figure: Pragmatic tagging example]

Peer review is used across all fields of research, from scientific policy studies to software development. The topics, values and publication standards are unique to each research community. Yet the goals of reviewers' argumentation are similar: to summarize and evaluate the work, and to help the authors improve their manuscript. One way to capture this regularity and enable the study of peer review across communities is pragmatic tagging: a coarse sentence labeling task, where each peer review sentence is assigned one of the following pragmatic categories (a minimal data sketch follows the list):

  • Recap summarizes the manuscript: "The paper proposes a new method for..."
  • Strength points out the merits of the work: "It is very well written and the contribution is significant."
  • Weakness points out a limitation: "However, the data is not publicly available, making the work hard to reproduce."
  • Todo suggests the ways a manuscript can be improved: "Could the authors devise a standard procedure to obtain the data?"
  • Other contains additional information, e.g. reviewer's thoughts, background knowledge and performative statements: "Few examples from prior work: [1], [2], [3]", "Once this is clarified, the paper can be accepted."
  • Structure is used to organize the reviewing report: "Typos:"
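
For illustration, here is a minimal sketch of the task viewed as sentence-level classification, reusing the example sentences above. The record layout (a plain list of sentence/label pairs) is an assumption for illustration only; the actual data format is defined in the shared task data package.

    # Minimal sketch of pragmatic tagging as sentence-level labeling.
    # The label names follow the list above; the record layout is illustrative
    # and not necessarily the exact JSON schema used by the shared task data.
    PRAGMATIC_TAGS = {"Recap", "Strength", "Weakness", "Todo", "Other", "Structure"}

    tagged_review = [
        {"sentence": "The paper proposes a new method for...", "label": "Recap"},
        {"sentence": "It is very well written and the contribution is significant.", "label": "Strength"},
        {"sentence": "However, the data is not publicly available, making the work hard to reproduce.", "label": "Weakness"},
        {"sentence": "Could the authors devise a standard procedure to obtain the data?", "label": "Todo"},
        {"sentence": "Typos:", "label": "Structure"},
    ]

    # Every sentence of a review receives exactly one tag.
    assert all(item["label"] in PRAGMATIC_TAGS for item in tagged_review)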

Prior work has shown that this simple sentence-level schema applies well across different research fields and communities, and yields good inter-annotator agreement of approximately 0.70+ Krippendorff's alpha in human annotation. The sources of disagreement ("hard cases") are the coarse granularity of the schema (necessary for generalization), the sentence-level analysis (necessary to avoid discrepancies due to differences in sub-sentence splitting), and the natural ambiguity of the classes (e.g. Weakness vs. Todo).

Details

The participants of the shared task are invited to explore the task of cross-domain pragmatic tagging in the low-resource scenarios detailed in the Evaluation section, in accordance with the Rules. We provide two types of data to support the task: task data contains pragmatic tag annotations across multiple domains, while auxiliary data is an additional data source for pre-training and fine-tuning models. The Data page outlines the details. The competition consists of five phases. Phases 2-4 run simultaneously and accommodate submissions in three experimental conditions: no-data, low-data and full-data. Phase 5 is the final evaluation phase, where the participants submit the outputs of their models created in any of the experimental conditions; these are evaluated against the test sets from Phases 2-4, as well as a previously unreleased secret test set from a new domain. Each team can make up to five submissions to each of these phases (increased from three on 20.07.2023; see Rules). Phase 1 serves as a sandbox for troubleshooting submissions, with an unlimited number of submissions per team.

Contact

  • Nils Dycke, UKP Lab, Technical University of Darmstadt
  • Ilia Kuznetsov, UKP Lab, Technical University of Darmstadt

Please use our Forum for questions and announcements; for private inquiries, contact arr-data [at] ukp.informatik.tu-darmstadt.de.

References

This shared task is organized by the InterText team at UKP Lab: https://intertext.ukp-lab.de/.

Please cite the following papers when using the data or participating in the shared task:

Evaluation

Conditions

The goal of this shared task is to investigate pragmatic tagging performance across domains under data scarcity. We invite the participants to submit their systems trained under the following three conditions:

  • No-data: the system has observed no instances in the task data and operates zero-shot.
  • Low-data: the system has observed only 20% of the task data (33 reviews, 739 sentences).
  • Full-data: the system has observed 100% of the task data (117 reviews, 2326 sentences).

The data slices for each condition are provided by the organisers. For fine-tuning-based approaches, we require that models are fine-tuned only on the specified data slice, optionally supplemented by the accompanying F1000raw and ARR datasets. For approaches based on prompting and in-context learning, the data slice determines the pool of examples available to supplement the prompt, as sketched below. Fine-tuning-based and prompting-based approaches will be evaluated jointly. Please refer to the Rules for further details.
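
As a sketch of how the data slice constrains prompting-based approaches, the snippet below draws in-context demonstrations only from the slice of the chosen condition. The file name train_low.json and the sentence/label record fields are assumptions for illustration; the official files and their format are described on the Data page.

    import json
    import random

    def build_prompt(target_sentence, slice_path="train_low.json", k=4, seed=0):
        """Assemble a few-shot prompt using only examples from the allowed data slice.

        The file name and record fields are assumptions for illustration.
        """
        with open(slice_path) as f:
            pool = json.load(f)  # assumed: list of {"sentence": ..., "label": ...} records
        random.seed(seed)
        demos = random.sample(pool, k)  # demonstrations come only from the permitted slice
        lines = [f'Sentence: "{d["sentence"]}"\nTag: {d["label"]}' for d in demos]
        lines.append(f'Sentence: "{target_sentence}"\nTag:')
        return "\n\n".join(lines)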

Use of external data

To ensure a level playing field, we require that the participants do not fine-tune their models on any additional data apart from the training and auxiliary data provided in the shared task package. Each submitted model output will be evaluated separately on each of the domains in the test set, plus the secret test set.

Scoring

The model outputs will be scored by average performance across domains, in each of the conditions. As an evaluation metric for each domain, we use macro-F1 computed across all review sentences of that domain. We require the participants to use a standard naming convention for their output files (see Rules). During the final evaluation, the participants are allowed to submit the outputs of models trained in any of the no-/low-/full-data settings, which are evaluated against the previous test data, as well as new test data in a secret domain. We provide a starter kit containing the test script and boilerplate code for the task baselines.
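
The sketch below mirrors the scoring logic described above: macro-F1 over all review sentences of each domain, averaged across domains. The evaluation script shipped with the starter kit is authoritative; the input structure used here (label lists keyed by domain) is an assumption.

    from sklearn.metrics import f1_score

    def score(gold_by_domain, pred_by_domain):
        """gold_by_domain / pred_by_domain: dict mapping domain -> list of sentence labels."""
        per_domain = {
            domain: f1_score(gold_by_domain[domain], pred_by_domain[domain], average="macro")
            for domain in gold_by_domain
        }
        # Final score: plain average of the per-domain macro-F1 values.
        return sum(per_domain.values()) / len(per_domain), per_domain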

Terms and Conditions

By participating in this Shared Task, you agree to the following terms and conditions. In case of discrepancies, the conditions and information provided on this page have priority. In case of questions, feel free to contact the organizing team.

  • By participating in this competition, you agree that your submitted outputs can be used by the shared task organizers to calculate and publicly display performance scores, as well as to conduct qualitative analysis of the submitted model outputs. You agree that the choice of the performance metric and evaluation procedure are decided by the task organizers.
  • You agree that the shared task organizers are under no obligation to release the scores, and that the scores can be withheld if the submission is deemed incomplete, deceptive, or violates the letter or spirit of the competition rules. The organizers and their affiliated institutions make no warranties regarding the datasets provided, including but not limited to being correct or complete. They cannot be held liable for providing access to the datasets or the usage of the datasets.
  • A participant may be involved in only one team and must register to participate. Submissions are made to one of the four experimental conditions: no-data (zero), low-data, full-data and final evaluation. The training data for all conditions is provided by the organizers. The development split selection is left to the participants; using test set results for model selection is not permitted and leads to disqualification. The test set inputs are provided by the task organizers, and the submissions consist of the system outputs for the given test inputs. The submission files are required to follow the naming convention submission_[condition_name].zip, where [condition_name] reflects the experimental condition: {low|full|zero}. The zip archive contains exactly one file called "predicted.json" containing the model outputs (see the packaging sketch after this list). Misrepresenting the experimental condition (e.g. submitting a model trained on full data to the no-data condition) will result in disqualification.
  • System outputs are automatically evaluated against the gold labels via CodaLab; we perform one evaluation for each domain and rank systems by the average domain performance. The secret data consists of previously unreleased labeled data in a new domain. The evaluation scripts are made publicly available as part of the starter kit. The participants are requested to promptly report any issues encountered in the evaluation script. Each team is allowed to make up to five submissions (increased from three on 20.07.2023) to each of the conditions, plus up to five submissions for the final evaluation.
  • We expect most participants to use an openly available pre-trained language model as a starting point. Fine-tuning on and otherwise using the auxiliary data provided by the organizers is encouraged; fine-tuning on or otherwise using any other data is not allowed. To ensure reproducibility, we require the participants to use pre-trained models for which open weights are available (incl. LLMs like LLaMA). Models with no available weights, or models that require the use of paid APIs (like the OpenAI API), can be reported, but will not be included in the final ranking of the results, to ensure fairness.
  • Upon competition end, the teams are expected to submit a system description paper describing their approach, following the 10th Workshop on Argument Mining formatting guidelines and not exceeding 4 pages without references. The system description papers will be published in the workshop proceedings. We encourage the teams to report positive as well as negative results and clearly indicate the system that was used for the submission. The teams are further encouraged to submit all the details necessary for reproducing their results, incl. hyperparameter settings, prompts (both successful and unsuccessful), and example selection strategies, as well as analysis. (31.07.2023 UPDATE) Further details can be found in our submission guidelines.
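
As referenced in the Rules above, here is a minimal sketch of packaging a submission according to the naming convention submission_[condition_name].zip with a single predicted.json inside. The internal structure of predicted.json must follow the format provided in the starter kit and is not shown here.

    import json
    import zipfile

    def package_submission(predictions, condition):
        """Write submission_[condition].zip containing exactly one file, predicted.json."""
        assert condition in {"zero", "low", "full"}
        archive_name = f"submission_{condition}.zip"
        with zipfile.ZipFile(archive_name, "w", zipfile.ZIP_DEFLATED) as zf:
            # predictions must already follow the starter-kit output format (assumed here).
            zf.writestr("predicted.json", json.dumps(predictions, indent=2))
        return archive_name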

Sandbox

Start: May 17, 2023, midnight

Description: Troubleshoot your submission on a toy data sample.

No-data

Start: May 17, 2023, midnight

Description: No training data available (zero-shot).

Low-data

Start: May 17, 2023, midnight

Description: Few labeled examples are available per domain.

Full-data

Start: May 17, 2023, midnight

Description: All training data is available.

Final

Start: June 30, 2023, midnight

Description: Evaluate your best model in a secret domain.

Competition Ends

July 21, 2023, midnight
