In this first edition of the LongEval Lab, we look at the temporal persistence of systems' performance. To include temporal persistence as an additional quality of the proposed models, participants are asked to propose temporal IR systems (Task 1) and longitudinal text classifiers (Task 2) that generalize well beyond a training set generated within a limited time frame.
We consider two types of temporal persistence tasks: temporal information retrieval and longitudinal text classification. For each task, we look at both short-term and long-term performance persistence. We aim to answer a high-level question:
Given a longitudinal evolving benchmark for a typical NLP task, what types of models offer better temporal persistence over a short term and a long term?
The goal of LongEval-Classification (Task 2 of CLEF 2023) is to propose a temporally persistent classifier, i.e. one that mitigates the performance drop over short and long periods of time relative to a test set from the same time frame as the training data.
Given a test set from the same time frame as training and evaluation sets from short- and long-term time periods, the task is to design a classifier that mitigates short- and long-term performance drops.
The organizers will provide a training set collected over a time frame up to a time t, together with test sets from time t and from time t+i, where i = 1 for Sub-task A and i > 1 for Sub-task B.
*Sub-task short-term persistence.* In this task, participants will be asked to develop models that demonstrate performance persistence over short periods of time (within 1 year from the training data).
*Sub-task long-term persistence.* In this task, participants will be asked to develop models that demonstrate performance persistence over a longer period of time (over 1 year apart from the training data).
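For concreteness, the code sketches in the rest of this notebook assume the released splits have been exported to CSV files with a "text" and a "label" column. The file names below (train.csv, test_within.csv, test_short.csv, test_long.csv) are hypothetical placeholders, not the official distribution format.

import pandas as pd

# Hypothetical file names; substitute the paths of the official LongEval splits.
train_df = pd.read_csv("train.csv")  # training data, collected up to time t
test_sets = {
    "within": pd.read_csv("test_within.csv"),  # time t, same frame as training
    "short": pd.read_csv("test_short.csv"),    # time t+1 (Sub-task A)
    "long": pd.read_csv("test_long.csv"),      # time t+i, i > 1 (Sub-task B)
}
X_train, y_train = train_df["text"], train_df["label"]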
This notebook provides a baseline for each setting of LongEval-Classification (Task 2 of CLEF 2023).
Please start by stepping through this notebook so you have a clear idea of what the task expects and what you need to submit. These baselines are based on the results described in the paper “Building for Tomorrow: Assessing the Temporal Persistence of Text Classifiers”.
Participants in this setting are expected to design an experimental architecture that enhances a sentiment text classifier's temporal performance persistence. Models should be evaluated in this setting without being adjusted to the timing of the target years; the focus is on hyperparameter tuning and objective-function optimisation, so that each model can be ranked on its short- and long-term persistence.
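As a minimal sketch of such a model-centric pipeline (not the official baseline, and assuming the hypothetical X_train, y_train and test_sets loaded above), one could tune a TF-IDF + logistic regression classifier with a cross-validated grid search optimising macro F1 and then apply the single selected model, unchanged, to all three test sets:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score

# One model, tuned once on the training time frame and then frozen:
# no per-year adaptation is performed in this setting.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(sublinear_tf=True)),
    ("clf", LogisticRegression(max_iter=1000)),
])
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipeline, param_grid, scoring="f1_macro", cv=5)
search.fit(X_train, y_train)

# Evaluate the same frozen model on the within, short and long test sets.
scores = {
    name: f1_score(df["label"], search.predict(df["text"]), average="macro")
    for name, df in test_sets.items()
}
print(scores)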
Participants in this setting are expected to design a data-centric experimental setup to fine-tune a sentiment text classifier so that it can classify new, evolving test sets. Participants are permitted to use supervised or unsupervised time-specific approaches. However, all data-centric systems should be evaluated on all three test sets to ensure consistency, i.e. if a classifier is tuned using data from one year, it should still be evaluated on all of the test sets.
For this setting, additionally fine-tune on the unsupervised longitudinal data provided (the “unannotated sentiment dataset”).
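A minimal sketch of one possible data-centric approach (purely illustrative, not the official baseline) is semi-supervised self-training: the labelled training texts and the unannotated longitudinal texts are pooled, the unannotated examples are marked with the label -1, and scikit-learn's SelfTrainingClassifier iteratively pseudo-labels them. The file name unlabeled.csv and the variables X_train, y_train and test_sets are the hypothetical ones introduced above; integer-encoded labels are assumed. The resulting single model is then scored on all three test sets, as required for consistency.

import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.metrics import f1_score

# Hypothetical path to the provided unannotated longitudinal data.
unlabeled_texts = pd.read_csv("unlabeled.csv")["text"]

# Pool labelled and unlabelled texts; unlabelled examples get the label -1
# (integer-encoded sentiment labels are assumed here).
X_all = pd.concat([X_train, unlabeled_texts], ignore_index=True)
y_all = np.concatenate([y_train.to_numpy(), np.full(len(unlabeled_texts), -1)])

semi = Pipeline([
    ("tfidf", TfidfVectorizer(sublinear_tf=True)),
    ("clf", SelfTrainingClassifier(LogisticRegression(max_iter=1000), threshold=0.9)),
])
semi.fit(X_all, y_all)

# Consistency requirement: the same model is scored on every test set.
for name, df in test_sets.items():
    print(name, f1_score(df["label"], semi.predict(df["text"]), average="macro"))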
Here, Time Gap is the difference in time between the test and training time frames, i.e. when the testing year is 2014 and the training year is 2013, Time Gap = +1; similarly, when the testing and training data come from the same time frame, Time Gap = 0, as in the within-time test set.
In this shared task we aim to rank models based on their performance separately for the within-time (testing year == training year), short-term and long-term (testing year > training year) settings. This will allow us to see whether one model is more persistent over time by ranking first on all longitudinal datasets.
Submissions will be ranked by *macro F1-score* separately for the within-time, short-term and long-term datasets, though the winning systems are those ranked top by the results averaged over all test sets. The metric will be computed as follows:
from sklearn.metrics import f1_score

# Macro-averaged F1 over all classes, computed for a single test set.
f1_score(y_true, y_pred, average='macro')
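For example, with the hypothetical test_sets dictionary and the fitted model from the sketches above (here called search), the per-split scores and the averaged score used for the overall ranking could be computed as:

# Macro F1 per test set (within, short, long).
per_split = {
    name: f1_score(df["label"], search.predict(df["text"]), average="macro")
    for name, df in test_sets.items()
}
# Average over all test sets, used to pick the overall winner.
overall = sum(per_split.values()) / len(per_split)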
def calculate_RPD(temporal_fscore, within_fscore):
    """Relative performance drop (RPD) of a temporal (short- or long-term)
    test set score with respect to the within-time score."""
    RPD = (temporal_fscore - within_fscore) / within_fscore
    return RPD
This is used to quantify the drop by comparing [within vs. short] and [within vs. long], to see whether 'within-time' performance is a good indicator when evaluating a model's robustness.
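For instance, using the hypothetical per-split macro F1 scores computed above:

# Relative drop of the short- and long-term scores with respect to the within-time score;
# values close to 0 indicate a temporally persistent model.
rpd_short = calculate_RPD(per_split["short"], per_split["within"])
rpd_long = calculate_RPD(per_split["long"], per_split["within"])
print(rpd_short, rpd_long)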
By participating in this task you agree to the following terms and conditions.
(In very specific instances, we might make some exceptions. Please contact the task organisers if you have any further queries.)
Start dates: April 25, 2023; May 4, 2023; May 23, 2023 (midnight). End date: never.