In this first edition of the LongEval Lab, we look at the temporal persistence of systems' performance. To include temporal persistence as an additional quality of the proposed models, participants are asked to propose temporal IR systems (Task 1) and longitudinal text classifiers (Task 2) that generalize well beyond a training set generated within a limited time frame.
We consider two types of temporal persistence tasks: temporal information retrieval and longitudinal text classification. For each task, we look at both short-term and long-term performance persistence. We aim to answer a high-level question:
Given a longitudinal evolving benchmark for a typical NLP task, what types of models offer better temporal persistence over a short term and a long term?
The goal of LongEval-Classification (Task 2 of CLEF 2023) is to propose a temporally persistent classifier, i.e. one that mitigates the performance drop over short and long periods of time relative to a test set from the same time frame as the training data.
Given a test set from the same time frame as training and evaluation sets from short- and long-term time periods, the task is to design a classifier that mitigates short- and long-term performance drops.
The organizers will provide a training set collected over a time frame up to a time t, together with test sets from time t and from time t+i, where i = 1 for Sub-task A and i > 1 for Sub-task B.
*Sub-task short-term persistence.* In this task, participants will be asked to develop models that demonstrate performance persistence over short periods of time (within 1 year from the training data).
*Sub-task long-term persistence.* In this task, participants will be asked to develop models that demonstrate performance persistence over a longer period of time (over 1 year apart from the training data).
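For concreteness, the code sketches in the rest of this notebook assume the released splits have been exported to CSV files with a "text" and a "label" column. The file names below (train.csv, test_within.csv, test_short.csv, test_long.csv) are hypothetical placeholders, not the official distribution format.

import pandas as pd

# Hypothetical file names; substitute the paths of the official LongEval splits.
train_df = pd.read_csv("train.csv")  # training data, collected up to time t
test_sets = {
    "within": pd.read_csv("test_within.csv"),  # time t, same frame as training
    "short": pd.read_csv("test_short.csv"),    # time t+1 (Sub-task A)
    "long": pd.read_csv("test_long.csv"),      # time t+i, i > 1 (Sub-task B)
}
X_train, y_train = train_df["text"], train_df["label"]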
This notebook provides a baseline for each setting of LongEval-Classification (Task 2 of CLEF 2023).
Please start by stepping through this notebook so you have a clear idea of what the task expects and what you need to submit. These baselines are based on the results described in the paper “Building for Tomorrow: Assessing the Temporal Persistence of Text Classifiers”.
Participants in this setting are expected to design an experimental architecture that enhances a sentiment text classifier's temporal performance persistence. Models should be evaluated in this setting without being adjusted to the timing of the target years; the focus is on hyperparameter tuning and objective-function optimisation, so that each model can be ranked on its short- and long-term persistence.
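As a minimal sketch of such a model-centric pipeline (not the official baseline, and assuming the hypothetical X_train, y_train and test_sets loaded above), one could tune a TF-IDF + logistic regression classifier with a cross-validated grid search optimising macro F1 and then apply the single selected model, unchanged, to all three test sets:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score

# One model, tuned once on the training time frame and then frozen:
# no per-year adaptation is performed in this setting.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(sublinear_tf=True)),
    ("clf", LogisticRegression(max_iter=1000)),
])
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipeline, param_grid, scoring="f1_macro", cv=5)
search.fit(X_train, y_train)

# Evaluate the same frozen model on the within, short and long test sets.
scores = {
    name: f1_score(df["label"], search.predict(df["text"]), average="macro")
    for name, df in test_sets.items()
}
print(scores)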
Participants in this setting are expected to design a data-centric experimental setup to fine-tune a sentiment text classifier so that it can classify new, evolving test sets. Participants are permitted to use supervised or unsupervised time-specific approaches. However, all data-centric systems should be evaluated on all three test sets to ensure consistency, i.e. if a classifier is tuned using data from one year, it should still be evaluated on all of the test sets.
For this setting, additionally fine-tune on the unsupervised longitudinal data provided (the “unannotated sentiment dataset”).
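A minimal sketch of one possible data-centric approach (purely illustrative, not the official baseline) is semi-supervised self-training: the labelled training texts and the unannotated longitudinal texts are pooled, the unannotated examples are marked with the label -1, and scikit-learn's SelfTrainingClassifier iteratively pseudo-labels them. The file name unlabeled.csv and the variables X_train, y_train and test_sets are the hypothetical ones introduced above; integer-encoded labels are assumed. The resulting single model is then scored on all three test sets, as required for consistency.

import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.metrics import f1_score

# Hypothetical path to the provided unannotated longitudinal data.
unlabeled_texts = pd.read_csv("unlabeled.csv")["text"]

# Pool labelled and unlabelled texts; unlabelled examples get the label -1
# (integer-encoded sentiment labels are assumed here).
X_all = pd.concat([X_train, unlabeled_texts], ignore_index=True)
y_all = np.concatenate([y_train.to_numpy(), np.full(len(unlabeled_texts), -1)])

semi = Pipeline([
    ("tfidf", TfidfVectorizer(sublinear_tf=True)),
    ("clf", SelfTrainingClassifier(LogisticRegression(max_iter=1000), threshold=0.9)),
])
semi.fit(X_all, y_all)

# Consistency requirement: the same model is scored on every test set.
for name, df in test_sets.items():
    print(name, f1_score(df["label"], semi.predict(df["text"]), average="macro"))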
Here, Time Gap is the difference in time between the test and training time frames, i.e. when the testing year is 2014 and the training year is 2013, Time Gap = +1; similarly, when the testing and training data come from the same time frame, Time Gap = 0, as in the within-time test set.
In this shared task we aim to rank models based on their performance separately for the within-time (testing year == training year), short-term and long-term (testing year > training year) settings. This will allow us to see whether one model is more persistent over time by ranking first on all longitudinal datasets.
Submissions will be ranked by *macro F1-score* separately for the within-time, short-term and long-term datasets, though the winning systems are those ranked top by the results averaged over all test sets. The metric will be computed as follows:
from sklearn.metrics import f1_score

# Macro-averaged F1 over all classes, computed for a single test set.
f1_score(y_true, y_pred, average='macro')
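For example, with the hypothetical test_sets dictionary and the fitted model from the sketches above (here called search), the per-split scores and the averaged score used for the overall ranking could be computed as:

# Macro F1 per test set (within, short, long).
per_split = {
    name: f1_score(df["label"], search.predict(df["text"]), average="macro")
    for name, df in test_sets.items()
}
# Average over all test sets, used to pick the overall winner.
overall = sum(per_split.values()) / len(per_split)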
def calculate_RPD(temporal_fscore, within_fscore):
    """Relative performance drop (RPD) of a temporal (short- or long-term)
    test set score with respect to the within-time score."""
    RPD = (temporal_fscore - within_fscore) / within_fscore
    return RPD
This is used to quantify the drop by comparing [within vs. short] and [within vs. long], to see whether 'within-time' performance is a good indicator when evaluating a model's robustness.
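For instance, using the hypothetical per-split macro F1 scores computed above:

# Relative drop of the short- and long-term scores with respect to the within-time score;
# values close to 0 indicate a temporally persistent model.
rpd_short = calculate_RPD(per_split["short"], per_split["within"])
rpd_long = calculate_RPD(per_split["long"], per_split["within"])
print(rpd_short, rpd_long)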
By participating in this task you agree to the following terms and conditions.
(In very specific instances, we might make some exceptions. Please contact the task organisers if you have any further queries.)
Start dates: April 25, 2023; May 4, 2023; May 23, 2023 (midnight). End date: never.