Dear Organizers,

We would like to ask you about several aspects of the ongoing shared task 1 (speaker attribution in BT debates).

1) Phases of the shared task 1 (subtask 1)

NLP shared tasks often offer a first phase (training) in which teams can submit predictions on a dataset and receive scores in return (often together with a ranking on a leaderboard). This is usually very helpful for understanding how the performance metric is calculated and for improving models and approaches. In a second phase (evaluation), teams submit their final predictions. We wonder whether we are currently in the training phase: when we click on "Participate | Submit", we learn that we can make up to 1000 submissions per day.

However, under "Learn the Details | Evaluation" you write that teams can upload a total of 20 submissions (possibly during the evaluation period).

Could you please clarify how the phases of task 1 are organized?

2) Gold cues for task 1 (subtask 2)

When we click on "Learn the details | Overview", we learn that for task 1 subtask 2, there will be gold cues provided by the organizers. Can you tell us when and where gold cues will be provided?

3) Calculation of performance metric for task 1

You have described in "Learn the details | Evaluation" how you calculate the performance metric. However, we feel that details are missing and we would like to make sure that we understand how you calculate the proportional F1 score.

On the aforementioned CodaLab page, you wrote that you would add the evaluation metrics (perhaps a scoring script?) to https://github.com/umanlp/SpkAtt-2023/tree/main/eval. Such a script would be very helpful. At the time of writing, we have not been able to find any information at that URL.

Best regards,

Team CPAa

Dear Team CPAa,

1) Phases of the shared task:

During the training phase, it was possible to make up to 1000 submissions per day. We have now decided not to restrict the evaluation phase to 20 submissions; instead, participants may also make up to 1000 submissions per day during evaluation.

2) Gold cues for task 1 (subtask 2)

As announced in a previous email, we will replace the test data after the submission period for subtask 1 has closed (July 31). Submissions for task 1, subtask 2 will be possible from Aug 1 to Aug 3. The test data with gold cues will be available from our GitHub repository.

3) Calculation of performance metric for task 1

For evaluation, we calculate the proportional overlap for precision and recall on the token level for cues and roles.

Consider the sentence "Wenn ich alleine auf die Politik im Kontext der globalen Gesundheit schaue , dann stelle ich fest , dass wir seit 2013 eine ständig zunehmende Befassung mit diesem Thema haben ;"

Let's assume the gold annotations are:

"Message": [ "7:18", "7:19", "7:20", "7:21", "7:22", "7:23", "7:24", "7:25", "7:26", "7:27", "7:28", "7:29" ],

"Source": [ "7:15" ],

"Cue": [ "7:14" ],

"PTC": [ "7:16" ]

and your system predictions are:

"Message": [ "7:17", "7:18", "7:19", "7:20", "7:21", "7:22", "7:23", "7:24", "7:25", "7:26", "7:27", "7:28", "7:29" ],

"Source": [ "7:15" ],

"Cue": [ "7:14" ],

Then you would have 1 true positive for cues and 13 true positives for roles, with 1 false positive (7:17) and 1 false negative (7:16). Note that predictions for roles are only counted as correct if the corresponding cue has been identified correctly.

The scores are then computed as:

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

F1 = 2 * Precision * Recall / (Precision + Recall)

The joint score simply pools the true positives, false positives and false negatives for cues and roles, and then computes precision, recall and F1 as specified above. For this example, the joint score is based on 13+1 TPs, 1 FP and 1 FN:

prec cues: 1.0

recall cues: 1.0

f1 cues: 1.0

prec roles: 0.929

recall roles: 0.929

f1 roles: 0.929

prec joint: 0.933

recall joint: 0.933

f1 joint: 0.933
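The arithmetic above can be reproduced with a short sketch. This is a reconstruction for illustration, not the official scoring script; it treats each annotation as a set of token ids and pools cue and role counts for the joint score, as described above:

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def overlap_counts(gold, pred):
    """Token-level TP/FP/FN between two sets of token ids."""
    return len(gold & pred), len(pred - gold), len(gold - pred)

# Gold: Message 7:18-7:29, Source 7:15, PTC 7:16; Cue 7:14
gold_roles = {f"7:{i}" for i in range(18, 30)} | {"7:15", "7:16"}
# Predicted: Message 7:17-7:29, Source 7:15; Cue 7:14
pred_roles = {f"7:{i}" for i in range(17, 30)} | {"7:15"}
gold_cues, pred_cues = {"7:14"}, {"7:14"}

cues = overlap_counts(gold_cues, pred_cues)        # (1, 0, 0)
roles = overlap_counts(gold_roles, pred_roles)     # (13, 1, 1)
joint = tuple(c + r for c, r in zip(cues, roles))  # pooled: (14, 1, 1)

print(prf(*cues))   # (1.0, 1.0, 1.0)
print(prf(*roles))  # ~0.929 each
print(prf(*joint))  # ~0.933 each
```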

The evaluation script is integrated in CodaLab and you can run up to 1000 evaluation runs per day.

Best,

Ines (for the shared task organizers)

Dear Organizers,

I am still not sure if I understand the calculation of the performance metric (for task 1) correctly. In particular, how does this generalize to multiple gold annotations and predictions, in particular when multiple predictions overlap a single gold annotation?

For instance, consider the sentence "Jetzt erteile ich das Wort der Kollegin Sabine Zimmermann, Die Linke".

As gold tags, we have (Annotation A):

"Cue": ["6:1", "6:4"]

"Addr": ["6:5", "6:6", "6:7", "6:8", "6:9", "6:10", "6:11"]

Now suppose we have the following two predictions:

(Prediction A')

"Cue": ["6:1"]

"Addr": ["6:5", "6:6", "6:7", "6:8", "6:9"]

(Prediction B')

"Cue": ["6:3", "6:4"]

"Addr": ["6:10", "6:11"]

1) Concerning cue metrics: How are the TP/FP/FN calculated for cues? E.g., do I have 0 false negatives (all cues are covered by A' and B' combined), or do I have 2 false negatives (one in each prediction)?

2) Concerning role metrics: Which one(s) of the Addr predictions is now used to compute TP/FP/FN? All individually? None (because in none of the predictions "the cue has been identified correctly")?

Thank you and best regards,

Anton

Dear Anton,

We also compute overlap for the cues, meaning that if at least one token of a cue has been identified correctly, we compute overlap for this cue and its roles. For the example you gave, we would get the following scores:

Gold:

"Cue": ["6:1", "6:4"]

"Addr": ["6:5", "6:6", "6:7", "6:8", "6:9", "6:10", "6:11"]

Pred:

"Cue": ["6:1"]

"Addr": ["6:5", "6:6", "6:7", "6:8", "6:9"]

For the cues, we get 1 TP and 1 FN (and no FP).

For the roles, we have 5 TP and 2 FN (and no FP).

That gives us:

Cues(P): 1 / 1 = 1.0

Cues(R): 1 / 2 = 0.5

Roles(P): 5 / 5 = 1.0

Roles(R): 5 / 7 = 0.714

Joint(P): 6 / 6 = 1.0

Joint(R): 6 / 9 = 0.667

Joint(F1): 0.8
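The joint counts above can be checked with a few lines of set arithmetic (again a reconstruction for illustration, not the official scorer):

```python
gold_cues = {"6:1", "6:4"}
pred_cues = {"6:1"}
gold_addr = {"6:5", "6:6", "6:7", "6:8", "6:9", "6:10", "6:11"}
pred_addr = {"6:5", "6:6", "6:7", "6:8", "6:9"}

# Pool token-level counts over cues and roles for the joint score.
tp = len(gold_cues & pred_cues) + len(gold_addr & pred_addr)  # 1 + 5 = 6
fp = len(pred_cues - gold_cues) + len(pred_addr - gold_addr)  # 0 + 0 = 0
fn = len(gold_cues - pred_cues) + len(gold_addr - pred_addr)  # 1 + 2 = 3

p = tp / (tp + fp)        # 6/6 = 1.0
r = tp / (tp + fn)        # 6/9 = 0.667
f1 = 2 * p * r / (p + r)  # 0.8
```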

I hope that helps. Don't hesitate to ask if you have any further questions.

Best,

Ines

Dear Ines,

thank you for your clarifications. Still however, I am struggling to understand the scoring system in the presence of *multiple* overlapping annotations.

Sorry, I didn't make this entirely clear in my post: in the above example, I am assuming that my system incorrectly predicts *both* A' and B' (and writes them both to the "Annotations" array for scoring).

Now, in *both* A' and B', "at least one token for a cue has been identified correctly". So why did you choose A' (as opposed to B') when comparing gold with predicted? Is there some system behind this? Do you take just the *first* one (in the order given by the predicted "Annotations" array), or the *best* one (in the sense of maximizing Joint(F1) overall)?

Thank you and best regards,

Anton

Dear Anton,

sorry, I didn't fully understand your question initially.

In cases like the one you describe, where you have two competing, partially overlapping annotations, our script selects the one that scores better. In other words, we select the prediction that maximizes the overall F1 for cues and roles.
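A hypothetical sketch of that selection rule, using the example from this thread (not the official script, which may differ in implementation details): score each candidate prediction against the gold annotation and keep the one with the higher joint F1.

```python
def joint_f1(gold_cues, gold_roles, pred_cues, pred_roles):
    """Joint F1 from pooled token-level cue and role counts."""
    tp = len(gold_cues & pred_cues) + len(gold_roles & pred_roles)
    fp = len(pred_cues - gold_cues) + len(pred_roles - gold_roles)
    fn = len(gold_cues - pred_cues) + len(gold_roles - pred_roles)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

gold = ({"6:1", "6:4"},
        {"6:5", "6:6", "6:7", "6:8", "6:9", "6:10", "6:11"})

candidates = {
    "A'": ({"6:1"}, {"6:5", "6:6", "6:7", "6:8", "6:9"}),
    "B'": ({"6:3", "6:4"}, {"6:10", "6:11"}),
}

# Keep whichever competing prediction scores best against gold.
best = max(candidates, key=lambda k: joint_f1(*gold, *candidates[k]))
print(best)  # A' (joint F1 0.8 beats B's 0.462)
```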

Best,

Ines