I'm working on the system description paper, and I was wondering whether you could publish the evaluation code (or maybe you already have)? I know it uses weighted F1, but for the span-based metrics I'm not sure whether each span is treated as a single entity or scored token by token.
If you could share it, that would be great, thanks!
Posted by: pughrob @ June 12, 2023, 7:18 p.m.

Hi,
Evaluation for task 1 uses the weighted F1 metric.
Evaluation for tasks 2 and 3 is at span level and it's strict: a submitted span <i2, j2> counts as correct only if it exactly matches an expected span <i1, j1>, i.e. both indexes must coincide. We then measure precision, recall and F1 over those two sets of spans.
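A minimal sketch of this strict, exact-match scoring, assuming spans are given as (start, end) index pairs (the function name and signature here are illustrative, not the actual evaluation code):

```python
def strict_span_prf(expected, predicted):
    """Strict span-level precision/recall/F1: a predicted span is a
    true positive only if its (start, end) indexes exactly match an
    expected span."""
    expected_set, predicted_set = set(expected), set(predicted)
    tp = len(expected_set & predicted_set)  # exact-match spans only
    precision = tp / len(predicted_set) if predicted_set else 0.0
    recall = tp / len(expected_set) if expected_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Note that a span overlapping the gold span but with a shifted boundary, e.g. (6, 8) vs. (5, 8), contributes nothing under this strict criterion.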
In the output of your submissions you can also see other metrics besides each task's main metric: precisions, recalls, unlabeled variants, etc.
Regards,
Luis