Hi organizers,

In the test set, there are 1364 questions: 182 of type MC, 356 of type Numerical, and 826 of type T/F.
Yet the final metric for leaderboard ranking is a plain sum of the MC score, the Numerical score, and the T/F score.
What is the rationale for not re-weighting the scores to account for the number of questions of each type? The T/F score should have a higher weight than the MC one.

Posted by: matravox @ Dec. 8, 2022, 2:51 a.m.

Hi. Thank you for the question. We wanted to encourage progress on all three types of questions, which we think deserve equal attention. Weighting by question count would make T/F performance dominate the overall score in this case, which may not be desirable.
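
To make the trade-off concrete, here is a minimal sketch (not the official scoring code) comparing a plain sum with a count-weighted combination. Only the question counts come from this thread; the function names and the example scores are hypothetical, and the actual per-type metrics and their scales are not specified here.

```python
# Question counts per type, as quoted earlier in this thread.
counts = {"MC": 182, "Numerical": 356, "T/F": 826}


def plain_sum(scores):
    """Leaderboard-style combination: each type contributes equally."""
    return sum(scores.values())


def count_weights(counts):
    """Weight of each type if scores were weighted by question count."""
    total = sum(counts.values())
    return {qtype: n / total for qtype, n in counts.items()}


def count_weighted(scores, counts):
    """Hypothetical alternative: weight each type by its share of questions."""
    weights = count_weights(counts)
    return sum(scores[qtype] * weights[qtype] for qtype in scores)


# Placeholder scores for illustration only (not real results).
example_scores = {"MC": 0.5, "Numerical": 0.5, "T/F": 0.5}

if __name__ == "__main__":
    print(count_weights(counts))
    print(plain_sum(example_scores))
    print(count_weighted(example_scores, counts))
```

Under count-based weighting, T/F would receive roughly 61% of the total weight, Numerical about 26%, and MC about 13%, whereas the plain sum gives each type one third.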

Posted by: Autocast @ Dec. 8, 2022, 9:32 p.m.

OK, makes sense. Thank you for clarifying.

Posted by: matravox @ Dec. 9, 2022, 5:38 a.m.
