In the test set, there are 1364 questions: 182 of type MC, 356 of type Numerical, and 826 of type T/F.
Yet the final metric for leaderboard ranking is a plain sum of the MC score, the Numerical score, and the T/F score.
What is the rationale for not re-weighting the scores to account for the number of questions of each type? With 826 T/F questions versus 182 MC questions, shouldn't the T/F score carry a higher weight than the MC score?
Hi. Thank you for the question. We wanted to encourage progress on all three types of questions, which we think deserve equal attention. Weighting by question count would make T/F performance dominate the overall score in this case, which may not be desirable.
Posted by: Autocast @ Dec. 8, 2022, 9:32 p.m.
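To make the trade-off concrete, here is a minimal Python sketch comparing the plain sum used for ranking with a count-weighted alternative. The question counts come from the thread; the function names are hypothetical, and the sketch assumes the three per-type scores are on comparable scales, which the thread does not state.

```python
# Question counts per type, as stated in the thread.
counts = {"MC": 182, "Numerical": 356, "T/F": 826}
total = sum(counts.values())  # 1364 questions in the test set


def plain_sum(scores):
    """Leaderboard-style metric: each question type contributes equally."""
    return sum(scores[t] for t in counts)


def count_weights():
    """Weights each type would get if scores were weighted by question count."""
    return {t: n / total for t, n in counts.items()}


def count_weighted(scores):
    """Hypothetical alternative: weight each score by its share of questions."""
    weights = count_weights()
    return sum(weights[t] * scores[t] for t in counts)


if __name__ == "__main__":
    for qtype, w in count_weights().items():
        print(f"{qtype}: {w:.3f}")
```

Under count-proportional weighting, T/F would receive about 0.606 of the total weight, Numerical about 0.261, and MC about 0.133, whereas the plain sum treats all three types equally.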
OK, that makes sense. Thank you for clarifying.
Posted by: matravox @ Dec. 9, 2022, 5:38 a.m.