Theoretically-Grounded Commonsense Reasoning (TG-CSR) is a benchmark for systematically evaluating machine commonsense in a few-shot question answering (QA) setting. We call the benchmark theoretically grounded because each question and its candidate answers fall under a “representational area” in the Gordon-Hobbs theory of commonsense, such as “time”, “space”, and “emotions”. The goal of the benchmark is to assess whether a single machine commonsense model, such as a language representation model, can answer questions across all of these categories given only a small amount of training data.
The task format of TG-CSR (Multi-set) is multi-set QA. For each category, a set of candidate answers (roughly 10 per category) is provided and shared among all questions in that category. However, to make the questions easier for large-scale models to handle, we “pair” each question with a single candidate answer. The task then becomes binary QA: a prediction of “yes” (encoded as 1) indicates that the answer fits the question, and “no” (encoded as 0) indicates that it does not, as sketched below.
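The conversion from multi-set QA to binary QA can be illustrated as follows. This is a minimal Python sketch, not the benchmark's official preprocessing code: the function and field names (`to_binary_pairs`, `correct_answers`, the example candidates) are our own, and the benchmark's actual data schema is documented in the starting kit.

```python
# Illustrative sketch: pair one question with every candidate answer in its
# category, turning multi-set QA into binary ("yes"/"no") QA instances.
# Names and example data are hypothetical; see the starting kit for the
# benchmark's real format.

def to_binary_pairs(question, candidate_answers, correct_answers):
    """Return (question, answer, label) triples, where label is 1 ("yes",
    the answer fits the question) and 0 ("no") otherwise."""
    pairs = []
    for answer in candidate_answers:
        label = 1 if answer in correct_answers else 0
        pairs.append((question, answer, label))
    return pairs

# A category shares its candidate answers across all of its questions.
candidates = ["a passport", "a snowmobile", "sunscreen"]
pairs = to_binary_pairs(
    "What should you pack for a beach vacation abroad?",
    candidates,
    correct_answers={"a passport", "sunscreen"},
)
for q, a, y in pairs:
    print(y, "|", a)
```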
TG-CSR is a phased benchmark: the initial dataset consists of questions set in a “vacationing abroad” context, for which training, development, and test question sets are provided. More details on format and submission are given in the starting kit. TG-CSR is initially designed as a few-shot problem, but will later become a zero-shot problem for which no training or development sets are provided. In upcoming phases, we will release three additional question sets with the same format and Gordon-Hobbs representational areas, each designed around a different context.
Please cite the following paper if you are using this benchmark:
Santos H, Shen K, Mulvehill AM, Razeghi Y, McGuinness DL, Kejriwal M. A Theoretically Grounded Benchmark for Evaluating Machine Commonsense. arXiv preprint arXiv:2203.12184. 2022 Mar 23.
Evaluation Criteria: F1-score
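For reference, binary F1 on the positive (“yes”) class can be computed as below. This is a minimal sketch assuming predictions follow the 1/0 encoding described above and that scikit-learn is available; the official scoring program in the starting kit remains authoritative.

```python
# Minimal sketch of binary F1 scoring for the 1 ("yes") / 0 ("no") encoding.
# The gold and predicted labels here are made-up examples.
from sklearn.metrics import f1_score

gold = [1, 0, 1, 1, 0, 0]   # ground-truth labels per question-answer pair
pred = [1, 0, 0, 1, 1, 0]   # model predictions

# F1 is the harmonic mean of precision and recall on the positive class.
print(f1_score(gold, pred))
```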
To maintain the integrity of the test set, we ask that participants not manually read or answer the questions in either the dev or test set, even though we provide labels for the dev set so that you can tune your system.
Start: March 19, 2022, midnight
End: Never
Leaderboard:
# | Username | Score
---|---|---
1 | mayankkejriwal | 0.3934 |
2 | keshen | 0.3729 |
3 | AliceM | 0.3158 |