CommonsenseQA 2.0 extends classic commonsense QA with more diverse relations and more carefully constructed distractors, making superficial pattern matching less effective. Questions target practical knowledge about objects, social situations, and cause and effect, emphasizing the difference between plausible and correct answers.
The benchmark is designed to better reflect real-world ambiguity and to reduce annotation artifacts that can inflate scores. As a result, high performance often indicates genuine conceptual understanding rather than exploitation of dataset biases.
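Evaluation on a benchmark like this typically reduces to exact-match accuracy over the gold answers. The sketch below is illustrative only: the item fields (`question`, `choices`, `answer`) and the toy examples are assumptions, not the benchmark's actual schema.

```python
# Minimal sketch of multiple-choice accuracy scoring. Field names and the
# example items are hypothetical, not the benchmark's real schema.

def accuracy(examples, predict):
    """Fraction of examples where predict(example) matches the gold answer."""
    correct = sum(1 for ex in examples if predict(ex) == ex["answer"])
    return correct / len(examples)

# Toy items in the spirit of the benchmark: every choice is plausible,
# but only one is correct.
examples = [
    {"question": "Can a pocket mirror start a fire in sunlight?",
     "choices": ["yes", "no"], "answer": "no"},
    {"question": "Is a frozen lake safe to walk on the day it freezes?",
     "choices": ["yes", "no"], "answer": "no"},
]

# A trivial baseline that always picks the first choice; a real system
# would rank the choices with a model instead.
baseline = lambda ex: ex["choices"][0]
print(accuracy(examples, baseline))  # 0.0 for these two items
```

Because the benchmark reduces annotation artifacts, such shortcut baselines should score near chance, which is exactly what makes high scores more meaningful.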
A harder, bias-reduced multiple-choice benchmark that probes everyday commonsense beyond lexical shortcuts.