About ARC AI2

The ARC AI2 benchmark (AI2 Reasoning Challenge) contains multiple-choice science questions at the elementary and middle-school level that require factual knowledge, commonsense, and multi-hop reasoning. The questions are short and easy for humans to read, but shallow cues alone are not enough to answer them, so a model must combine knowledge with reasoning to select the correct option.

ARC AI2 is widely used to measure science understanding under constrained context. Reported results typically aggregate across the Easy and Challenge splits, with the Challenge split emphasizing non-trivial reasoning and resistance to annotation artifacts.
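
The text above does not say how results are aggregated across the Easy and Challenge splits, so the following is only a minimal sketch of two common possibilities: a micro-average, where every question counts equally, and a macro-average, where every split counts equally. The Prediction class, the split labels "easy" and "challenge", and the toy data are all hypothetical and chosen for illustration; they are not taken from any official evaluation script.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    split: str      # "easy" or "challenge" (hypothetical labels)
    gold: str       # label of the correct option, e.g. "B"
    predicted: str  # label of the option the model selected


def accuracy(preds):
    """Fraction of predictions whose selected option matches the gold label."""
    if not preds:
        raise ValueError("no predictions to score")
    return sum(p.predicted == p.gold for p in preds) / len(preds)


def summarize(preds):
    """Return per-split accuracy plus two possible aggregates."""
    by_split = {}
    for p in preds:
        by_split.setdefault(p.split, []).append(p)
    per_split = {name: accuracy(items) for name, items in by_split.items()}
    return {
        "per_split": per_split,
        # Micro-average: every question is weighted equally.
        "micro": accuracy(preds),
        # Macro-average: every split is weighted equally.
        "macro": sum(per_split.values()) / len(per_split),
    }


# Toy data, invented for illustration only.
toy = [
    Prediction("easy", "A", "A"),
    Prediction("easy", "C", "C"),
    Prediction("easy", "B", "D"),
    Prediction("easy", "D", "D"),
    Prediction("challenge", "B", "B"),
    Prediction("challenge", "A", "C"),
]
summary = summarize(toy)
print(summary)
```

Because the two splits differ in size, the micro and macro figures generally disagree, so it helps to state which one a reported number uses, or to report per-split accuracy alongside it.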