CritPt

CritPt (short for “Complex Research using Integrated Thinking - Physics Test,” and a play on “critical point”) is the first benchmark to test whether AI can reason through complex, open-ended, research-level physics problems. Created by more than 50 active physics researchers across over 30 institutions, led by Argonne National Laboratory and the University of Illinois, its problems are modeled on entry-level original research projects rather than textbook exercises and span around a dozen modern physics subfields. All problems are unpublished and original to avoid contamination.

Methodology

We source results from the public CritPt leaderboard, hosted by Artificial Analysis.

CritPt contains 71 composite research challenges, decomposed into roughly 190 checkpoint sub-tasks. Domain experts created the challenges, and each one received more than 40 hours of expert review on average. Answers use guess-resistant, machine-verifiable formats such as numerical arrays, symbolic expressions, and Python functions, and are scored by an automated grading pipeline customized for physics. The headline metric is accuracy, the fraction of problems answered correctly; scores on the leaderboard are very low, reflecting the benchmark’s difficulty.
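To make the idea of a machine-verifiable answer concrete, the sketch below shows one way a numerical-array answer could be checked against a reference and rolled up into an accuracy figure. It is an illustration only and not the CritPt grading pipeline: the function names `numeric_answers_match` and `accuracy`, the tolerance values, and the sample answers are all hypothetical, and the real grader also handles symbolic expressions and Python functions, which this sketch does not attempt.

```python
import math
from typing import Sequence


def numeric_answers_match(
    submitted: Sequence[float],
    reference: Sequence[float],
    rel_tol: float = 1e-6,   # hypothetical tolerance, not taken from CritPt
    abs_tol: float = 1e-9,   # hypothetical tolerance, not taken from CritPt
) -> bool:
    """Return True if both arrays have the same length and agree element-wise."""
    if len(submitted) != len(reference):
        return False
    return all(
        math.isclose(s, r, rel_tol=rel_tol, abs_tol=abs_tol)
        for s, r in zip(submitted, reference)
    )


def accuracy(results: Sequence[bool]) -> float:
    """Fraction of graded items marked correct."""
    if not results:
        raise ValueError("no graded results to average")
    return sum(results) / len(results)


# Made-up submissions and references, purely for illustration.
graded = [
    numeric_answers_match([1.0, 2.0], [1.0, 2.0]),          # exact match
    numeric_answers_match([1.0, 2.5], [1.0, 2.0]),          # one wrong entry
    numeric_answers_match([3.14159], [3.14159, 2.71828]),   # wrong length
    numeric_answers_match([0.3333333], [1 / 3]),            # within tolerance
]

print(graded)
print(accuracy(graded))
```

Requiring every element of an array to match within a tight tolerance is what makes such a format hard to guess: a partially correct or lucky answer still counts as wrong.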

For full details, see the CritPt paper and code.