GDPVal evaluates models on workplace tasks drawn from nine sectors of the U.S. economy, including finance and insurance, government, healthcare, manufacturing, and information. These sectors were selected because they collectively account for the largest share of U.S. GDP. Within each sector, the benchmark covers the five highest-earning predominantly digital occupations, for a total of 44 occupations and 1,320 tasks.
Tasks represent actual work deliverables, such as documents, spreadsheets, and presentations. Scoring uses blinded pairwise comparisons: a domain expert sees only the task and two unlabeled deliverables (the model’s output and a human expert’s), and ranks them without knowing which is which. The result is a win, tie, or loss for the model against the human baseline.
We source GDPVal results from the public GDPVal leaderboard. The public leaderboard reports at least two aggregate metrics: a win-rate metric and a wins-plus-ties metric. For this page, the default chart uses the public win-rate metric, and the dropdown exposes the wins-plus-ties metric when available in the exported data.
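To make the two aggregate metrics concrete, here is a minimal sketch of how they could be computed from per-task grading outcomes. It assumes each comparison is recorded as one of the strings "win", "tie", or "loss" from the model's perspective, and that both metrics are simple fractions of all graded comparisons. The function name, the outcome labels, and the sample data are illustrative assumptions, not the leaderboard's actual export format or aggregation code.

```python
from collections import Counter
from typing import Dict, Iterable

# Assumed outcome labels, from the model's perspective (hypothetical encoding).
VALID_OUTCOMES = {"win", "tie", "loss"}


def aggregate_outcomes(outcomes: Iterable[str]) -> Dict[str, float]:
    """Compute win rate and wins-plus-ties rate over pairwise comparisons.

    Assumes both metrics use the total number of graded comparisons
    as the denominator.
    """
    counts = Counter(outcomes)

    unknown = set(counts) - VALID_OUTCOMES
    if unknown:
        raise ValueError(f"unexpected outcome labels: {sorted(unknown)}")

    total = sum(counts.values())
    if total == 0:
        raise ValueError("no comparisons to aggregate")

    return {
        "win_rate": counts["win"] / total,
        "wins_plus_ties_rate": (counts["win"] + counts["tie"]) / total,
    }


# Made-up example: 2 wins, 2 ties, 4 losses across 8 comparisons.
sample = ["win", "loss", "tie", "loss", "win", "loss", "loss", "tie"]
metrics = aggregate_outcomes(sample)
print(metrics)  # {'win_rate': 0.25, 'wins_plus_ties_rate': 0.5}
```

Under these assumptions the wins-plus-ties rate can never be lower than the win rate, since ties only add to the numerator. This is why the two series can diverge noticeably for models that often match the human baseline without beating it.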
A benchmark measuring model performance on well-specified tasks drawn from selected real-world occupations.