GDP.pdf

GDP.pdf, created by Surge AI, is a multimodal reasoning benchmark built from real-world prompts and PDFs drawn from expert professional workflows. The benchmark tests whether models can read documents such as manuals, wiring diagrams, blueprints, and technical notes, select the relevant information, and answer questions without relying on plausible but incorrect details.

Tasks are graded against rubrics that identify the required facts and common failure modes. The public benchmark page includes example tasks, a research paper, and a Hugging Face dataset link.

Methodology

We source GDP.pdf results from the public Surge AI GDP.pdf leaderboard. The leaderboard reports the percentage of benchmark tasks solved correctly. Our chart plots this public score.

GDP.pdf evaluates models on 100 tasks drawn from professional PDF workflows across ten domains: finance, healthcare, legal, engineering, construction, manufacturing and supply chain, insurance, real estate, HR, and STEM or research. Each task pairs a prompt with one or more PDFs and requires the model to locate relevant information, interpret tables, diagrams, forms, or contractual language, and synthesize an answer. Responses are graded with task-specific rubrics that identify required facts and common failure modes. The main leaderboard score is the share of tasks for which the model satisfies all rubric criteria, so a response that misses even one criterion earns no credit for that task.
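As a rough illustration of the all-or-nothing scoring described above, the sketch below computes a leaderboard-style percentage from per-criterion pass/fail results. The data layout, the function names (`task_solved`, `leaderboard_score`), and the choice to treat a task with no criteria as unsolved are assumptions made for this example; they are not taken from the Surge AI grading code.

```python
from typing import Dict, List


def task_solved(criteria: List[bool]) -> bool:
    """Return True only if every rubric criterion for the task is met.

    Assumption: a task with no recorded criteria counts as unsolved.
    """
    return len(criteria) > 0 and all(criteria)


def leaderboard_score(results: Dict[str, List[bool]]) -> float:
    """Percentage of tasks for which all rubric criteria are satisfied."""
    if not results:
        raise ValueError("no graded tasks")
    solved = sum(task_solved(criteria) for criteria in results.values())
    return 100.0 * solved / len(results)


# Hypothetical grading results: one list of pass/fail flags per task.
example = {
    "task_a": [True, True, True],   # all criteria met: solved
    "task_b": [True, False, True],  # one criterion missed: not solved
    "task_c": [True, True],         # solved
    "task_d": [False],              # not solved
}

score = leaderboard_score(example)
print(f"{score:.1f}% of tasks solved")
```

Under this scheme, partial credit within a task does not count: task_b meets two of its three criteria but contributes nothing to the score.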
