GDP.pdf, created by Surge AI, is a multimodal reasoning benchmark built from real-world prompts and PDFs drawn from expert professional workflows. The benchmark tests whether models can read documents such as manuals, wiring diagrams, blueprints, and technical notes, select the relevant information, and answer questions without relying on plausible but incorrect details.
Tasks are graded against rubrics that identify the required facts and common failure modes. The public benchmark page includes example tasks, a research paper, and a Hugging Face dataset link.
We source GDP.pdf results from the public Surge AI GDP.pdf leaderboard. The leaderboard reports the percentage of benchmark tasks solved correctly. Our chart plots this public score.
GDP.pdf evaluates models on 100 tasks drawn from professional PDF workflows across ten domains: finance, healthcare, legal, engineering, construction, manufacturing and supply chain, insurance, real estate, HR, and STEM or research. Each task pairs a prompt with one or more PDFs and requires the model to locate relevant information, interpret tables, diagrams, forms, or contractual language, and synthesize an answer. Responses are graded with task-specific rubrics that identify required facts and common failure modes. The main leaderboard score is all-or-nothing per task: it reflects the share of tasks for which the model satisfies every rubric criterion, so a response that misses a single criterion does not count as solved.
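To make the scoring rule concrete, here is a minimal sketch of how an all-criteria-must-pass score could be computed. It is an illustration under stated assumptions, not Surge AI's grading code: the `TaskResult` type, the `task_solved` and `leaderboard_score` helpers, and the example data are all hypothetical, and the real rubric format may differ.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical record of one graded task (not the benchmark's real schema)."""
    task_id: str
    criteria_met: list[bool]  # one entry per rubric criterion


def task_solved(result: TaskResult) -> bool:
    # A task counts as solved only if every rubric criterion is satisfied.
    # Assumption: a task with no criteria is treated as unsolved.
    return len(result.criteria_met) > 0 and all(result.criteria_met)


def leaderboard_score(results: list[TaskResult]) -> float:
    # Percentage of tasks for which all rubric criteria are met.
    if not results:
        raise ValueError("no task results to score")
    solved = sum(task_solved(r) for r in results)
    return 100.0 * solved / len(results)


# Made-up example data: two of four tasks pass every criterion.
example = [
    TaskResult("task-1", [True, True, True]),
    TaskResult("task-2", [True, False]),
    TaskResult("task-3", [True]),
    TaskResult("task-4", [False, False, True]),
]
score = leaderboard_score(example)
print(f"{score:.1f}% of tasks solved")
```

Under this rule, partial credit within a task does not raise the score: `task-4` meets one of its three criteria but contributes nothing to the total.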
A benchmark testing whether models can answer questions about real-world professional PDFs.