About APEX-Agents

APEX was created by Mercor to assess frontier AI models on professional tasks that would typically take seasoned practitioners hours to complete. The benchmark contains 400 test cases – 100 per domain – spanning investment banking, management consulting, big law, and primary care medicine. Each case consists of a task prompt reflecting real-world professional workflows, a set of source documents (averaging around 26,000 tokens per case), and a detailed grading rubric. The tasks and rubrics were designed by domain experts with an average of over seven years of industry experience, and each case underwent a secondary expert review for quality control.

Evaluation is rubric-based: each rubric decomposes response quality into discrete, objective criteria assessed as pass or fail – analogous to unit tests for code. Each model is given every prompt eight times, and each response is graded by a judge LM against the expert-written criteria. The final score is the mean percentage of rubric criteria satisfied across those attempts. The benchmark is designed so that tasks represent authentic, high-value deliverables – such as financial models, contract analyses, clinical assessments, and strategic recommendations – rather than abstract reasoning exercises.
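
The scoring described above can be sketched in a few lines. This is a minimal illustration, not APEX's actual grading code: the function name, data layout, and toy verdicts are assumptions; it only shows the arithmetic of averaging pass/fail criteria over repeated attempts.

```python
from statistics import mean

def score_case(criterion_results: list[list[bool]]) -> float:
    """Score one test case from judge verdicts (illustrative sketch).

    criterion_results: one inner list per model attempt (APEX uses 8),
    each entry a pass/fail verdict for one rubric criterion.
    Returns the mean percentage of criteria satisfied across attempts.
    """
    per_attempt = [100 * sum(verdicts) / len(verdicts)
                   for verdicts in criterion_results]
    return mean(per_attempt)

# Toy example: 8 attempts, each passing 3 of 4 rubric criteria
verdicts = [[True, True, False, True]] * 8
print(score_case(verdicts))  # → 75.0
```

A case-level score like this would then be averaged over all cases in a domain to produce the domain score.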