PostTrainBench

PostTrainBench evaluates CLI agents on a constrained post-training task: given a small base language model (1–4B parameters), a single H100 GPU, and a 10-hour time window, the agent must improve the model’s performance through post-training techniques of its own choosing. The agent has complete freedom in its approach, including what data to use, how to fine-tune, and how to allocate compute. No predefined strategy, starter code, or human interaction is permitted. Results are aggregated across four base models and seven downstream evaluation benchmarks, so the final score reflects broad post-training competence rather than performance on any single model or task.
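The cross-model, cross-benchmark aggregation can be sketched as a plain average over every (model, benchmark) cell. This is a minimal illustration, not the official scoring code; the model and benchmark names and the scores below are placeholders.

```python
from statistics import mean

# Hypothetical per-(model, benchmark) scores for one agent run.
# PostTrainBench aggregates over 4 base models x 7 benchmarks;
# the shape mirrors that, but the values are dummies.
scores = {
    (model, bench): 0.5
    for model in ["model-1", "model-2", "model-3", "model-4"]
    for bench in ["bench-1", "bench-2", "bench-3", "bench-4",
                  "bench-5", "bench-6", "bench-7"]
}

def aggregate(scores):
    """Average across all 28 (model, benchmark) cells, so the final
    score reflects broad competence rather than one model or task."""
    return mean(scores.values())

print(aggregate(scores))  # 0.5 for the uniform dummy scores
```

Because every cell carries equal weight, a method cannot climb the leaderboard by overfitting a single base model or a single downstream benchmark.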

Methodology

We source PostTrainBench results from the public PostTrainBench leaderboard. The leaderboard reports an average score for each submitted method and also records the scaffold or agent framework used for each run. Our chart plots the average leaderboard score and exposes scaffold information in the tooltip.
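The chart-preparation step described above can be sketched as follows. The column names and leaderboard rows are assumptions for illustration, not the real PostTrainBench export schema.

```python
import csv
import io

# Hypothetical leaderboard export; field names and values are
# placeholders, not the actual PostTrainBench data.
raw = """method,scaffold,average_score
method-a,scaffold-x,0.41
method-b,scaffold-y,0.37
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# One chart point per method: y is the average leaderboard score,
# and the scaffold is surfaced via the tooltip text.
points = [
    {
        "label": row["method"],
        "y": float(row["average_score"]),
        "tooltip": f"scaffold: {row['scaffold']}",
    }
    for row in rows
]

for point in points:
    print(point["label"], point["y"], point["tooltip"])
```

Keeping the scaffold in the tooltip rather than the axis keeps the chart focused on scores while still letting readers attribute each run to its agent framework.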