PostTrainBench

PostTrainBench evaluates CLI agents on a constrained post-training task: given a small base language model (1–4B parameters), a single H100 GPU, and a 10-hour time window, the agent must improve the model’s performance through post-training techniques of its own choosing. The agent has complete freedom in its approach, including what data to use, how to fine-tune, and how to allocate compute. No predefined strategy, starter code, or human interaction is permitted. Results are aggregated across four base models and seven downstream evaluation benchmarks, so the final score reflects broad post-training competence rather than performance on any single model or task.
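The cross-model, cross-benchmark aggregation can be sketched as a plain average over every (model, benchmark) cell. This is a minimal illustration, not the official scoring code; the model and benchmark names and the scores below are placeholders.

```python
from statistics import mean

# Hypothetical per-(model, benchmark) scores for one agent run.
# PostTrainBench aggregates over 4 base models x 7 benchmarks;
# the shape mirrors that, but the values are dummies.
scores = {
    (model, bench): 0.5
    for model in ["model-1", "model-2", "model-3", "model-4"]
    for bench in ["bench-1", "bench-2", "bench-3", "bench-4",
                  "bench-5", "bench-6", "bench-7"]
}

def aggregate(scores):
    """Average across all 28 (model, benchmark) cells, so the final
    score reflects broad competence rather than one model or task."""
    return mean(scores.values())

print(aggregate(scores))  # 0.5 for the uniform dummy scores
```

Because every cell carries equal weight, a method cannot climb the leaderboard by overfitting a single base model or a single downstream benchmark.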

Methodology

We source PostTrainBench results from the public PostTrainBench leaderboard. The leaderboard reports an average score for each submitted method and also records the scaffold or agent framework used for each run. Our chart plots the average leaderboard score and exposes scaffold information in the tooltip.
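The chart-preparation step described above can be sketched as follows. The column names and leaderboard rows are assumptions for illustration, not the real PostTrainBench export schema.

```python
import csv
import io

# Hypothetical leaderboard export; field names and values are
# placeholders, not the actual PostTrainBench data.
raw = """method,scaffold,average_score
method-a,scaffold-x,0.41
method-b,scaffold-y,0.37
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# One chart point per method: y is the average leaderboard score,
# and the scaffold is surfaced via the tooltip text.
points = [
    {
        "label": row["method"],
        "y": float(row["average_score"]),
        "tooltip": f"scaffold: {row['scaffold']}",
    }
    for row in rows
]

for point in points:
    print(point["label"], point["y"], point["tooltip"])
```

Keeping the scaffold in the tooltip rather than the axis keeps the chart focused on scores while still letting readers attribute each run to its agent framework.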