The tasks considered include
Each model’s time horizon is estimated by fitting a logistic model of success rate against human task completion time and reporting the task duration at which the model is predicted to succeed 50% of the time. METR also reports an 80% time horizon, but we do not currently collect or update that series here.
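As a rough sketch of what this estimate amounts to, the fitted curve and the resulting horizon can be written as below. The symbols \beta_0, \beta_1, t, and t_p are our own notation for illustration, not METR’s, and the choice of logarithm base is an assumption that does not change the resulting horizon.

```latex
% Illustrative notation only: t is the human completion time of a task,
% \beta_0 and \beta_1 are the fitted intercept and slope.
P(\text{success} \mid t) = \sigma(\beta_0 + \beta_1 \log t),
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}

% Time horizon at success probability p (p = 0.5 or p = 0.8):
t_p = \exp\!\left( \frac{\operatorname{logit}(p) - \beta_0}{\beta_1} \right),
\qquad \operatorname{logit}(p) = \ln\frac{p}{1 - p}

% For p = 0.5, logit(p) = 0, so
t_{50} = \exp\!\left( -\frac{\beta_0}{\beta_1} \right)
```

Because success becomes less likely on longer tasks, \beta_1 is normally negative, which makes the 80% horizon shorter than the 50% horizon.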
We source time horizons directly from METR’s own analysis. The live leaderboard is available at metr.org/time-horizons.
The methodology to estimate a model’s time horizon is as follows:
Collect performance data: Evaluate the model on every task in HCAST, RE-Bench, and SWAA, running each model around 8 times per task. Most models are evaluated with METR’s modular-public agent scaffold, although the scaffold was slightly modified for o1 and o1-preview. Separately, run human baselining experiments to obtain human completion times for these tasks.
Estimate time horizon: For each AI model, fit a logistic regression curve that predicts the probability of task success from the logarithm of the human completion time for that task. The model’s 50% time horizon is the human task completion time at which its fitted curve crosses the 50% success probability threshold. This metric represents the estimated time (in minutes or hours) that a human expert would typically take to complete tasks that the AI model completes with a 50% success rate.
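The two steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not METR’s implementation: the data are made up, the function names `fit_logistic` and `time_horizon` are hypothetical, the fit is a plain unweighted maximum-likelihood logistic regression solved with Newton’s method, and base-2 logarithms are used for convenience.

```python
import math

def fit_logistic(xs, ys, max_iter=50, tol=1e-10):
    """Fit P(success) = sigmoid(b0 + b1 * x) by Newton's method.

    Assumes the data are not perfectly separable, so the
    maximum-likelihood estimate exists. Returns (b0, b1).
    """
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        g0 = g1 = 0.0             # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0     # entries of X^T W X
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det   # Newton step, solved in closed form
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

def time_horizon(b0, b1, p=0.5):
    """Human task time (minutes) at which predicted success equals p."""
    logit = math.log(p / (1.0 - p))
    return 2.0 ** ((logit - b0) / b1)      # invert the log2 transform

# Hypothetical data: 5 tasks, 8 runs each (not real results).
runs_per_task = 8
human_minutes = [2, 4, 8, 16, 32]          # human completion times
successes = [7, 6, 4, 2, 1]                # successful runs per task

xs, ys = [], []
for minutes, k in zip(human_minutes, successes):
    for i in range(runs_per_task):
        xs.append(math.log2(minutes))      # predictor: log of human time
        ys.append(1 if i < k else 0)       # outcome: 1 = success

b0, b1 = fit_logistic(xs, ys)
h50 = time_horizon(b0, b1, p=0.5)
h80 = time_horizon(b0, b1, p=0.8)
print(f"50% horizon: {h50:.2f} min, 80% horizon: {h80:.2f} min")
```

The toy data are symmetric around the 8-minute task, so the fitted 50% horizon lands at 8 minutes, and the 80% horizon is shorter because the curve slopes downward with task length. A real analysis would also need uncertainty estimates and any task weighting described in METR’s papers, both of which this sketch omits.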
For full details on the task suites, human baselining, agent setups, and curve fitting, please refer to METR’s papers “Measuring AI Ability to Complete Long Tasks”, “RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts”, and “HCAST: Human-Calibrated Autonomy Software Tasks”.
Durations of the longest task that models can complete correctly more often than not, across a set of software engineering and related tasks.