METR Time Horizons
The duration of the longest tasks that models can complete correctly more often than not, across a set of software engineering and related tasks.
About METR Time Horizons
The tasks considered include:
- RE-Bench, a set of machine learning research engineering tasks,
- HCAST, a more general set of challenging software engineering tasks, including ML engineering, and
- SWAA, a set of smaller tasks that involve operating computer software.
A model's final duration is assigned based on the longest tasks it can reliably complete, where the duration of a task is measured by the time it takes a human to complete it.
Methodology
We source time horizons directly from METR’s own analysis.
The methodology to estimate a model’s time horizon is as follows:
- Collect performance data: For each task in HCAST, RE-Bench, and SWAA, evaluate the model's performance, running each model around 8 times per task. Most models are evaluated with METR's modular-public agent scaffold, although the scaffold was slightly modified for o1 and o1-preview. Then, run human baselining experiments to obtain human completion times for these tasks.
- Estimate time horizon: For each AI model, fit a logistic regression curve that predicts the probability of task success from the logarithm of the human completion time. The model's 50% time horizon is the human completion time at which the fitted logistic curve crosses the 50% success probability threshold. This metric represents the estimated time (in minutes or hours) that a human expert would typically take to complete tasks which the AI model can complete with a 50% success rate (see the sketch after this list).
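As a rough illustration of the curve fit, the sketch below estimates a 50% time horizon with scikit-learn's LogisticRegression. The per-run data here (human completion times and success outcomes) is invented for illustration; METR's actual data, weighting, and fitting procedure are described in their papers and may differ in detail.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-run records: human completion time (minutes) for each task,
# and whether the model's run on that task succeeded (1) or failed (0).
# Real data would come from ~8 runs per task on HCAST, RE-Bench, and SWAA.
human_minutes = np.array([0.5, 1, 2, 4, 8, 15, 30, 60, 120, 240, 480, 960])
success = np.array([1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0])

# Fit P(success) = sigmoid(b0 + b1 * log t) against the log of human time.
X = np.log(human_minutes).reshape(-1, 1)
fit = LogisticRegression().fit(X, success)
b0, b1 = fit.intercept_[0], fit.coef_[0, 0]

# The curve crosses 50% where b0 + b1 * log(t50) = 0, i.e. t50 = exp(-b0 / b1).
t50 = np.exp(-b0 / b1)
print(f"Estimated 50% time horizon: {t50:.1f} human-minutes")
```

Note that scikit-learn applies L2 regularization by default, whereas METR's published fit may be a plain maximum-likelihood logistic fit, so this should be read only as an illustration of how the 50% horizon falls out of the fitted curve.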
For full details on the task suites, human baselining, agent setups, and curve fitting, please refer to METR’s papers Measuring AI Ability to Complete Long Tasks, RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts, and HCAST: Human-Calibrated Autonomy Software Tasks.