METR Time Horizons


About METR Time Horizons

The tasks considered include:

RE-Bench, a set of machine learning research engineering tasks,
HCAST, a more general set of challenging software engineering tasks, including ML engineering, and
SWAA, a set of smaller tasks that involve operating computer software.

Each model is assigned a time horizon based on the longest tasks it can complete at a given success rate (50% here). For this purpose, a task's duration is the time it takes a human to complete that task.

Methodology

We source time horizons directly from METR’s own analysis.

The methodology to estimate a model’s time horizon is as follows:

Collect performance data: Evaluate the model on every task in HCAST, RE-Bench, and SWAA. Each model was run around 8 times on each task. Most models are evaluated with METR’s modular-public agent scaffold, although the scaffold was slightly modified for o1 and o1-preview. Human baselining experiments are then run to obtain human completion times for these tasks.
Estimate time horizon: For each AI model, fit a logistic regression curve that predicts the probability of task success from the logarithm of the human completion time for that task. The model’s 50% time horizon is the human task completion time at which its fitted logistic curve crosses the 50% success probability threshold. This metric represents the estimated time (in minutes or hours) that a human expert would typically take to complete tasks that the AI model completes with a 50% success rate.
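
The curve-fitting step above can be sketched in a few lines of Python. This is a minimal illustration, not METR's code: it assumes the model P(success) = sigmoid(a + b * log2(human minutes)), fits a and b by plain Newton-Raphson maximum likelihood, and omits any refinements described in METR's papers. The function names and the example data are hypothetical.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(xs, ys, iters=100, tol=1e-10):
    """Fit P(y=1) = sigmoid(a + b*x) by Newton-Raphson.

    Assumes the data are not perfectly separable, so a finite
    maximum-likelihood solution exists.
    """
    a, b = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(a + b * x)
            w = p * (1.0 - p)
            # Gradient and Hessian of the negative log-likelihood.
            g0 += p - y
            g1 += (p - y) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a -= da
        b -= db
        if abs(da) + abs(db) < tol:
            break
    return a, b

def time_horizon_minutes(runs, p=0.5):
    """runs: list of (human_minutes, success) pairs, success in {0, 1}."""
    xs = [math.log2(t) for t, _ in runs]
    ys = [float(s) for _, s in runs]
    a, b = fit_logistic(xs, ys)
    # Solve sigmoid(a + b*x) = p for x, then undo the log2 transform.
    x_star = (math.log(p / (1.0 - p)) - a) / b
    return 2.0 ** x_star

# Hypothetical data: 9 tasks, 8 runs each, with made-up success counts.
human_minutes = [1, 2, 4, 8, 16, 32, 64, 128, 256]
successes_of_8 = [8, 8, 7, 6, 4, 2, 1, 0, 0]
runs = []
for t, k in zip(human_minutes, successes_of_8):
    runs += [(t, 1)] * k + [(t, 0)] * (8 - k)

horizon = time_horizon_minutes(runs)
print(f"50% time horizon: about {horizon:.1f} human-minutes")
```

The made-up success counts are symmetric around the 16-minute task, so the fitted 50% horizon lands at about 16 minutes. Passing a higher threshold, such as `p=0.8`, reads off a shorter horizon from the same fitted curve.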

For full details on the task suites, human baselining, agent setups, and curve fitting, please refer to METR’s papers: “Measuring AI Ability to Complete Long Tasks”, “RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts”, and “HCAST: Human-Calibrated Autonomy Software Tasks”.