METR Time Horizons

Externally evaluated

Methodology

We source time horizons directly from METR’s own analysis.

The methodology to estimate a model’s time horizon is as follows:

  1. Collect performance data: For each task in HCAST, RE-Bench, and SWAA, evaluate the model’s performance; each model is run around 8 times per task. Most models are evaluated with METR’s modular-public agent scaffold, though the scaffold was slightly modified for o1 and o1-preview. Human baselining experiments are then run to obtain human completion times for these tasks.

  2. Estimate time horizon: For each AI model, fit a logistic regression curve that predicts the probability of task success from the logarithm of the human completion time for that task. The model’s 50% time horizon is the human completion time at which the fitted logistic curve crosses the 50% success probability threshold. This metric represents the estimated time (in minutes or hours) that a human expert would typically take to complete the tasks which the AI model can complete with a 50% success rate.
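The fitting step above can be sketched in a few lines. This is an illustrative reconstruction, not METR’s actual code: it assumes a natural-log time feature, binary per-run success labels, and a near-unregularized logistic fit; METR’s exact featurization and fitting details are described in their papers.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_time_horizon(human_minutes, successes):
    """Fit a logistic regression of success on log(human completion time)
    and return the 50% time horizon in minutes.

    human_minutes: human baseline completion time for each task attempt.
    successes: 1 if the model succeeded on that attempt, else 0.
    """
    X = np.log(np.asarray(human_minutes, dtype=float)).reshape(-1, 1)
    y = np.asarray(successes)
    clf = LogisticRegression(C=1e6)  # large C ≈ unregularized maximum likelihood
    clf.fit(X, y)
    w = clf.coef_[0][0]
    b = clf.intercept_[0]
    # P(success) = 0.5 where w * log(t) + b = 0, i.e. t = exp(-b / w)
    return float(np.exp(-b / w))

# Toy example (fabricated data): the model succeeds on short tasks
# and fails on long ones, so the horizon falls near the crossover.
times = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
succ  = [1, 1, 1, 1, 1,  0,  1,   0,   0,   0]
horizon = fit_time_horizon(times, succ)
```

The key design point is that the regression is fit against log time, so the horizon is read off where the fitted sigmoid crosses 0.5 and then exponentiated back to minutes.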

For full details on the task suites, human baselining, agent setups, and curve fitting, please refer to METR’s papers Measuring AI Ability to Complete Long Tasks, RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts, and HCAST: Human-Calibrated Autonomy Software Tasks.