DeepResearchBench

Externally evaluated

Methodology

During evaluation, models use the ReAct framework as their agent architecture, which allows them to plan and execute actions of their choosing. Models are limited to executing 50 actions for any individual attempt at a task. The reasoning effort allowed for thinking models is set manually (i.e., not chosen by the model/agent) and is described in the paper and on our benchmarking hub.
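A minimal sketch of what such an action-capped ReAct loop might look like, assuming hypothetical `call_model` and `run_tool` functions; the actual harness, prompts, and tool set are not specified here.

```python
# Sketch of a ReAct-style agent loop with a per-attempt action cap.
# `call_model` and `run_tool` are hypothetical stand-ins, not the real harness.

MAX_ACTIONS = 50  # per-attempt action limit described above


def call_model(transcript: str) -> dict:
    """Stand-in for the evaluated model: returns either a tool action
    to execute next or a final answer."""
    return {"type": "final", "answer": "stub answer"}


def run_tool(name: str, argument: str) -> str:
    """Stand-in tool dispatcher (e.g. a search or page-fetch tool)."""
    return f"observation for {name}({argument})"


def react_attempt(task_prompt: str) -> str:
    transcript = task_prompt
    for _ in range(MAX_ACTIONS):
        step = call_model(transcript)      # model plans its next step
        if step["type"] == "final":        # model chooses to answer
            return step["answer"]
        observation = run_tool(step["tool"], step["argument"])
        transcript += f"\nAction: {step['tool']}({step['argument']})"
        transcript += f"\nObservation: {observation}"
    return ""  # action budget exhausted without a final answer
```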

Models are instructed to produce answers that are as comprehensive as possible. In addition to the task prompt, each model is also prompted with its own selection from a list of helpful tips provided in advance.
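An illustrative sketch of how a task prompt might be combined with a model's tip selection; the wording below and the tip contents are placeholders, not the benchmark's actual instructions.

```python
# Illustrative only: assemble the final prompt from the task and the tips
# the model previously selected. Actual instruction text is an assumption.

def build_prompt(task_prompt: str, selected_tips: list[str]) -> str:
    tips_block = "\n".join(f"- {tip}" for tip in selected_tips)
    return (
        "Answer as comprehensively as possible.\n\n"
        f"Task: {task_prompt}\n\n"
        f"Tips you selected in advance:\n{tips_block}"
    )


print(build_prompt("Summarize recent findings on X.", ["Cite sources", "Check dates"]))
```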

The benchmark uses a system called RetroSearch, which serves LLM agents web pages from a previously scraped snapshot of the internet rather than live pages, maintaining a consistent environment in which to evaluate models over time.
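A hedged sketch of the idea behind RetroSearch: serve agents archived pages from a frozen snapshot so every run sees the same content. The snapshot format here (a simple URL-to-content mapping) is an assumption; the real system's storage and lookup details are not described in this summary.

```python
# Sketch of snapshot-based page serving. The snapshot structure is assumed.

from typing import Optional


class SnapshotWebServer:
    def __init__(self, snapshot: dict[str, str]):
        # snapshot maps URL -> page content captured at scrape time
        self.snapshot = snapshot

    def fetch(self, url: str) -> Optional[str]:
        # Never hit the live internet; only return the archived copy.
        return self.snapshot.get(url)


pages = {"https://example.com": "<html>archived copy</html>"}
server = SnapshotWebServer(pages)
print(server.fetch("https://example.com"))           # archived content
print(server.fetch("https://not-archived.example"))  # None: not in snapshot
```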

The score assigned to each model falls between 0 and 1 and comes from scoring methods that vary by task category. The methods employed for each task are listed on the website and include precision (the ratio of correct positive answers to all positive answers given), recall (the proportion of true positives that were answered as such), F1 (the harmonic mean of precision and recall), and a binary score (1 point for a correct answer, 0 otherwise).
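A worked sketch of these scoring formulas; how a model's answers are matched against the gold set for each task category is determined by the benchmark, and the counts used below are purely illustrative.

```python
# Precision, recall, F1, and binary scoring as defined above.

def precision(true_pos: int, false_pos: int) -> float:
    # correct positive answers / all positive answers given
    return true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0


def recall(true_pos: int, false_neg: int) -> float:
    # true positives answered as such / all true positives
    return true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0


def f1(p: float, r: float) -> float:
    # harmonic mean of precision and recall
    return 2 * p * r / (p + r) if (p + r) else 0.0


def binary(correct: bool) -> int:
    # 1 point for a correct answer, 0 otherwise
    return 1 if correct else 0


# Example with assumed counts: 8 correct positives, 2 spurious, 4 missed.
p, r = precision(8, 2), recall(8, 4)
print(round(p, 3), round(r, 3), round(f1(p, r), 3))  # 0.8 0.667 0.727
```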

We get our data from the FutureSearch DeepResearchBench leaderboard.