DeepResearchBench
A benchmark of models’ ability to gather information from the internet to answer questions, testing how well they find and synthesize information.
About DeepResearchBench
To complete the tasks, the model, acting as an agent, must work out which pages to search for and how to interpret the information it finds in order to arrive at the answer.
The benchmark contains 91 tasks of eight different types: finding or deriving a number, finding or compiling a dataset, populating a reference class, gathering evidence, validating a claim, and finding the source of some information. An example task from the number-finding category is “How many IPOs with an offer price of at least $5.00 were there in each year between 1980 and 2024?” Examples for each category of question are available on the website, while the rest are held out from public view.
Methodology
While being evaluated, models use the ReAct framework as their architecture, which allows them to behave as agents by planning and executing actions of their choosing. Models are limited to 50 actions per attempt at a task. The reasoning effort for thinking models is set manually (i.e. not chosen by the model/agent) and is described in the paper and on our benchmarking hub.
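As a rough sketch of what this means (with hypothetical function names, not FutureSearch's actual harness), a ReAct-style loop alternates model reasoning with tool calls and stops once the model produces a final answer or the 50-action budget runs out:

```python
# Illustrative ReAct-style agent loop with a hard action cap.
# call_model and run_tool are hypothetical stand-ins for the model API
# and the agent's tools; they are not part of the benchmark's code.

MAX_ACTIONS = 50  # per-task action budget described above


def run_task(task_prompt: str, call_model, run_tool) -> str | None:
    """Alternate model reasoning with tool calls until the model answers
    or the action budget is exhausted."""
    transcript = [{"role": "user", "content": task_prompt}]
    for _ in range(MAX_ACTIONS):
        step = call_model(transcript)   # model returns a thought plus either
        transcript.append(step)         # an action to take or a final answer
        if step.get("final_answer") is not None:
            return step["final_answer"]
        observation = run_tool(step["action"], step["action_input"])
        transcript.append({"role": "tool", "content": observation})
    return None  # budget exhausted without a final answer
```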
The models are instructed to produce answers that are as comprehensive as possible. In addition to the task prompts, models are also given their own selection from a list of helpful tips provided in advance.
The benchmark uses a system called RetroSearch, which serves LLM agents web pages from a previously scraped snapshot of the internet rather than the live web, maintaining a consistent environment in which to evaluate models over time.
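Conceptually, the agent's page-fetching tool reads from that frozen archive instead of making live requests, so every model sees identical page content. The sketch below assumes a hypothetical on-disk layout and is only meant to illustrate the idea, not RetroSearch's actual implementation:

```python
# Minimal sketch of snapshot-based page serving: look pages up in a
# pre-scraped archive keyed by URL rather than fetching the live web.
# The directory name and record format are assumptions for illustration.
import hashlib
import json
from pathlib import Path

SNAPSHOT_DIR = Path("snapshot/")  # hypothetical directory of archived pages


def fetch_page(url: str) -> str:
    """Return the archived copy of a page instead of hitting the live web."""
    key = hashlib.sha256(url.encode()).hexdigest()
    record_path = SNAPSHOT_DIR / f"{key}.json"
    if not record_path.exists():
        return "PAGE NOT IN SNAPSHOT"  # the agent only sees what was scraped
    record = json.loads(record_path.read_text())
    return record["content"]           # page text captured at snapshot time
```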
The score assigned to each model falls between 0 and 1 and is computed with scoring methods that vary by task category. The method employed for each task is available on the website; the methods include precision (the ratio of correct positive answers to all positive answers given), recall (the proportion of true positive items the model correctly identified), F1 (the harmonic mean of precision and recall), and a binary score (1 point for a correct answer, 0 otherwise).
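For set-style answers, these metrics reduce to the familiar formulas. The following sketch assumes an answer can be compared against a gold set of items; it only illustrates the arithmetic and is not the actual grading pipeline:

```python
# Worked example of the scoring metrics named above, assuming set-valued
# answers compared against a gold set (an assumption for illustration).

def precision_recall_f1(predicted: set, gold: set) -> tuple[float, float, float]:
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def binary_score(answer, correct_answer) -> int:
    """1 point for a correct answer, 0 otherwise."""
    return 1 if answer == correct_answer else 0


# Example: predicting {"A", "B", "C"} against gold {"B", "C", "D"}
# gives precision 2/3, recall 2/3, and F1 2/3.
```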
We get our data from the FutureSearch DeepResearchBench leaderboard.