BBH

BIG-Bench Hard (BBH) collects BIG-Bench tasks on which earlier large language models performed poorly even as scale increased, including logical deduction, algorithmic reasoning, causal inference, and mathematics. The problems resist shortcuts and reward models that can plan, chain intermediate steps, and apply rules compositionally.

BBH is frequently used in model reports as a stress test for reasoning. Improvements on BBH often correlate with better tool use, longer-context handling, and the ability to integrate intermediate computations (e.g., scratchpads or chain-of-thought) when permitted.
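
As a rough illustration of how chain-of-thought output might be scored, the sketch below pulls a final answer out of a free-form completion and computes exact-match accuracy. It assumes completions end with a phrase like "the answer is X", which is a common prompting convention rather than anything this page specifies. The names `extract_answer` and `exact_match_accuracy` are hypothetical, not part of any official BBH harness.

```python
import re

# Assumption: the model closes its reasoning with "... the answer is <answer>."
ANSWER_PATTERN = re.compile(r"answer is\s*(.+?)\.?\s*$", re.IGNORECASE)


def extract_answer(completion: str) -> str:
    """Return the text after 'answer is' on the last line, else the last line itself."""
    stripped = completion.strip()
    if not stripped:
        return ""
    last_line = stripped.splitlines()[-1]
    match = ANSWER_PATTERN.search(last_line)
    return (match.group(1) if match else last_line).strip()


def exact_match_accuracy(completions: list[str], targets: list[str]) -> float:
    """Fraction of completions whose extracted answer equals the target (case-insensitive)."""
    if len(completions) != len(targets):
        raise ValueError("completions and targets must have the same length")
    if not targets:
        return 0.0
    correct = sum(
        extract_answer(c).lower() == t.strip().lower()
        for c, t in zip(completions, targets)
    )
    return correct / len(targets)


# Hypothetical completions and targets, for illustration only.
completions = [
    "The red book is left of the blue one.\nSo the answer is (B).",
    "The answer is True",
    "no",
]
targets = ["(B)", "False", "No"]
accuracy = exact_match_accuracy(completions, targets)
```

Here the first and third completions match their targets and the second does not, so `accuracy` is 2/3. A real evaluation would likely need task-specific normalization beyond lowercasing.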
