About BBH

BIG-Bench Hard (BBH) is a subset of 23 BIG-Bench tasks on which earlier language models failed to outperform the average human rater, even as scale improved results elsewhere. The tasks span logical deduction, algorithmic reasoning, causal judgment, and multi-step arithmetic. They were selected because they resist surface-level shortcuts and reward models that can plan, chain intermediate steps, and apply rules compositionally.

BBH is frequently used in model reports as a stress test for reasoning. Improvements on BBH often coincide with better tool use, longer-context handling, and the ability to work through intermediate computations (e.g., scratchpads or chain-of-thought prompting) when permitted.
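
When intermediate reasoning is permitted, a model's output contains both the reasoning and the final answer, so scoring has to separate the two. The Python sketch below shows one way this might be done with exact-match scoring. It assumes the completion ends with a phrase such as "So the answer is X.", which is how BBH chain-of-thought prompts are commonly worded. The helper names (extract_answer, exact_match_accuracy) and the sample data are hypothetical and not taken from any official harness.

```python
import re

# Assumption: the final line of a completion ends with "... the answer is X."
# Adjust the pattern to whatever wording your prompts actually elicit.
ANSWER_PATTERN = re.compile(r"answer is\s*(.+?)\.?\s*$", re.IGNORECASE)


def extract_answer(completion: str) -> str:
    """Pull the final answer out of the last line of a completion.

    Falls back to the whole last line when no 'answer is' marker is found.
    """
    stripped = completion.strip()
    last_line = stripped.splitlines()[-1] if stripped else ""
    match = ANSWER_PATTERN.search(last_line)
    return (match.group(1) if match else last_line).strip()


def exact_match_accuracy(completions: list[str], targets: list[str]) -> float:
    """Fraction of completions whose extracted answer equals the target."""
    if len(completions) != len(targets):
        raise ValueError("completions and targets must have the same length")
    if not targets:
        return 0.0
    correct = sum(
        extract_answer(c) == t.strip() for c, t in zip(completions, targets)
    )
    return correct / len(targets)


if __name__ == "__main__":
    # Made-up completions for illustration only; not real model output.
    completions = [
        "The order is A, then C, then B.\nSo the answer is (B).",
        "3 + 4 = 7.\nSo the answer is 8.",
    ]
    targets = ["(B)", "7"]
    print(exact_match_accuracy(completions, targets))
```

Exact match is strict by design: a correct answer in an unexpected format counts as wrong, so the extraction pattern and the prompt wording need to be kept in sync.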