BBH
A curated subset of BIG-bench tasks that remain difficult for large language models, probing compositional, symbolic, and multi-step reasoning.
About BBH
BIG-Bench Hard (BBH) collects 23 BIG-bench tasks on which earlier language models failed to outperform the average human rater despite scaling trends. The tasks span logical deduction, algorithmic reasoning, causal inference, and mathematics. The problems resist surface-level shortcuts and reward models that can plan, chain intermediate steps, and apply rules compositionally.
BBH is frequently used in model reports as a stress test for reasoning. Improvements on BBH often correlate with better tool use, longer-context handling, and the ability to use intermediate computations (e.g., scratchpads or chain-of-thought) when the evaluation setup permits them.
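As a rough illustration of how a chain-of-thought completion might be scored on a BBH-style task, the sketch below pulls a final answer out of free-form reasoning and computes exact-match accuracy. It is a minimal sketch under stated assumptions, not an official evaluation harness: the "the answer is ..." marker, the helper names (extract_answer, normalize, exact_match_accuracy), and the case-insensitive comparison are all choices made for this example.

```python
import re

# Assumption for this sketch: the model ends its reasoning with a phrase like
# "So the answer is (A)." The marker and pattern are illustrative only.
ANSWER_PATTERN = re.compile(r"the answer is\s*(.+?)\s*\.?\s*$", re.IGNORECASE)


def extract_answer(completion: str) -> str:
    """Return the text after the answer marker, or the whole stripped
    completion when no marker is present (e.g. a direct answer)."""
    text = completion.strip()
    match = ANSWER_PATTERN.search(text)
    return match.group(1).strip() if match else text


def normalize(answer: str) -> str:
    """Hypothetical normalization: trim whitespace and ignore case."""
    return answer.strip().lower()


def exact_match_accuracy(completions, targets) -> float:
    """Fraction of completions whose extracted answer equals the target."""
    if len(completions) != len(targets):
        raise ValueError("completions and targets must have the same length")
    if not targets:
        return 0.0
    correct = sum(
        normalize(extract_answer(c)) == normalize(t)
        for c, t in zip(completions, targets)
    )
    return correct / len(targets)


# Made-up completions and targets, for illustration only.
completions = [
    "Alice is taller than Bob, and Bob is taller than Carol. So the answer is (A).",
    "True and not False is True. So the answer is True.",
    "False",
]
targets = ["(A)", "False", "False"]
accuracy = exact_match_accuracy(completions, targets)
```

Only the extracted final answer is compared with the target, so the intermediate reasoning can vary freely without affecting the score. A real harness would likely need task-specific normalization, for example for multiple-choice labels versus free-form strings.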