About LiveBench

LiveBench aggregates tasks across categories such as reasoning, coding, mathematics, data analysis, and language. Each release freezes a specific task set and judging rubric, so comparisons within a release are apples-to-apples, while successive releases track evolving user needs and prompt styles.

Per-task results are typically aggregated into category scores and an overall score. Because the tasks are diverse and refreshed regularly, LiveBench is useful for tracking practical model utility and for catching regressions over time.
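
As a rough illustration of that aggregation, the sketch below averages task scores within each category and then averages the category scores into an overall number. The task names, score values, and equal category weighting are assumptions for illustration, not LiveBench's published methodology.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-task scores in [0, 100], keyed by (category, task).
# Task names and values are invented for this example.
task_scores = {
    ("reasoning", "zebra_puzzle"): 62.0,
    ("reasoning", "spatial"): 55.5,
    ("coding", "completion"): 48.0,
    ("math", "competition"): 71.3,
}

def aggregate(scores):
    """Average task scores within each category, then average the
    category scores into an overall score (equal weighting assumed)."""
    by_category = defaultdict(list)
    for (category, _task), score in scores.items():
        by_category[category].append(score)
    category_scores = {cat: mean(vals) for cat, vals in by_category.items()}
    overall = mean(category_scores.values())
    return category_scores, overall

category_scores, overall = aggregate(task_scores)
print(category_scores)    # {'reasoning': 58.75, 'coding': 48.0, 'math': 71.3}
print(round(overall, 2))  # 59.35
```

Note that averaging categories with equal weight, as in this sketch, lets a category with few tasks count as much as one with many; averaging over all tasks directly would instead weight categories by task count.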