Vending-Bench 2, created by Andon Labs, measures an AI agent’s ability to stay coherent and run a business profitably over very long horizons. The agent autonomously operates a simulated vending machine business for a full simulated year, managing inventory, orders, pricing, supplier negotiation, daily fees, and disruptions, over a context spanning many millions of tokens. It targets long-horizon agentic reliability rather than single-shot reasoning, and builds on lessons from real-world deployments.
We source results from the public Vending-Bench 2 leaderboard.
Each run covers a full simulated year, and the headline score is the average across five runs per model. Compared with the original Vending-Bench, version 2 adds real-world messiness such as adversarial suppliers, delayed or failed deliveries, and customer refund demands, and it streamlines scoring to a single headline metric: the agent’s money balance in U.S. dollars at the end of the year (higher is better). Andon Labs estimates that a strong human strategy could reach roughly $63,000 per year, so even top models capture only a small fraction of skilled-human performance.
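The scoring arithmetic described above can be sketched in a few lines. This is a minimal illustration, not Andon Labs' scoring code: the function names are hypothetical, and the five balances are made-up placeholder values, not leaderboard results. The only figure taken from the text is the roughly $63,000 human estimate.

```python
from statistics import mean

# Human-strategy estimate quoted in the text (approximate, per year).
HUMAN_ESTIMATE_USD = 63_000

def headline_score(final_balances, expected_runs=5):
    """Average the end-of-year money balances across a model's runs.

    `expected_runs` defaults to 5 because the text says scores are
    averaged across five runs per model.
    """
    if len(final_balances) != expected_runs:
        raise ValueError(
            f"expected {expected_runs} runs, got {len(final_balances)}"
        )
    return mean(final_balances)

def fraction_of_human(score):
    """Express a headline score as a share of the human estimate."""
    return score / HUMAN_ESTIMATE_USD

# Placeholder balances for illustration only (NOT real results).
example_balances = [4000.0, 5500.0, 3000.0, 6000.0, 4500.0]

score = headline_score(example_balances)
share = fraction_of_human(score)
print(f"headline score: ${score:,.2f}")
print(f"share of human estimate: {share:.1%}")
```

Averaging over several runs matters here because a single year-long run can swing widely on one bad supplier deal or failed delivery, so the mean gives a steadier picture than any individual final balance.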
For full details, see the original Vending-Bench paper.