OSWorld
A computer-use benchmark where agents complete real desktop and web tasks in reproducible OS environments using keyboard/mouse actions and structured UI observations.
About OSWorld
OSWorld evaluates computer-use agents in real OS environments running arbitrary desktop and web apps. Agents interact through human-like keyboard/mouse primitives (click, type, scroll, drag, wait) and receive UI state via structured observations, such as accessibility trees.
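As a rough illustration of this interaction model, the sketch below represents the action primitives as a small data type and shows how an agent might turn an accessibility-tree node into a click. The names `Action`, `Node`, `find_node`, and `click_on` are hypothetical and chosen for this example; they are not OSWorld's actual API, and the tree here is hand-built rather than read from a real desktop.

```python
from dataclasses import dataclass, field
from typing import List, Literal, Optional

# Hypothetical action vocabulary mirroring the primitives named above.
ActionKind = Literal["click", "type", "scroll", "drag", "wait"]


@dataclass
class Action:
    kind: ActionKind
    x: int = 0      # screen coordinates, used by click/scroll/drag
    y: int = 0
    text: str = ""  # payload for "type"


@dataclass
class Node:
    """One element of a simplified accessibility tree (illustrative only)."""
    role: str
    name: str
    x: int
    y: int
    children: List["Node"] = field(default_factory=list)


def find_node(root: Node, role: str, name: str) -> Optional[Node]:
    """Depth-first search for the first node matching role and name."""
    if root.role == role and root.name == name:
        return root
    for child in root.children:
        hit = find_node(child, role, name)
        if hit is not None:
            return hit
    return None


def click_on(root: Node, role: str, name: str) -> Action:
    """Ground a symbolic target ("the Save button") into a concrete click.

    Falls back to a wait action when the target is not in the tree yet.
    """
    node = find_node(root, role, name)
    if node is None:
        return Action(kind="wait")
    return Action(kind="click", x=node.x, y=node.y)


# A tiny hand-built tree standing in for a structured observation.
tree = Node("window", "Editor", 0, 0, [
    Node("toolbar", "Main", 0, 0, [
        Node("button", "Save", 40, 12),
        Node("button", "Open", 80, 12),
    ]),
])

save_click = click_on(tree, "button", "Save")
missing = click_on(tree, "button", "Print")
```

Here `save_click` carries the coordinates of the Save button, while `missing` degrades to a `wait` because no matching node exists. A real agent would repeat this observe-then-act step many times per task.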
Tasks emphasize grounded interaction and long-horizon planning across file operations, app configuration, information lookup, and multi-app workflows. Success is measured by execution-based checks that inspect the environment's final state and verify that the task goal was reached and its constraints were respected.
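To make the idea of an execution-based check concrete, here is a minimal sketch for a made-up file-rename task. It judges only the end state (which files exist and what they contain), not the sequence of actions the agent took. The function `check_rename_task`, the file names, and the task itself are assumptions for illustration and do not come from the benchmark.

```python
import tempfile
from pathlib import Path
from typing import Tuple


def check_rename_task(workdir: Path, old: str, new: str, expected_text: str) -> bool:
    """Return True only if the end state satisfies the goal and its constraint.

    Goal: `new` exists as a file and `old` no longer exists.
    Constraint: the file content is unchanged.
    """
    old_path = workdir / old
    new_path = workdir / new
    if old_path.exists() or not new_path.is_file():
        return False
    return new_path.read_text() == expected_text


def demo() -> Tuple[bool, bool]:
    """Run the check before and after a simulated agent completes the task."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        (workdir / "draft.txt").write_text("hello")

        # Before the agent acts, the goal state has not been reached.
        before = check_rename_task(workdir, "draft.txt", "final.txt", "hello")

        # Stand-in for the agent's actions: rename the file.
        (workdir / "draft.txt").rename(workdir / "final.txt")

        after = check_rename_task(workdir, "draft.txt", "final.txt", "hello")
        return before, after


result = demo()
```

Because the check looks only at the resulting files, any action sequence that reaches the goal state passes, whether the agent used a file manager, a terminal, or a save-as dialog.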