BALROG
A benchmark that evaluates models on their ability to play a suite of games of widely varying difficulty.
About BALROG
Models completing the benchmark drive an agent that plays the games by taking actions in the environment. The suite comprises:
- BabyAI, an easy grid-based game in which the player completes basic tasks by moving around and executing one of a small set of actions, such as picking up an in-game item.
- Crafter, a grid-based survival game in which the player has limited health and must stay alive by finding and making everything they need in the game environment.
- TextWorld, a text-only environment from which three games are used: hunting for treasure, cooking, and collecting coins.
- Baba Is AI, a grid-based game governed by rules written in natural language, in which the player moves around to complete various tasks. Notably, the player can rewrite the rules during the game, giving the agent an opportunity to turn a higher-level understanding of the game to its advantage.
- NetHack Learning Environment (NLE), an environment based on the classic terminal video game NetHack, which has a very high skill ceiling and requires enormous amounts of practice for human players to become proficient. The game spans many dungeon levels in which the player must overcome obstacles such as puzzles and battles with monsters while navigating toward a prized item and bringing it back to the starting point.
- MiniHack, a framework built on the NetHack Learning Environment, which BALROG employs for a subset of challenging games. Completing these games requires many of the same abilities as NLE while being achievable in far fewer moves and requiring far less time (hours rather than years) for humans to master.
Methodology
We source BALROG evaluation results directly from the official BALROG leaderboard, maintained by the authors of the original paper. You can also find the evaluation code used to produce these results in the BALROG GitHub repository.
Some of the environments contain multiple tasks: 5 tasks for BabyAI, 40 for Baba Is AI, 3 for TextWorld, and 4 for MiniHack. For each environment and task, models are evaluated between 5 and 25 times with different random seeds.
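As a purely illustrative sketch of that evaluation grid (the dictionary layout and names below are our own, not BALROG's configuration format):

```python
# Sketch of the evaluation grid described above; key names are illustrative.
# Each environment/task pair is run with 5 to 25 different random seeds.
MULTI_TASK_ENVS = {
    "BabyAI": 5,
    "Baba Is AI": 40,
    "TextWorld": 3,
    "MiniHack": 4,
}

total_tasks = sum(MULTI_TASK_ENVS.values())  # 52 tasks across these environments
```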
Each episode involves multiple interaction steps (from dozens in BabyAI to potentially thousands in NetHack) where the agent receives an observation and must produce a valid action. At each step, models receive a structured input containing a description of valid actions for the environment (e.g., “north: move north”, “eat: eat something”), environment-specific instructions, and the current observation formatted as text. For example, in BabyAI, text observations include statements like “You see a key 1 step to the right and 2 steps forward” or “You see a closed door leading north.” In more complex environments like NetHack, the text observations include detailed descriptions of surroundings, inventory items, and player status.
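To make the step structure concrete, here is a minimal sketch of a single interaction step. The names `agent_step` and `query_model` are hypothetical, and BALROG's real prompt templates and model client differ in detail; see the evaluation code for the actual implementation.

```python
# Minimal sketch of one interaction step: assemble the structured input
# (instructions + valid actions + observation), query a model, parse an action.

def agent_step(env_instructions, action_descriptions, observation, query_model):
    """Build the structured prompt for one step and return the chosen action."""
    actions_block = "\n".join(
        f"{name}: {desc}" for name, desc in action_descriptions.items()
    )
    prompt = (
        f"{env_instructions}\n\n"
        f"Valid actions:\n{actions_block}\n\n"
        f"Current observation:\n{observation}\n\n"
        "Respond with exactly one valid action."
    )
    reply = query_model(prompt).strip().lower()
    # Fall back to the first listed action if the reply is not a valid action.
    return reply if reply in action_descriptions else next(iter(action_descriptions))

# Usage with a BabyAI-style observation and a stub model for demonstration:
action = agent_step(
    env_instructions="You are an agent in a grid world. Reach the goal.",
    action_descriptions={"north": "move north", "pickup": "pick up an item"},
    observation="You see a key 1 step to the right and 2 steps forward.",
    query_model=lambda prompt: "north",
)
```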
The benchmark scores each model using a progression metric ranging from 0 to 100, averaged over runs (see the sketch after this list):
- For BabyAI, MiniHack, and Baba Is AI, the model receives a binary score of 0 or 100 based on whether it completed the task.
- For Crafter, the score is the percentage of milestones achieved (e.g., discovering resources, crafting tools).
- For TextWorld, the score between 0 and 100 is based on subtasks completed toward the goal.
- For NetHack, the authors introduce a novel progression metric based on the highest dungeon level (1-50) and experience level (1-30) achieved by the model.
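A hedged sketch of these scoring rules follows. The binary and percentage cases come directly from the list above; the `nethack_score` normalization is only our assumption, as the exact NetHack formula is defined in the BALROG paper.

```python
# Sketch of the 0-100 progression scores described above. The NetHack
# combination is an assumed normalization, not the paper's exact formula.

def binary_score(task_completed):
    """BabyAI, MiniHack, Baba Is AI: 0 or 100 based on task completion."""
    return 100.0 if task_completed else 0.0

def crafter_score(milestones_achieved, total_milestones=22):
    """Crafter: percentage of milestones achieved (standard Crafter
    defines 22 achievements)."""
    return 100.0 * milestones_achieved / total_milestones

def nethack_score(dungeon_level, experience_level):
    """NetHack: assumed even weighting of normalized dungeon level (1-50)
    and experience level (1-30); see the BALROG paper for the real metric."""
    dungeon = (dungeon_level - 1) / (50 - 1)
    experience = (experience_level - 1) / (30 - 1)
    return 100.0 * (dungeon + experience) / 2

# Per-environment results average the progression metric over runs:
runs = [binary_score(True), binary_score(False), binary_score(True)]
env_score = sum(runs) / len(runs)  # ~66.7
```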
For detailed methodology information, including specific prompts and evaluation settings, please refer to the BALROG paper and evaluation code.