BALROG
Methodology
We source BALROG evaluation results directly from the official BALROG leaderboard, maintained by the authors of the original paper. The evaluation code used to produce these results is also available in the BALROG GitHub repository.
Some environments contain multiple tasks: 5 for BabyAI, 40 for Baba Is AI, 3 for TextWorld, and 4 for MiniHack. For each environment and task, models are evaluated between 5 and 25 times with different random seeds.
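As a rough illustration of this setup, the sketch below averages per-episode scores over tasks and seeds. The task names, the hypothetical `run_episode` callable, and the fixed seed count are placeholders for illustration only; the real harness uses between 5 and 25 seeds depending on the environment.

```python
from statistics import mean
from typing import Callable

# Placeholder task lists matching the counts described above.
TASKS = {
    "babyai": [f"task_{i}" for i in range(5)],       # 5 tasks
    "babaisai": [f"puzzle_{i}" for i in range(40)],  # 40 tasks
    "textworld": [f"task_{i}" for i in range(3)],    # 3 tasks
    "minihack": [f"task_{i}" for i in range(4)],     # 4 tasks
    "crafter": ["default"],
    "nethack": ["default"],
}

def evaluate(model, run_episode: Callable[..., float], n_seeds: int = 5) -> dict[str, float]:
    """Average the 0-100 per-episode progression score over tasks and seeds."""
    results = {}
    for env, tasks in TASKS.items():
        episode_scores = [
            run_episode(model, env, task, seed=seed)  # returns a 0-100 score
            for task in tasks
            for seed in range(n_seeds)
        ]
        results[env] = mean(episode_scores)
    return results
```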
Each episode involves multiple interaction steps (from dozens in BabyAI to potentially thousands in NetHack) where the agent receives an observation and must produce a valid action. At each step, models receive a structured input containing a description of valid actions for the environment (e.g., “north: move north”, “eat: eat something”), environment-specific instructions, and the current observation formatted as text. For example, in BabyAI, text observations include statements like “You see a key 1 step to the right and 2 steps forward” or “You see a closed door leading north.” In more complex environments like NetHack, the text observations include detailed descriptions of surroundings, inventory items, and player status.
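To make the per-step input concrete, here is a minimal sketch of how such a prompt could be assembled from the three parts described above (action descriptions, environment instructions, and the current observation). The template wording and the `build_step_prompt` function are assumptions for illustration; the exact prompts used by BALROG are in the evaluation code.

```python
def build_step_prompt(action_descriptions: dict[str, str],
                      instructions: str,
                      observation: str) -> str:
    """Combine valid actions, environment instructions, and the current
    text observation into a single prompt for the agent."""
    actions = "\n".join(f"{name}: {desc}" for name, desc in action_descriptions.items())
    return (
        f"{instructions}\n\n"
        f"Valid actions:\n{actions}\n\n"
        f"Current observation:\n{observation}\n\n"
        "Choose exactly one valid action."
    )

# Example with a BabyAI-style observation from the text above.
prompt = build_step_prompt(
    {"north": "move north", "eat": "eat something"},
    "You are an agent in a grid world. Complete the mission described below.",
    "You see a key 1 step to the right and 2 steps forward.",
)
```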
The benchmark scores each model using a progression metric ranging from 0 to 100, averaged across runs (a rough sketch of these scoring rules follows the list):
- For BabyAI, MiniHack, and Baba Is AI, the model receives a binary score of 0 or 100 based on whether it completed the task.
- For Crafter, the score represents the percentage of milestones achieved (e.g., discovering resources, crafting tools).
- For TextWorld, the score between 0 and 100 reflects the fraction of subtasks completed toward the goal.
- For NetHack, the authors introduce a novel progression metric based on the highest dungeon level (1-50) and experience level (1-30) achieved by the model.
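The sketch below restates these scoring rules in code. The function names are hypothetical, and the equal-weighted NetHack formula is an assumption for illustration; the paper defines the actual progression metric.

```python
def binary_score(task_completed: bool) -> float:
    """BabyAI, MiniHack, Baba Is AI: 100 if the task was solved, else 0."""
    return 100.0 if task_completed else 0.0

def milestone_score(achieved: int, total: int) -> float:
    """Crafter and TextWorld: percentage of milestones or subtasks completed."""
    return 100.0 * achieved / total

def nethack_progression(dungeon_level: int, experience_level: int) -> float:
    """NetHack: progression from the highest dungeon level (1-50) and
    experience level (1-30) reached. The equal weighting here is a
    placeholder, not the paper's exact formula."""
    dungeon_frac = (dungeon_level - 1) / (50 - 1)
    experience_frac = (experience_level - 1) / (30 - 1)
    return 100.0 * (dungeon_frac + experience_frac) / 2
```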
For detailed methodology, including the specific prompts and evaluation settings, please refer to the BALROG paper and evaluation code.