About WeirdML (v2)

Each task in WeirdML is a machine learning engineering problem, with training data and instructions describing the problem to the agent. The agent must write code to train a machine learning model on the provided data to perform the required task. The tasks include shape recognition, digit classification, other image tasks, and predicting chess game outcomes; they were designed to be novel and to require careful thinking.

Methodology

We source the data directly from the leaderboard on the WeirdML benchmark page.

WeirdML evaluates models through an automated pipeline that executes model-generated PyTorch code in an isolated Docker environment. For each task, the model receives a detailed prompt describing the problem, along with example code for data loading and prediction saving. The model then gets 5 iterations to solve the task, receiving feedback after each attempt in the form of execution logs and test accuracy.
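The loop structure described above can be sketched as follows. This is a minimal illustration, not the actual WeirdML pipeline: the helper names (`build_prompt`, `generate_code`, `run_in_docker`) are hypothetical stand-ins for the real prompt construction, model call, and sandboxed execution.

```python
MAX_ITERATIONS = 5

def build_prompt(task: str, history: list[dict]) -> str:
    # Task description plus feedback (logs and accuracy) from earlier attempts.
    feedback = "\n".join(
        f"Attempt {i + 1}: accuracy={h['accuracy']:.3f}\n{h['logs']}"
        for i, h in enumerate(history)
    )
    return f"{task}\n\nPrevious attempts:\n{feedback}" if history else task

def generate_code(prompt: str) -> str:
    raise NotImplementedError  # call the model under evaluation here

def run_in_docker(code: str) -> tuple[str, float]:
    raise NotImplementedError  # execute the code in the sandbox, return (logs, accuracy)

def evaluate_run(task: str) -> float:
    """One run: up to MAX_ITERATIONS attempts; the run scores its best accuracy."""
    history: list[dict] = []
    for _ in range(MAX_ITERATIONS):
        code = generate_code(build_prompt(task, history))
        logs, accuracy = run_in_docker(code)
        history.append({"logs": logs, "accuracy": accuracy})
    return max(h["accuracy"] for h in history)
```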

Each task is run at least 15 times per model to account for performance variance, with some exceptions due to computational cost: o1, o1-preview, gpt4.5-preview, and claude-3.7-sonnet (thinking 8k) are run 5 times each. The final score for each task is the mean accuracy across all runs, where each run’s accuracy is the maximum achieved across its 5 iterations.
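As a concrete illustration of the scoring rule (the accuracy values below are made up, not real benchmark data):

```python
# Each inner list is one run: 5 iteration accuracies for the same task.
runs = [
    [0.10, 0.42, 0.55, 0.55, 0.61],  # run 1: best = 0.61
    [0.30, 0.30, 0.48, 0.70, 0.66],  # run 2: best = 0.70
    [0.25, 0.58, 0.58, 0.58, 0.64],  # run 3: best = 0.64
]

# A run's accuracy is its best iteration; the task score is the mean over runs.
per_run_best = [max(run) for run in runs]           # [0.61, 0.70, 0.64]
task_score = sum(per_run_best) / len(per_run_best)  # 0.65
print(f"task score: {task_score:.2f}")
```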

The evaluation environment enforces strict resource constraints: a TITAN V GPU with 12GB memory and a 600-second timeout per execution. For detailed methodology information and prompts, please refer to the WeirdML benchmark page.
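Isolation and limits of this kind can be approximated with a standard Docker invocation. The sketch below is an assumption about how such a sandbox might be driven from Python, not the benchmark's actual configuration: the image name, mount path, and solution filename are hypothetical.

```python
import subprocess

def run_sandboxed(workdir: str) -> subprocess.CompletedProcess:
    """Run untrusted code in an isolated container with a hard time limit."""
    cmd = [
        "docker", "run", "--rm",
        "--network", "none",           # no network access inside the sandbox
        "--gpus", "device=0",          # expose a single GPU (a TITAN V in WeirdML)
        "-v", f"{workdir}:/workspace", # mount the task data and generated code
        "weirdml-sandbox",             # hypothetical image name
        "python", "/workspace/solution.py",
    ]
    # 600-second wall-clock limit per execution, as in the benchmark;
    # raises subprocess.TimeoutExpired if exceeded, which the caller
    # would treat as a failed attempt.
    return subprocess.run(cmd, capture_output=True, text=True, timeout=600)
```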