WeirdML

Externally evaluated

The WeirdML benchmark evaluates models’ ability to solve novel machine learning tasks that require careful thinking and understanding rather than applying standard recipes. It tests models across six diverse tasks: shape recognition (easy and hard variants), image patch reconstruction (easy and hard variants), chess game outcome prediction, and semi-supervised digit classification.

For example, in the Shapes (Easy) task, models must identify one of five shapes (circle, square, triangle, pentagon, star) from a set of 512 2D coordinates, where only some points belong to the shape and others are noise. The challenge lies in developing a way to encode the data that is invariant to point permutations and can effectively combine information from many points.
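To make the permutation-invariance requirement concrete, here is a minimal sketch of one hand-crafted encoding: a normalized histogram of point distances from the centroid. It is an illustration only, not a solution taken from the benchmark. The function name `radial_histogram` and the bin count are assumptions, and the sketch makes no attempt to separate shape points from noise.

```python
import math
import random

def radial_histogram(points, bins=16):
    """Permutation-invariant feature: normalized histogram of distances from the centroid."""
    n = len(points)
    # math.fsum is exactly rounded, so the centroid does not depend on point order.
    cx = math.fsum(x for x, _ in points) / n
    cy = math.fsum(y for _, y in points) / n
    dists = [math.hypot(x - cx, y - cy) for x, y in points]
    max_d = max(dists) or 1.0  # avoid division by zero if all points coincide
    hist = [0] * bins
    for d in dists:
        idx = min(int(d / max_d * bins), bins - 1)  # clamp the farthest point into the last bin
        hist[idx] += 1
    # Dividing by n pools information from all points into a fixed-size vector.
    return [count / n for count in hist]

# Synthetic stand-in for one sample: 512 random 2D points (not real benchmark data).
random.seed(0)
points = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(512)]
shuffled = points[:]
random.shuffle(shuffled)

# Reordering the points leaves the feature vector unchanged.
same_features = radial_histogram(points) == radial_histogram(shuffled)
```

The histogram aggregates over all points with an order-independent operation (counting), which is the property the task calls for; a learned encoder would need an equivalent pooling step.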

Each model gets 5 attempts per task, with feedback between attempts, so the benchmark measures both initial problem-solving ability and capacity to learn from execution results.

Methodology

We take the data directly from the leaderboard on the WeirdML benchmark page.

WeirdML evaluates models through an automated pipeline that executes model-generated PyTorch code in an isolated Docker environment. For each task, the model receives a detailed prompt describing the problem and example code for data loading and prediction saving. The model then gets 5 iterations to solve the task, receiving feedback after each attempt in the form of execution logs and test accuracy.
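The iterate-with-feedback loop described above can be sketched as follows. This is a simplified outline under stated assumptions, not the actual WeirdML pipeline: `run_task`, `generate_code`, and `execute` are hypothetical names, and the model and sandbox are replaced by stand-in functions so the example runs on its own.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, float]]  # (execution log, test accuracy) per attempt

def run_task(
    generate_code: Callable[[str, History], str],
    execute: Callable[[str], Tuple[str, float]],
    prompt: str,
    max_iterations: int = 5,
) -> List[float]:
    """Run max_iterations attempts, passing earlier logs and accuracies back to the model."""
    history: History = []
    for _ in range(max_iterations):
        code = generate_code(prompt, history)  # the model sees all previous feedback
        log, accuracy = execute(code)          # hypothetical isolated execution step
        history.append((log, accuracy))
    return [accuracy for _, accuracy in history]

# Stand-ins for illustration only: a fake "model" and a fake "sandbox".
def fake_model(prompt: str, history: History) -> str:
    return f"attempt {len(history) + 1}"

def fake_sandbox(code: str) -> Tuple[str, float]:
    attempt = int(code.split()[-1])
    return f"ran {code}", 0.1 * attempt

accuracies = run_task(fake_model, fake_sandbox, "task description")
```

Keeping the full history, rather than only the latest result, lets each new attempt draw on every earlier log and accuracy.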

Each task is run at least 15 times per model to account for performance variance, with some exceptions due to computational cost: o1, o1-preview, gpt4.5-preview, and claude-3.7-sonnet (thinking 8k) are run 5 times each. The final score for each task is the mean accuracy across all runs, where each run’s accuracy is the maximum achieved across its 5 iterations.
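The scoring rule above amounts to a mean over runs of a per-run maximum. A small worked example, using made-up accuracies and a hypothetical helper named `task_score`:

```python
from statistics import mean

def task_score(runs):
    """Mean over runs of the best test accuracy reached within each run."""
    return mean(max(iteration_accuracies) for iteration_accuracies in runs)

# Made-up accuracies: 3 runs of 5 iterations each
# (the benchmark uses at least 15 runs for most models).
example_runs = [
    [0.20, 0.35, 0.50, 0.45, 0.60],  # best: 0.60
    [0.10, 0.10, 0.40, 0.70, 0.65],  # best: 0.70
    [0.30, 0.55, 0.50, 0.50, 0.50],  # best: 0.55
]

score = task_score(example_runs)  # (0.60 + 0.70 + 0.55) / 3
```

Because each run contributes its best iteration rather than its last, a regression in a later attempt does not lower the score.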

The evaluation environment enforces strict resource constraints: a TITAN V GPU with 12 GB of memory and a 600-second timeout per execution. For detailed methodology information and prompts, please refer to the WeirdML benchmark page.