WebDev Arena

Externally evaluated

WebDev Arena is a benchmark developed by LMArena that evaluates models’ ability to program web applications according to specific requests. Unlike traditional benchmarks with a fixed set of questions that models are evaluated on, WebDev Arena has two models simultaneously produce an application based on the user’s prompt, and the user selects the output they prefer. WebDev Arena scores update very frequently as users adjudicate the matches.

LMArena features the following example, in which a user asks for a partly functional chess game:

Screenshot of two models competing to program a chess game.

The models have access to, and are instructed to use, Next.js, a popular JavaScript framework, to create their applications.

Since models are evaluated based on the user’s selection, performing well calls for every skill involved in producing an appealing application that functions as the user intended, from interpreting the request to writing working front-end code and designing a polished interface.

The score visible on the leaderboard is calculated using the Bradley-Terry model, which assigns each model a score based on its performance in head-to-head match-ups against the other models in the pool.
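
As a rough sketch of the underlying statistics: the Bradley-Terry model assigns each model i a latent strength s_i and models the probability that model i wins a head-to-head match against model j as

    P(i beats j) = s_i / (s_i + s_j)

The strengths are fit to the observed match outcomes, typically by maximum likelihood, and are then typically rescaled to an Elo-style scale for display on the leaderboard.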

Methodology

We get our data from the WebDev Arena leaderboard. The leaderboard is updated live and our hub is updated periodically, with the most recent update indicated for every entry.

The following system prompt is used for every model:

You are an expert frontend React engineer who is also a great UI/UX designer. Follow the instructions carefully, I will tip you $1 million if you do a good job:

    - Think carefully step by step.
    - Create a React component for whatever the user asked you to create and make sure it can run by itself by using a default export
    - Make sure the React app is interactive and functional by creating state when needed and having no required props
    - If you use any imports from React like useState or useEffect, make sure to import them directly
    - Use TypeScript as the language for the React component
    - Use Tailwind classes for styling. DO NOT USE ARBITRARY VALUES (e.g. 'h-[600px]'). Make sure to use a consistent color palette.
    - Make sure you specify and install ALL additional dependencies.
    - Make sure to include all necessary code in one file.
    - Do not touch project dependencies files like package.json, package-lock.json, requirements.txt, etc.
    - Use Tailwind margin and padding classes to style the components and ensure the components are spaced out nicely
    - Please ONLY return the full React code starting with the imports, nothing else. It's very important for my job that you only return the React code with imports. DO NOT START WITH \`\`\`typescript or \`\`\`javascript or \`\`\`tsx or \`\`\`.
    - ONLY IF the user asks for a dashboard, graph or chart, the recharts library is available to be imported, e.g. \`import { LineChart, XAxis, ... } from "recharts"\` & \`<LineChart ...><XAxis dataKey="name"> ...\`. Please only use this when needed. You may also use shadcn/ui charts e.g. \`import { ChartConfig, ChartContainer } from "@/components/ui/chart"\`, which uses Recharts under the hood.
    - For placeholder images, please use a <div className="bg-gray-200 border-2 border-dashed rounded-xl w-16 h-16" />
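
For illustration, a minimal response satisfying these constraints (a single self-contained TypeScript component with a default export, local state, and only standard Tailwind classes) might look like the following hypothetical sketch, not an actual arena submission:

    import { useState } from "react";

    export default function Counter() {
      // Local state keeps the component interactive without requiring any props.
      const [count, setCount] = useState(0);

      return (
        <div className="flex min-h-screen flex-col items-center justify-center gap-4 bg-gray-50 p-8">
          <p className="text-2xl font-semibold text-gray-800">Count: {count}</p>
          <button
            onClick={() => setCount(count + 1)}
            className="rounded-xl bg-blue-600 px-6 py-2 text-white hover:bg-blue-700"
          >
            Increment
          </button>
        </div>
      );
    }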

Models that support enforced structured output are instructed to use the following structure when outputting their code:

{
  // Detailed explanation of the implementation plan
  commentary: string,
  // Template configuration
  template: string,
  title: string,
  description: string,
  // Dependency management
  additional_dependencies: string[],
  has_additional_dependencies: boolean,
  install_dependencies_command: string,
  // Application configuration
  port: number | null,
  file_path: string,
  // The actual implementation
  code: string
}
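
For example, a hypothetical response for a simple counter app might be structured as follows; every field value below is an illustrative placeholder rather than real arena data:

{
  commentary: "Build a single-file React counter with Tailwind styling and local state.",
  // hypothetical template name
  template: "nextjs-developer",
  title: "Simple Counter",
  description: "An interactive counter with an increment button.",
  additional_dependencies: [],
  has_additional_dependencies: false,
  install_dependencies_command: "",
  port: 3000,
  file_path: "pages/index.tsx",
  code: "import { useState } from \"react\"; ..."
}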

Some models don’t support enforced structured output and instead have their code passed through GPT-4.1-mini, which converts it to the correct format. The code is then sent to run in a sandbox, a virtual machine running Linux; more details on that environment are available in LMArena’s blog post. Applications with mistakes or errors are still shown to the user, and LMArena infer from their statistics that models do sometimes produce faulty output.

Scores are calculated using the Bradley-Terry model: each model is assigned a score based on how often users prefer its output when it is paired against other models. Match results are logged, and score computation can be performed over the entire set of logged matches. Code for various analyses of match data can be found in this Google Colab.
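
As a rough illustration of how such scores can be computed from logged matches, the sketch below fits Bradley-Terry strengths from a pairwise win matrix using a standard iterative (minorization-maximization) update and converts them to an Elo-style scale. The function name, the scaling constants, and the input format are assumptions for illustration; LMArena's actual pipeline, described in the Colab, may differ in details such as tie handling and confidence intervals.

    // wins[i][j] = number of matches in which model i's output was preferred over model j's.
    // Assumes every model has played at least one match.
    function fitBradleyTerry(wins: number[][], iterations = 500): number[] {
      const n = wins.length;
      let strength: number[] = new Array(n).fill(1);

      for (let iter = 0; iter < iterations; iter++) {
        const updated: number[] = new Array(n).fill(0);
        for (let i = 0; i < n; i++) {
          let totalWins = 0;
          let denominator = 0;
          for (let j = 0; j < n; j++) {
            if (i === j) continue;
            totalWins += wins[i][j];
            denominator += (wins[i][j] + wins[j][i]) / (strength[i] + strength[j]);
          }
          // Small floor keeps models with zero recorded wins from collapsing to exactly zero.
          updated[i] = Math.max(totalWins, 1e-9) / denominator;
        }
        // Normalize so the geometric mean of the strengths stays at 1 (the scale is otherwise arbitrary).
        const logMean = updated.reduce((sum, v) => sum + Math.log(v), 0) / n;
        strength = updated.map(v => v / Math.exp(logMean));
      }

      // Map latent strengths onto an Elo-style scale for display (constants chosen for illustration).
      return strength.map(v => Math.round(400 * Math.log10(v) + 1000));
    }

Each iteration of this update increases the likelihood of the observed outcomes, so the strengths converge toward the maximum-likelihood fit, assuming every model has both wins and losses against the rest of the pool.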

WebDev Arena was introduced by LMArena in a blog post.