GeoBench

Externally evaluated

GeoBench tests whether AI models can determine where in the world a photo was taken. This challenge is inspired by GeoGuessr, the online game where players examine Street View imagery to pinpoint locations on a map by hunting for clues like road signs, vegetation, architecture, and license plates.

The benchmark presents models with photographs from known locations drawn from five actual GeoGuessr community maps: “A Community World”, “A Varied World”, Urban environments (cities and metropolitan areas), Rural environments (countryside and remote areas), and Photos (non-street view user-uploaded photos). Models must provide both latitude/longitude coordinates and a country guess, with scoring based on geographic distance (in kilometers) and country accuracy.

GeoBench evaluates core multimodal capabilities: visual recognition of geographic features (vegetation, architecture, infrastructure), text comprehension (reading signs in various scripts), spatial reasoning, and the integration of multiple visual cues to form location hypotheses. GeoBench was introduced on the CCMDI blog.

Methodology

We source the data from the GeoBench leaderboard and include models that the leaderboard considers deprecated, capturing every model that the GeoBench team has evaluated.

The models are tested by providing each image together with a brief prompt that encourages them to look for visual clues in the image and then specifically requests a structured output containing latitude, longitude, and country guesses. When the API allows configuring the temperature, it is set to 0.4.
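The structured-output step above can be sketched as follows. This is a minimal illustration, not the actual GeoBench harness: the prompt wording and the JSON field names (`latitude`, `longitude`, `country`) are assumptions for the example.

```python
import json

# Hypothetical prompt; the real GeoBench prompt is in the project repository.
PROMPT = (
    "Examine the image for visual clues (road signs, vegetation, architecture, "
    "license plates), then answer ONLY with JSON of the form "
    '{"latitude": <float>, "longitude": <float>, "country": <string>}.'
)

def parse_guess(model_output: str) -> tuple[float, float, str]:
    """Parse a model's structured reply into (latitude, longitude, country).

    Field names are illustrative assumptions, not the exact GeoBench schema.
    """
    guess = json.loads(model_output)
    return float(guess["latitude"]), float(guess["longitude"]), str(guess["country"])
```

A reply such as `{"latitude": 48.85, "longitude": 2.35, "country": "France"}` would then be scored against the ground-truth location.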

For the map “A Community World”, most of the models were evaluated on 100 images, though some of the more expensive models were tested on smaller subsets of as few as 20 of those images. Benchmarks for the other maps used similarly sized subsets.

All distances in the score calculations are great-circle distances (the length of the shortest arc between two points on a sphere’s surface), computed with the haversine formula. The maximum score for an individual round is 5,000 points, awarded when the guess-to-location distance is below a tolerance equal to the larger of 25 meters or 0.001% of the distance between the two furthest points in the set of locations. Any less accurate guess receives a score that decays exponentially with its distance from the true location.
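A minimal sketch of this scoring scheme is below. The haversine distance follows the standard formula; the scoring function mirrors the described rules (5,000-point maximum, tolerance of the larger of 25 m or 0.001% of the map’s maximum span, exponential decay beyond it), but the decay constant here is an assumption for illustration, not the exact GeoBench value.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def round_score(distance_km, map_span_km, max_score=5000, decay_frac=0.1):
    """Score one round: full points inside the tolerance, exponential decay outside.

    map_span_km: distance between the two furthest locations in the set.
    decay_frac is a hypothetical decay constant, not GeoBench's exact parameter.
    """
    tolerance_km = max(0.025, 1e-5 * map_span_km)  # 25 m or 0.001% of the span
    if distance_km <= tolerance_km:
        return max_score
    return max_score * math.exp(-distance_km / (decay_frac * map_span_km))
```

For example, a guess 1 degree of longitude off at the equator is roughly 111 km from the target and would lose most of its points on a small map but far fewer on a world-spanning one.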

The full prompts, score calculation, and other scaffolding are available in the GeoBench GitHub repository and described in CCMDI’s GeoBench blog post.