VPCT
VPCT (the Visual Physics Comprehension Test) evaluates models’ ability to reason about simple physics from a static image. Each sample shows a ball-and-bucket simulation: a ball at some initial position, a set of drawn lines or obstacles, and three buckets numbered 1 to 3 from left to right. Answering correctly requires predicting which bucket the ball will eventually fall into, taking into account gravity, the ball’s bounces off surfaces, and how the placement and angle of each drawn line guides its path.
Methodology
We take the data directly from the Visual Physics Comprehension Test web page.
The prompt used for the evaluation is the following:
You are an expert physics simulator. Looking at this image of a ball-and-bucket physics simulation, predict which bucket (numbered 1, 2, or 3 from left to right) the ball will eventually fall into.
Let’s think about this step by step:
- First, observe the initial position of the ball
- Note any obstacles or lines drawn that will affect the ball’s path
- Consider how gravity will affect the ball’s trajectory
- Think about how the ball will bounce and roll along the surfaces
- Analyze how the placement and angle of each line will guide the ball
- Factor in that the ball has some elasticity and will bounce slightly when it hits surfaces
Based on your analysis, please conclude with a clear answer in this format: ‘answer(X)’ where X is the bucket number (1, 2, or 3).
Explain your reasoning, then end with your answer in the specified format.
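To make the scoring step concrete, here is a minimal sketch of how a response in this format might be parsed and graded. This is an illustration only, not the published evaluation code referenced below; the function names and the regular expression are assumptions.

```python
import re
from typing import Optional

def extract_answer(response: str) -> Optional[int]:
    """Pull the predicted bucket out of a response ending in 'answer(X)'.

    Returns None when the response never uses the requested format,
    which a harness could simply count as an incorrect prediction.
    """
    # Take the last match, since the prompt asks the model to explain
    # its reasoning first and only then end with the formatted answer.
    matches = re.findall(r"answer\((\d)\)", response)
    if not matches:
        return None
    bucket = int(matches[-1])
    return bucket if bucket in (1, 2, 3) else None

def score_sample(response: str, correct_bucket: int) -> bool:
    """Grade a single sample: the prediction must match the true bucket."""
    return extract_answer(response) == correct_bucket
```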
Reported model scores are averages over multiple runs, with the number of runs varying by model: several models are evaluated with two to three runs, while others are evaluated with just a single run. The evaluation code is accessible here.
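As a rough illustration of the aggregation, and assuming each run produces an accuracy over the samples, the reported score for a model is the mean over however many runs that model has. The model names and accuracies below are hypothetical placeholders, not actual results.

```python
from statistics import mean

# Hypothetical per-run accuracies; the number of runs varies by model,
# as noted above (some models have 2-3 runs, others a single run).
runs_by_model = {
    "model-a": [0.62, 0.58, 0.60],  # three runs
    "model-b": [0.71],              # one run
}

# Reported score per model: mean accuracy across its runs.
reported_scores = {model: mean(accs) for model, accs in runs_by_model.items()}
```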