Fiction.liveBench
Fiction.liveBench evaluates models’ ability to understand longform fiction writing. Fiction.live is a popular platform with a community of creative writers who use it to share and discuss their stories. The benchmark is based on a set of stories: each sample is a question about one of the stories on Fiction.live. Answering the questions requires having a theory of mind for the characters, an understanding of the chronology of events, and an ability to make inferences from information that is only implied.
The benchmark is meant to measure how well models can handle long contexts, while being more challenging than traditional “needle in a haystack” evaluations, which amount to simple recognition or retrieval. For example, a model that can locate a specific word in the text but does not understand the story well enough to answer a question about a character’s state of mind will do well on a retrieval test yet poorly on Fiction.liveBench, so its Fiction.liveBench score better reflects its long-context comprehension abilities.
Fiction.liveBench was created by kas and published on Fiction.live.
Methodology
We source the data directly from the Fiction.liveBench leaderboard.
The benchmark consists of 36 questions about 30 stories. For each parent story, multiple shortened versions are created that preserve the key details, spanning lengths from a near-minimal text up to the full original. Models are tested separately on each length, and the results are aggregated across all 30 stories and reported by length.
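To make the per-length aggregation concrete, here is a minimal sketch in Python of how accuracy could be pooled across stories for each story length. The record fields and length values are purely illustrative assumptions, not the benchmark’s actual data format or the leaderboard’s schema.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-sample results: one record per (story, shortened version, question).
# Field names and length buckets are assumptions for illustration only.
results = [
    {"story_id": "story_01", "context_length": 1000, "correct": True},
    {"story_id": "story_01", "context_length": 16000, "correct": False},
    {"story_id": "story_02", "context_length": 1000, "correct": True},
    # ... one record per question and story version
]

# Group outcomes by the length of the shortened story version,
# pooling all stories at that length.
by_length = defaultdict(list)
for record in results:
    by_length[record["context_length"]].append(record["correct"])

# Mean accuracy at each length, reported from shortest to longest.
scores_by_length = {
    length: mean(1.0 if c else 0.0 for c in flags)
    for length, flags in sorted(by_length.items())
}

print(scores_by_length)  # e.g. {1000: 1.0, 16000: 0.0}
```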
The full report, with further information about the benchmark and discussion, is available on Fiction.live.