About Fiction.liveBench

Fiction.live is a popular platform with a community of creative writers who use it to share and discuss their stories. The benchmark is built on a set of these stories: each sample is a question about one story on Fiction.live. Answering the questions requires a theory of mind for the characters, an understanding of the chronology of events, and the ability to draw inferences from information that is only implied.

The benchmark is meant to measure how well models handle long contexts while being more challenging than traditional “needle in a haystack” evaluations, which resemble simple recognition or retrieval.

Methodology

We source the data directly from the Fiction.liveBench leaderboard.

The benchmark consists of 36 questions about 30 stories. For each parent story, multiple shorter versions are created that preserve the key details, spanning a range of lengths from a near-minimal text up to the original. Models are tested separately on every length, and the results are aggregated across all 30 stories and reported by the length of the text.
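
The per-length aggregation described above can be sketched roughly as follows. This is only an illustration, not the leaderboard's actual pipeline: the record layout (story id, context length, correctness flag), the function name accuracy_by_length, and the sample values are all assumptions made for the example.

```python
from collections import defaultdict


def accuracy_by_length(results):
    """Aggregate per-question outcomes into one accuracy score per text length.

    `results` is a hypothetical iterable of (story_id, context_length, is_correct)
    tuples; the real benchmark's data format is not specified in this document.
    """
    totals = defaultdict(lambda: [0, 0])  # context_length -> [correct, total]
    for _story_id, context_length, is_correct in results:
        totals[context_length][0] += int(is_correct)
        totals[context_length][1] += 1
    # Pool all stories together and report one score per length, shortest first.
    return {
        length: correct / total
        for length, (correct, total) in sorted(totals.items())
    }


# Made-up sample data: two stories, each tested at two lengths.
sample = [
    ("story_a", 1000, True),
    ("story_b", 1000, True),
    ("story_a", 8000, True),
    ("story_b", 8000, False),
]
print(accuracy_by_length(sample))  # {1000: 1.0, 8000: 0.5}
```

Grouping by length rather than by story mirrors how the results are displayed: one score per length, pooled over all stories.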

The full report, with more information about the benchmark and further discussion, is available on Fiction.live.