PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts
¹WeChat AI  ²HKUST  ³CUHK  ⁴NJIT
* = equal contribution
Spoiler alert: We show it is possible to measure Fluid Intelligence in natural language space.
Illustration of two examples from our PRELUDE task that are Consistent with and Contradictory to the canonical book, respectively.

Summary of Our Research

We introduce PRELUDE, a benchmark for evaluating long-context understanding through the task of determining whether a character's prequel story is consistent with the canonical narrative of the original book. Our task places a stronger demand on global comprehension and deep reasoning than existing benchmarks: because the prequels are not part of the original story, assessing their plausibility typically requires searching for and integrating information that is only indirectly related. Empirically, 88% of instances require evidence from multiple parts of the narrative. A comprehensive study on our task demonstrates that:

  1. In-context learning, RAG, and in-domain training with state-of-the-art LLMs, as well as commercial DeepResearch services, all lag behind humans by more than 15% (a minimal sketch of this evaluation setting follows this list);
  2. Models often produce correct answers with flawed reasoning, leading to a gap of over 30% in reasoning accuracy relative to humans;
  3. Recent improvements in LLMs' general reasoning capabilities do not necessarily lead to better long-context reasoning, with a notable performance drop when context is provided;
  4. Taken together with the poor performance of DeepResearch on our task, these results show that the task cannot be solved simply by retrieving existing information from the web; instead, it requires generating new knowledge through reasoning based on learned rules.
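
To make the retrieval-augmented setting in point 1 concrete, the sketch below shows how a single PRELUDE instance might be posed to an LLM: the model sees retrieved excerpts of the original book plus a candidate prequel and must label the pair as consistent or contradict. The prompt wording, the `llm_call` interface, and the label parsing are our own illustrative assumptions, not the benchmark's released code.

```python
from typing import Callable, List

LABELS = ("consistent", "contradict")


def build_prompt(character: str, prequel: str, passages: List[str]) -> str:
    """Assemble a consistency-judgment prompt from a candidate prequel and
    retrieved excerpts of the original book (hypothetical RAG-style setup)."""
    context = "\n\n".join(f"[Excerpt {i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "You are given excerpts from a canonical book and a prequel story "
        f"written about the character {character}.\n\n"
        f"Book excerpts:\n{context}\n\n"
        f"Prequel story:\n{prequel}\n\n"
        "Question: Is the prequel consistent with the canonical narrative, or "
        "does it contradict it? Explain your reasoning, then give exactly one "
        "word on the final line: consistent or contradict."
    )


def judge_prequel(
    character: str,
    prequel: str,
    passages: List[str],
    llm_call: Callable[[str], str],
) -> str:
    """Query any text-in/text-out LLM interface and parse the final label."""
    response = llm_call(build_prompt(character, prequel, passages))
    lines = response.strip().splitlines() or [""]
    last_line = lines[-1].lower()
    for label in LABELS:
        if label in last_line:
            return label
    return "unparsed"  # model did not follow the requested answer format
```

Under the in-context-learning setting, `passages` would instead hold the full narrative (or as much of it as fits in the context window); in either setting, the predicted label is then scored against the human annotation.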

Definitions of Our Annotation Labels

Comparison to Existing Story Understanding Benchmarks

We compare PRELUDE with popular existing benchmarks along the following criteria, which we consider essential for long-context understanding and reasoning.

  1. Beyond Memorization: LLMs memorize content from pretraining, especially popular texts, which can let them answer without true comprehension. As a necessary condition, a robust benchmark must rule out solutions based on memorization alone, so that reasoning over the full context remains essential.
  2. Global Dependency: The task should require aggregating evidence that is scattered across the context or exhibits global dependencies; otherwise, it reduces to a short-context problem focused on retrieval rather than true long-text understanding.
  3. Depth of Reasoning: Long-context reasoning should inherently require synthesizing multiple pieces of evidence and multi-step deduction, instead of shallow reasoning such as decomposition or enumeration.
  4. Human-Machine Gap: To highlight essential capabilities that general-purpose intelligent systems should possess, a benchmark should show a significant gap between humans and machines.
  5. Beyond Summarization/Salience: A strong benchmark should require attention to fine-grained details beyond high-level abstraction to remain challenging and meaningful. Otherwise, it risks reducing to a summarization task that is solvable without long-context understanding.