PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts
¹WeChat AI  ²HKUST  ³CUHK  ⁴NJIT
* Equal contribution
Note: The variation in ranks between the two sets is due to differences in the proportion of "Contradict" labels in each set.
Leaderboard, sorted by Macro-Avg. Scores are overall F1 on the Human Study Set.

| Rank | Organization | Model | Tags | Macro-Avg | Consistent | Contradict |
|------|--------------|-------|------|-----------|------------|------------|
| –    | –            | Human | –    | 81.7 | 79.5 | 83.9 |
| 1    | DeepSeek     | DeepSeek-R1 | THINK, OPEN | 65.0 | 65.0 | 65.0 |
| 2    | Google DeepMind | Gemini-2.5-Pro | THINK | 64.3 | 52.0 | 76.8 |
| 3    | Google DeepMind | Gemini-2.5-Flash | THINK | 63.4 | 55.2 | 71.6 |
| 4    | Alibaba Cloud | Qwen3-235B-A22B (+RAG top-40) | RAG, THINK, OPEN | 63.1 | 56.9 | 69.3 |
| 5    | OpenAI       | OpenAI DeepResearch | THINK | 62.5 | 58.4 | 66.7 |
| 6    | DeepSeek     | DeepSeek-R1 + Many-Shot ICL | THINK, OPEN | 62.3 | 62.0 | 62.7 |
| 7    | Google DeepMind | Gemini-2.5-Flash (+RAG top-40) | RAG, THINK | 60.7 | 45.8 | 75.6 |
| 8    | Alibaba Cloud | Qwen3-32B (+RAG top-40) | RAG, THINK, OPEN | 60.5 | 60.0 | 61.0 |
| 9    | OpenAI       | GPT-4o | RAG | 60.2 | 50.8 | 69.6 |
| 10   | Alibaba Cloud | Qwen3-32B (SFT, +RAG top-40) | SFT, RAG, THINK, OPEN | 59.7 | 60.1 | 59.2 |
| 11   | Google DeepMind | Gemini-2.5-Pro + Many-Shot ICL | THINK | 59.5 | 46.0 | 73.0 |
| 12   | DeepSeek     | DeepSeek-R1 (+RAG top-40) | RAG, THINK, OPEN | 59.1 | 42.4 | 75.9 |
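The Macro-Avg column matches the unweighted mean of the two per-label F1 scores (to within rounding of the per-label values). This is also why the note above matters: macro averaging weights "Consistent" and "Contradict" equally, so a shift in the label proportions between evaluation sets can change how per-label strengths translate into rank. Below is a minimal sketch of the computation; the helper names are illustrative, not from the PRELUDE codebase.

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """Standard F1: harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)


def macro_f1(per_label_f1: list[float]) -> float:
    """Unweighted mean over labels: each label counts equally,
    regardless of how often it appears in the evaluation set."""
    return sum(per_label_f1) / len(per_label_f1)


# Reproducing the Human row from the leaderboard above:
print(round(macro_f1([79.5, 83.9]), 1))  # -> 81.7
```

Because each label contributes equally to the mean, a model that is much stronger on one label (e.g., Gemini-2.5-Pro's 76.8 on "Contradict" vs. 52.0 on "Consistent") can rank differently on a set with a different label mix.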