Overall F1 scores on the Human Study Set

| Rank | Organization | Model | Tags | Macro-Avg | Consistent | Contradict |
|---|---|---|---|---|---|---|
| - | - | Human | - | 81.7 | 79.5 | 83.9 |
| 1 | DeepSeek | DeepSeek-R1 | THINK, OPEN | 65.0 | 65.0 | 65.0 |
| 2 | Google DeepMind | Gemini-2.5-Pro | THINK | 64.3 | 52.0 | 76.8 |
| 3 | Google DeepMind | Gemini-2.5-Flash | THINK | 63.4 | 55.2 | 71.6 |
| 4 | Alibaba Cloud | Qwen3-235B-A22B (+RAG top-40) | RAG, THINK, OPEN | 63.1 | 56.9 | 69.3 |
| 5 | DeepResearch | OpenAI DeepResearch | THINK | 62.5 | 58.4 | 66.7 |
| 6 | DeepSeek | DeepSeek-R1 + Many-Shot ICL | THINK, OPEN | 62.3 | 62.0 | 62.7 |
| 7 | Google DeepMind | Gemini-2.5-Flash (+RAG top-40) | RAG, THINK | 60.7 | 45.8 | 75.6 |
| 8 | Alibaba Cloud | Qwen3-32B (+RAG top-40) | RAG, THINK, OPEN | 60.5 | 60.0 | 61.0 |
| 9 | OpenAI | GPT-4o | RAG | 60.2 | 50.8 | 69.6 |
| 10 | Alibaba Cloud | Qwen3-32B (SFT, +RAG top-40) | SFT, RAG, THINK, OPEN | 59.7 | 60.1 | 59.2 |
| 11 | Google DeepMind | Gemini-2.5-Pro + Many-Shot ICL | THINK | 59.5 | 46.0 | 73.0 |
| 12 | DeepSeek | DeepSeek-R1 (+RAG top-40) | RAG, THINK, OPEN | 59.1 | 42.4 | 75.9 |
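
The Macro-Avg column appears to be the unweighted mean of the Consistent and Contradict F1 scores. This is an inference from the reported numbers, not something the leaderboard states, and a few rows differ by 0.1 from the mean of the rounded class scores, presumably because the average was taken before rounding. A minimal Python sketch of that relationship follows; the function name `macro_avg_f1` and the `rows` list are illustrative, not part of any released evaluation code.

```python
def macro_avg_f1(consistent_f1: float, contradict_f1: float) -> float:
    """Unweighted mean of the two per-class F1 scores (assumed definition)."""
    return (consistent_f1 + contradict_f1) / 2


# (label, reported Macro-Avg, Consistent F1, Contradict F1), copied from the table.
rows = [
    ("Human", 81.7, 79.5, 83.9),
    ("DeepSeek-R1", 65.0, 65.0, 65.0),
    ("Gemini-2.5-Pro", 64.3, 52.0, 76.8),
    ("DeepSeek-R1 (+RAG top-40)", 59.1, 42.4, 75.9),
]

for label, reported, consistent, contradict in rows:
    recomputed = macro_avg_f1(consistent, contradict)
    # Allow a small tolerance because the table shows values rounded to one decimal.
    status = "ok" if abs(recomputed - reported) <= 0.1 + 1e-9 else "mismatch"
    print(f"{label}: reported={reported}, recomputed={recomputed:.2f} ({status})")
```

A macro average weights both classes equally, so a model with a high Contradict score and a low Consistent score (such as the RAG variant of DeepSeek-R1) can rank below a model that is balanced across the two.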