LongDocURL

Abstract

Large vision language models (LVLMs) have remarkably improved document understanding capabilities, enabling the handling of complex document elements, longer contexts, and a wider range of tasks. However, existing document understanding benchmarks are limited to a small number of pages and fail to provide a comprehensive analysis of layout element locating. In this paper, we first define three primary task categories: Long Document Understanding, numerical Reasoning, and cross-element Locating. We then propose a comprehensive benchmark, LongDocURL, that integrates these three primary tasks and comprises 20 sub-tasks categorized by primary task and answer evidence. Furthermore, we develop a semi-automated construction pipeline and collect 2,325 high-quality question-answering pairs covering more than 33,000 pages of documents, substantially more than existing benchmarks. Subsequently, we conduct comprehensive evaluation experiments on both open-source and closed-source models across 26 different configurations, revealing critical performance gaps in this field.

Leaderboard

Column groups: Task Type (U, R, L); Evidence Element (TXT, LAY, FIG, TAB); Page/Element (SP, MP, CE).

#	Model	Organization	Size	U	R	L	TXT	LAY	FIG	TAB	SP	MP	CE	Total
1	GPT-4o-24-05-13 🥇	OpenAI	-	68.6	59.9	59.6	70.7	60.0	67.4	60.3	65.8	63.2	65.4	64.5
2	Gemini-1.5-Pro 🥈	Google	-	55.7	43.4	46.4	58.7	50.4	50.0	41.8	48.7	52.8	49.9	50.9
3	Qwen-VL-Max 🥉	Alibaba	-	58.8	43.9	36.0	58.0	40.2	52.3	44.6	51.6	47.6	48.0	49.5
4	Qwen2-VL	Alibaba	7B	36.9	24.8	22.6	37.7	29.7	28.6	23.7	27.2	33.6	29.9	30.6
5	LLaVA-OneVision-Chat	Bytedance & NTU S-Lab	7B	30.5	19.0	18.7	32.2	26.5	24.4	15.4	19.8	29.7	24.2	25.0
6	LLaVA-Next-Interleave-DPO	Bytedance & HKUST	7B	21.6	13.9	7.6	22.5	13.9	15.4	8.7	12.1	19.8	13.5	16.2
7	Llama-3.2	Meta	11B	12.9	9.4	2.7	11.8	6.9	8.7	6.3	7.9	10.3	6.8	9.2
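
To make the gap between proprietary and open-source models concrete, the following minimal sketch recomputes it from the Total column of the leaderboard. The grouping of models into open-source and proprietary is our own labeling for illustration (the leaderboard itself only lists a parameter size for some models), and the names `TOTALS`, `OPEN_SOURCE`, and `best` are hypothetical.

```python
# Total scores copied from the leaderboard (model -> Total).
TOTALS = {
    "GPT-4o-24-05-13": 64.5,
    "Gemini-1.5-Pro": 50.9,
    "Qwen-VL-Max": 49.5,
    "Qwen2-VL": 30.6,
    "LLaVA-OneVision-Chat": 25.0,
    "LLaVA-Next-Interleave-DPO": 16.2,
    "Llama-3.2": 9.2,
}

# Assumption: the models listed with a parameter size are the open-source ones.
OPEN_SOURCE = {
    "Qwen2-VL",
    "LLaVA-OneVision-Chat",
    "LLaVA-Next-Interleave-DPO",
    "Llama-3.2",
}


def best(models):
    """Return the model with the highest Total score among `models`."""
    return max(models, key=TOTALS.get)


best_open = best(OPEN_SOURCE)
best_closed = best(set(TOTALS) - OPEN_SOURCE)
gap = round(TOTALS[best_closed] - TOTALS[best_open], 1)

print(best_closed, best_open, gap)  # GPT-4o-24-05-13 Qwen2-VL 33.9
```

Under this grouping, the best open-source entry trails the best proprietary entry by 33.9 points on Total.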

Statistics

Inner: divided by the primary task categories (Understanding, Reasoning, and Locating).
Middle: divided by the number of answer evidence pages (Single-Page, Multi-Page) and the number of types of evidence elements (Cross-Element).
Outer: divided by the types of evidence elements (Text, Table, Figure, Layout).
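
The three rings above can be read as labels derived from a question's metadata. The sketch below is one possible reading, not the authors' implementation: it assumes each question records its evidence pages and evidence element types, and the names `QARecord`, `page_label`, and `is_cross_element` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class QARecord:
    task: str                    # "Understanding", "Reasoning", or "Locating"
    evidence_pages: list[int]    # pages that contain the answer evidence
    evidence_sources: list[str]  # e.g. "Text", "Table", "Figure", "Layout"


def page_label(rec: QARecord) -> str:
    """Single-Page if all evidence sits on one page, otherwise Multi-Page."""
    pages = set(rec.evidence_pages)
    if not pages:
        raise ValueError("record has no evidence pages")
    return "Single-Page" if len(pages) == 1 else "Multi-Page"


def is_cross_element(rec: QARecord) -> bool:
    """Cross-Element if the evidence spans more than one element type."""
    return len(set(rec.evidence_sources)) > 1


# Metadata in the style of the data examples; the task value is a placeholder.
rec = QARecord(task="Reasoning", evidence_pages=[26, 27], evidence_sources=["Table"])
print(page_label(rec), is_cross_element(rec))  # Multi-Page False
```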

Distribution statistics of our dataset across
(a) document pages, (b) answer evidence pages, (c) evidence page length, (d) document sources, and (e) evidence element types.

Benchmark Comparison

Comparison with other datasets in average pages and text tokens per document.

Comparison of dataset attributes between LongDocURL and MMLongBench-Doc.

Comparison between LongDocURL and previous document understanding datasets.
Task types: (U)nderstanding, (R)easoning, and (L)ocating.

Construction Pipeline

Overview of our semi-automated construction pipeline. The pipeline comprises four modules:
(a) Extract & Filter; (b) QA Generation; (c) Automated Verification; (d) Human Verification.

Main Results

Generalized accuracy scores (0~1) on LongDocURL. There are 3 types of tasks: (U)nderstanding, (R)easoning, and (L)ocating. There are 4 types of evidence elements: pure text (TXT), layout (LAY), chart & image (FIG), and table (TAB). There are 3 types of evidence pages/elements: single-page (SP), multi-page (MP), and cross-element (CE). CTi: Cross-Title, CTa: Cross-Table, PTi: Para-Title, FTa: Figure-Table. The highest scores among models in each section are highlighted in green.
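
As a rough illustration of how per-column numbers of this kind could be produced, the sketch below averages per-question scores over every label a question carries. It assumes each question already has a score in [0, 1] and a set of labels drawn from the abbreviations above; the scoring rule itself is not specified in this section, and the function name `aggregate` and the sample data are hypothetical.

```python
from collections import defaultdict


def aggregate(results):
    """Average scores per label.

    `results` is an iterable of (score, labels) pairs, where score is in [0, 1]
    and labels is a set such as {"U", "TAB", "MP"}. Every question also counts
    toward "Total".
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for score, labels in results:
        for label in set(labels) | {"Total"}:
            sums[label] += score
            counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}


# Made-up scores, for illustration only.
sample = [
    (1.0, {"U", "TAB", "MP"}),
    (0.0, {"R", "TAB", "SP"}),
    (0.5, {"L", "TXT", "LAY", "MP", "CE"}),
]
scores = aggregate(sample)
print(scores["Total"], scores["TAB"], scores["MP"])  # 0.5 0.5 0.75
```

Because a question can carry several labels, the per-label averages overlap and do not need to sum to the Total.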

Fine-grained Results

We select three proprietary and three open-source models for further analysis by
(Left) task types, document elements, evidence pages, and
(Right) document sources.

Ablation of Input Paradigms

Comparison of different input paradigms on a 20% subset of the data.

Data Examples

(Top) Thumbnail of an example document. Orange boxes indicate answer evidence pages.
(Bottom) Data examples generated from the document, with screenshots of the relevant parts of the answer evidence pages.

Case 1. Evidence source: ["Table"]. Evidence pages: [26, 27].
The correct extracted information and reasoning are colored in green, and the wrong ones are colored in red.

Case 2. Evidence source: ["Table"]. Evidence pages: [110, 111].
The correct extracted information and reasoning are colored in green, and the wrong ones are colored in red.

Data Example of Understanding QA.

Data Example of Reasoning QA.

Data Example of Locating QA.

LongDocURL

a Comprehensive Multimodal Long Document Benchmark
Integrating Understanding, Reasoning, and Locating
