Table/box layout causes wrong reading order
GitHub issue: #1863
PDF: understanding-llms.pdf
Status: Open
Categories: parsing, chunking, content-pillars
Deck name: 05-how-llms-work
Document type: Academic paper (arXiv preprint, Elsevier format)
Problem
When a PDF has an ARTICLE INFO / ABSTRACT side-by-side table on page 1, Docling reads the left column (ARTICLE INFO) first, jumps down past the table to the Introduction, and only then returns to the right column (ABSTRACT). The resulting chunks place the Introduction before the Abstract.
Evidence
PDF visual
Page 1 is single-column with a two-cell table near the top:
┌─────────────────────────────────────────┐
│ Title + Authors                         │
├───────────────────┬─────────────────────┤
│ ARTICLE INFO      │ ABSTRACT            │
│ Keywords: LLMs    │ The introduction    │
│ Training          │ of ChatGPT has...   │
│ Inference         │                     │
│ Survey            │                     │
├───────────────────┴─────────────────────┤
│ 1. Introduction                         │
│ Language modeling (LM) is a...          │
└─────────────────────────────────────────┘
Docling JSON reading order (texts[] with bbox y-coordinates)
[12] p.1 y=500 section_header "ARTICLE INFO"
[13] p.1 y=480 text "Keywords : Large Language Models Training Inference Survey"
[14] p.1 y=348 section_header "1. Introduction" ← jumps below the table
[15] p.1 y=330 text "Language modeling (LM) is a fundamental..."
[16] p.1 y=175 text "The Transformer architecture is exceptionally..."
[17] p.1 y=86 footnote "Corresponding author ORCID(s):"
[18] p.1 y=499 section_header "ABSTRACT" ← jumps back UP to table
[19] p.1 y=480 text "The introduction of ChatGPT has led to..."
Items 14-17 (Introduction) appear before item 18 (ABSTRACT) even though the Abstract is physically above the Introduction on the page.
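The jump can be flagged mechanically. The following is a minimal sketch, not Docling's API: it copies the indices and y-coordinates from the listing above into a hypothetical `Item` tuple and assumes, as the listing suggests, that a larger y means higher on the page. The `tolerance` value is an arbitrary assumption. On a single-column page, any same-page item that sits well above its predecessor is suspicious.

```python
from typing import List, NamedTuple, Tuple


class Item(NamedTuple):
    """Hypothetical stand-in for one texts[] entry (not a Docling type)."""
    index: int
    page: int
    y: float  # assumed: larger y = higher on the page
    text: str


# Values copied from the reading-order listing in this issue.
ITEMS = [
    Item(12, 1, 500, "ARTICLE INFO"),
    Item(13, 1, 480, "Keywords : ..."),
    Item(14, 1, 348, "1. Introduction"),
    Item(15, 1, 330, "Language modeling (LM) is a fundamental..."),
    Item(16, 1, 175, "The Transformer architecture is exceptionally..."),
    Item(17, 1, 86, "Corresponding author ORCID(s):"),
    Item(18, 1, 499, "ABSTRACT"),
    Item(19, 1, 480, "The introduction of ChatGPT has led to..."),
]


def find_upward_jumps(items: List[Item], tolerance: float = 5.0) -> List[Tuple[int, int]]:
    """Return (previous index, current index) pairs where reading order moves up the page."""
    jumps = []
    for prev, cur in zip(items, items[1:]):
        if cur.page == prev.page and cur.y > prev.y + tolerance:
            jumps.append((prev.index, cur.index))
    return jumps


print(find_upward_jumps(ITEMS))  # [(17, 18)]
```

On a genuinely multi-column page an upward jump is expected at each column break, so this check only makes sense for pages known to be single-column.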
Docling chunks output
chunk 0:
  headings: ['Understanding LLMs: A Comprehensive Overview from Training to Inference']
  text: 'Yiheng Liu a , Hao He a , Tianle Han a ...' (title + authors)
chunk 1:
  headings: ['ARTICLE INFO']
  text: 'Keywords : Large Language Models Training Inference Survey'
chunk 2:
  headings: ['1. Introduction']
  text: 'Language modeling (LM) is a fundamental approach...' ← before Abstract
chunk 3:
  headings: ['ABSTRACT']
  text: 'The introduction of ChatGPT has led to a significant increase...' ← after Introduction
Markdown output
Same misordering: the Introduction starts at char ~800 and the Abstract at char ~3224.
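A regression check on the Markdown export could compare the character offsets of the two headings. This is a rough sketch: `heading_order_ok` is a hypothetical helper, and `sample` is a shortened stand-in for the real export (the `## ` heading prefix is an assumption about how the headings are rendered), not actual Docling output.

```python
def heading_order_ok(markdown: str, first: str, second: str) -> bool:
    """Return True if `first` occurs before `second` in the Markdown text."""
    first_at = markdown.find(first)
    second_at = markdown.find(second)
    if first_at == -1 or second_at == -1:
        raise ValueError("heading not found in Markdown output")
    return first_at < second_at


# Shortened stand-in that reproduces the misordering described in this issue.
sample = (
    "## ARTICLE INFO\n\nKeywords : ...\n\n"
    "## 1. Introduction\n\nLanguage modeling (LM) is a fundamental...\n\n"
    "## ABSTRACT\n\nThe introduction of ChatGPT has led to...\n"
)

# The Abstract should come first; with the bug present this prints False.
print(heading_order_ok(sample, "ABSTRACT", "1. Introduction"))
```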
Root cause
Docling's layout analysis reads the ARTICLE INFO / ABSTRACT table as two separate columns rather than a single table row. It processes the left column, exits the table downward into the Introduction section, then returns to read the right column.
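Until the layout analysis is fixed, one possible post-processing workaround is to move any run of items that jumps back up the page to its visual position. The sketch below is an assumption-laden illustration, not Docling code: `Item` and `reorder_single_column` are hypothetical names, the data is copied from the reading-order listing in this issue, larger y is assumed to mean higher on the page, and the heuristic is only valid for single-column pages.

```python
from typing import List, NamedTuple


class Item(NamedTuple):
    """Hypothetical stand-in for one texts[] entry (not a Docling type)."""
    index: int
    page: int
    y: float  # assumed: larger y = higher on the page
    text: str


ITEMS = [
    Item(12, 1, 500, "ARTICLE INFO"),
    Item(13, 1, 480, "Keywords : ..."),
    Item(14, 1, 348, "1. Introduction"),
    Item(15, 1, 330, "Language modeling (LM) is a fundamental..."),
    Item(16, 1, 175, "The Transformer architecture is exceptionally..."),
    Item(17, 1, 86, "Corresponding author ORCID(s):"),
    Item(18, 1, 499, "ABSTRACT"),
    Item(19, 1, 480, "The introduction of ChatGPT has led to..."),
]


def reorder_single_column(items: List[Item], tolerance: float = 5.0) -> List[Item]:
    """Reinsert runs that jump back up the page before the content below them."""
    result: List[Item] = []
    i = 0
    while i < len(items):
        cur = items[i]
        jumped_up = (
            bool(result)
            and cur.page == result[-1].page
            and cur.y > result[-1].y + tolerance
        )
        if not jumped_up:
            result.append(cur)
            i += 1
            continue
        # Collect the run: following same-page items that keep moving down.
        j = i + 1
        while (
            j < len(items)
            and items[j].page == cur.page
            and items[j].y <= items[j - 1].y + tolerance
        ):
            j += 1
        run = items[i:j]
        run_bottom = min(item.y for item in run)
        # Insert before the first placed item on this page that lies below the run.
        insert_at = len(result)
        for k, placed in enumerate(result):
            if placed.page == cur.page and placed.y < run_bottom - tolerance:
                insert_at = k
                break
        result[insert_at:insert_at] = run
        i = j
    return result


print([item.index for item in reorder_single_column(ITEMS)])
# [12, 13, 18, 19, 14, 15, 16, 17]
```

With this ordering the ARTICLE INFO cell is still read before the ABSTRACT cell, but both now precede the Introduction. Applying the same heuristic to a true two-column page would scramble it, so it would need to be gated on a per-page layout check.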
Impact
- Content pillars may group Abstract text with Introduction content
- Topic structure reflects wrong document order
- Summaries may conflate Introduction and Abstract
- Common in Elsevier-format preprints, including arXiv papers that use this layout
- Affects any PDF with side-by-side sections in a table/box