apps/learning-api/evals-playground/known-issues/docling-table-reading-order/README.md

Table/box layout causes wrong reading order

GitHub issue: #1863
PDF: understanding-llms.pdf
Status: Open
Categories: parsing, chunking, content-pillars
Deck name: 05-how-llms-work
Document type: Academic paper (arXiv preprint, Elsevier format)

Problem

When page 1 of a PDF has a side-by-side ARTICLE INFO / ABSTRACT table, Docling reads the left column (ARTICLE INFO) first, jumps down past the table to the Introduction, then comes back up to read the right column (ABSTRACT). The resulting chunks place the Introduction before the Abstract.

Evidence

PDF visual

Page 1 is single-column with a two-cell table near the top:

┌─────────────────────────────────────────┐
│  Title + Authors                        │
├───────────────────┬─────────────────────┤
│  ARTICLE INFO     │  ABSTRACT           │
│  Keywords: LLMs   │  The introduction   │
│  Training         │  of ChatGPT has...  │
│  Inference        │                     │
│  Survey           │                     │
├───────────────────┴─────────────────────┤
│  1. Introduction                        │
│  Language modeling (LM) is a...         │
└─────────────────────────────────────────┘

Docling JSON reading order (texts[] with bbox y-coordinates)

[12] p.1  y=500  section_header  "ARTICLE INFO"
[13] p.1  y=480  text            "Keywords : Large Language Models Training Inference Survey"
[14] p.1  y=348  section_header  "1. Introduction"          ← jumps below the table
[15] p.1  y=330  text            "Language modeling (LM) is a fundamental..."
[16] p.1  y=175  text            "The Transformer architecture is exceptionally..."
[17] p.1  y=86   footnote        "Corresponding author ORCID(s):"
[18] p.1  y=499  section_header  "ABSTRACT"                 ← jumps back UP to table
[19] p.1  y=480  text            "The introduction of ChatGPT has led to..."

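Items 14-17 (the Introduction and the page footnote) appear before item 18 (ABSTRACT), even though the Abstract (y=499) sits physically above the Introduction (y=348) on the page. In this listing a larger y value means higher on the page.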
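A jump like this can be flagged mechanically. The following is a minimal sketch, not part of Docling: the `TextItem` tuple, the `find_upward_jumps` helper, and the 50-unit threshold are hypothetical, and the rows are hand-copied from the listing above. It assumes that a larger y means higher on the page, and it is only meaningful for single-column pages such as this one, since a genuine column break also moves upward.

```python
from typing import List, NamedTuple, Tuple


class TextItem(NamedTuple):
    index: int   # position in texts[]
    page: int
    y: float     # assumed: larger y = higher on the page
    label: str
    text: str


# Rows copied by hand from the reading-order listing in this README.
ITEMS = [
    TextItem(12, 1, 500, "section_header", "ARTICLE INFO"),
    TextItem(13, 1, 480, "text", "Keywords : Large Language Models Training Inference Survey"),
    TextItem(14, 1, 348, "section_header", "1. Introduction"),
    TextItem(15, 1, 330, "text", "Language modeling (LM) is a fundamental..."),
    TextItem(16, 1, 175, "text", "The Transformer architecture is exceptionally..."),
    TextItem(17, 1, 86, "footnote", "Corresponding author ORCID(s):"),
    TextItem(18, 1, 499, "section_header", "ABSTRACT"),
    TextItem(19, 1, 480, "text", "The introduction of ChatGPT has led to..."),
]


def find_upward_jumps(items: List[TextItem], min_jump: float = 50.0) -> List[Tuple[int, int]]:
    """Return (previous index, current index) pairs where reading order
    moves up the same page by more than min_jump units."""
    jumps = []
    for prev, cur in zip(items, items[1:]):
        if cur.page == prev.page and cur.y - prev.y > min_jump:
            jumps.append((prev.index, cur.index))
    return jumps


print(find_upward_jumps(ITEMS))
```

On the rows above this reports the single pair `(17, 18)`, the step from the footnote back up to ABSTRACT.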

Docling chunks output

chunk 0:
  headings: ['Understanding LLMs: A Comprehensive Overview from Training to Inference']
  text: 'Yiheng Liu a , Hao He a , Tianle Han a ...'  (title + authors)

chunk 1:
  headings: ['ARTICLE INFO']
  text: 'Keywords : Large Language Models Training Inference Survey'

chunk 2:
  headings: ['1. Introduction']
  text: 'Language modeling (LM) is a fundamental approach...'  ← before Abstract

chunk 3:
  headings: ['ABSTRACT']
  text: 'The introduction of ChatGPT has led to a significant increase...'  ← after Introduction
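The same misordering can be checked at the chunk level. The sketch below is illustrative only: `CHUNK_HEADINGS` is hand-copied from the chunk output above, and `first_chunk_with` and `abstract_after_introduction` are hypothetical helper names, not Docling APIs. It assumes each chunk exposes a list of heading strings, as the output above suggests.

```python
from typing import List, Optional

# Headings of chunks 0-3, copied by hand from the output in this README.
CHUNK_HEADINGS = [
    ["Understanding LLMs: A Comprehensive Overview from Training to Inference"],
    ["ARTICLE INFO"],
    ["1. Introduction"],
    ["ABSTRACT"],
]


def first_chunk_with(headings_per_chunk: List[List[str]], needle: str) -> Optional[int]:
    """Index of the first chunk whose headings contain needle (case-insensitive)."""
    for i, headings in enumerate(headings_per_chunk):
        if any(needle.lower() in h.lower() for h in headings):
            return i
    return None


def abstract_after_introduction(headings_per_chunk: List[List[str]]) -> bool:
    """True when both sections exist and the Abstract chunk comes later."""
    abstract = first_chunk_with(headings_per_chunk, "abstract")
    intro = first_chunk_with(headings_per_chunk, "introduction")
    return abstract is not None and intro is not None and abstract > intro


print(abstract_after_introduction(CHUNK_HEADINGS))
```

For the four chunks above the check returns `True`, because ABSTRACT is chunk 3 and the Introduction is chunk 2.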

Markdown output

The Markdown export shows the same misordering: the Introduction starts at around character 800 and the Abstract at around character 3224.

Root cause

Docling's layout analysis reads the ARTICLE INFO / ABSTRACT table as two separate columns rather than a single table row. It processes the left column, exits the table downward into the Introduction section, then returns to read the right column.

Impact

  • Content pillars may group Abstract text with Introduction content
  • Topic structure reflects wrong document order
  • Summaries may conflate Introduction and Abstract
  • Common in Elsevier-format preprints and in arXiv papers that use this layout
  • Affects any PDF with side-by-side sections in a table/box