Empty section headers dropped by docling chunker
GitHub issue: #1861
PDF: virtue-ethics.pdf
Status: Open
Categories: parsing, chunking, content-pillars, mindmaps, flashcards
Deck name: 2.virtueethics&feministethics
Document type: University lecture slides (ethics/psychology)
Problem
When a section header has no body text before the next header, docling's chunker drops it entirely. It doesn't appear in any chunk's text or headings array. The heading is silently lost from all downstream features.
Evidence
PDF visual
Page 2 of the PDF shows a slide with this structure:
- "Determining moral standards" (heading with body text below)
- "Theories of right conduct" (heading with NO body text — next item is another heading)
- "Central questions of normative ethics:" (heading with bullet points)
Extracted markdown (from md_url)
- Standards can help in applying principles - they are based on principles
- One principle may underlie different norms (interpretations)
## Theories of right conduct
## Central questions of normative ethics:
- What moral principles should we accept?
- What makes right actions right?
- Is there a single fundamental principle of morality?
The heading exists in markdown — docling's document parser extracted it correctly.
Docling chunks output (from docling_chunks_url)
chunk 2:
chunk_index: 2
headings: ['Determining moral standards']
page_start: 1, page_end: 2
text: 'The approach we take to the question of ethics already fashions and delimits
the very question of what we consider to be ethical or potentially ethical
in the first place...'
chunk 3:
chunk_index: 3
headings: ['Central questions of normative ethics:']
page_start: 2, page_end: 2
text: '- What moral principles should we accept?
- What makes right actions right?
- Is there a single fundamental principle of morality?'
chunk 4:
chunk_index: 4
headings: ['Normative applied ethics']
page_start: 2, page_end: 2
text: '- The top-down approach to applied ethics:
- Work out the correct normative theory in general terms...'
"Theories of right conduct" is completely absent:
- Not in chunk 3's headings breadcrumb (should be a parent heading)
- Not as its own chunk
- Not in any chunk's text
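The absence check above can be automated by comparing the headings in the extracted markdown against every chunk's headings and text. The following is a minimal sketch, not part of docling: the helper names `markdown_headings` and `dropped_headings` are hypothetical, and it assumes each chunk is a plain dict with the `headings` and `text` fields shown in the chunk output.

```python
import re


def markdown_headings(markdown: str) -> list[str]:
    """Return the text of every ATX heading ('#' to '######') in the markdown."""
    pattern = re.compile(r"^#{1,6}[ \t]+(.+)$", flags=re.MULTILINE)
    return [m.group(1).strip() for m in pattern.finditer(markdown)]


def dropped_headings(markdown: str, chunks: list[dict]) -> list[str]:
    """List headings that appear in the markdown but in no chunk's headings or text."""
    seen = set()
    for chunk in chunks:
        seen.update(chunk.get("headings", []))
    all_text = "\n".join(chunk.get("text", "") for chunk in chunks)
    return [h for h in markdown_headings(markdown) if h not in seen and h not in all_text]


# Abbreviated sample data copied from the evidence in this issue.
md = (
    "- One principle may underlie different norms (interpretations)\n"
    "## Theories of right conduct\n"
    "## Central questions of normative ethics:\n"
    "- What moral principles should we accept?\n"
)
chunks = [
    {
        "headings": ["Central questions of normative ethics:"],
        "text": "- What moral principles should we accept?",
    },
]

missing = dropped_headings(md, chunks)
print(missing)
```

Run against the full markdown and chunk files for a deck, a check like this would flag every heading lost in the same way, which helps gauge how widespread the problem is.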
Content pillar impact
The content pillar for this deck has no topic or subtopic referencing "Theories of right conduct." The structural grouping that this heading provides (it's a parent category for multiple ethical frameworks) is lost.
Root cause
Docling's hierarchical chunker (docling_core.transforms.chunker) skips headings that have no body text before the next heading. The heading is in the document tree but gets no chunk assignment because there's no text content to attach to it.
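The mechanism can be illustrated with a toy model. This is a simplified sketch and not docling's actual code: it assumes a chunker that emits a chunk only when it reaches a text item, and that tracks one current heading per level, so a new heading replaces the previous heading at the same level. The `Item` class and `toy_hierarchical_chunks` function are hypothetical names, and treating all three headings as level 1 is an assumption based on their identical "##" markers in the extracted markdown.

```python
from dataclasses import dataclass


@dataclass
class Item:
    kind: str       # "heading" or "text"
    text: str
    level: int = 1  # assumption: all headings in the sample share one level


def toy_hierarchical_chunks(items: list[Item]) -> list[dict]:
    """Toy model: chunks are created only for text items, never for headings."""
    heading_by_level: dict[int, str] = {}
    chunks = []
    for item in items:
        if item.kind == "heading":
            # A new heading replaces any tracked heading at the same or deeper level.
            for lvl in [l for l in heading_by_level if l >= item.level]:
                del heading_by_level[lvl]
            heading_by_level[item.level] = item.text
        else:
            chunks.append({
                "headings": [heading_by_level[l] for l in sorted(heading_by_level)],
                "text": item.text,
            })
    return chunks


items = [
    Item("heading", "Determining moral standards"),
    Item("text", "- Standards can help in applying principles - they are based on principles"),
    Item("heading", "Theories of right conduct"),  # no body text follows
    Item("heading", "Central questions of normative ethics:"),
    Item("text", "- What moral principles should we accept?"),
]

chunks = toy_hierarchical_chunks(items)
for chunk in chunks:
    print(chunk["headings"])
```

In this model "Theories of right conduct" is overwritten by the next heading before any text item arrives, so it never reaches a chunk. That matches the observed output, but whether the real chunker loses the heading in exactly this way would need to be confirmed against the docling_core source.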
Impact
- Content pillars miss a structural grouping heading
- Flashcards/quizzes can't reference this concept
- Mindmaps miss a branch that would organize sub-topics
- Affects any document where headings act as category groupers (common in textbooks, lecture slides)
- Particularly common in slide-based documents where headings introduce sections without prose