Scans to structured text

Not every book is available as an audiobook. Military history, academic texts, out-of-print nonfiction—if you want to listen to them, you have to make them yourself.

Raw OCR gets you text, but text mixed with garbage: page numbers read aloud, running headers interrupting every page, image captions scattered through the body, and footnote markers with no footnotes attached. The result is technically an audiobook, but not one you'd want to listen to.

Matching the quality of a professionally produced Audible audiobook, with no issues a listener would notice, turns out to be an unsolved problem. Those audiobooks have human editors. Reaching that quality with software alone, with little to no human intervention, is the challenge shelf is designed to solve.

Processing Pipeline

Once you have a book scan (either from a sheet-fed scanner after cutting the spine, or a non-destructive overhead scanner), shelf runs it through a multi-stage AI pipeline. Each stage builds on the last: multiple OCR providers extract the text, LLMs classify and label the structure (body text vs headers vs footnotes vs page numbers), the table of contents is extracted and linked to actual pages, and finally clean ePub files are produced—text that flows when read aloud.
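As a rough illustration of "each stage builds on the last", the pipeline can be pictured as an ordered list of functions, each taking the previous stage's output. This is a minimal sketch, not shelf's actual code: the stage names mirror the stage cards on this page, while `run_pipeline`, `passthrough`, and the dictionary-shaped book state are assumptions made for the example.

```python
from typing import Callable

# A stage takes the accumulated book state and returns an updated copy.
Stage = Callable[[dict], dict]

# Stage names follow the pipeline description; the order is the order shown on this page.
STAGE_ORDER = [
    "ocr_pages",
    "label_structure",
    "extract_toc",
    "link_toc",
    "build_structure",
    "generate_output",
]


def run_pipeline(book: dict, stages: list[tuple[str, Stage]]) -> dict:
    """Run each stage in order, feeding it the output of the previous one."""
    for name, stage in stages:
        book = stage(book)
        # Record progress so a later run could tell which stages already finished.
        book.setdefault("completed", []).append(name)
    return book


def passthrough(book: dict) -> dict:
    """Hypothetical placeholder stage: returns a copy and changes nothing."""
    return dict(book)


result = run_pipeline(
    {"title": "Example scan"},
    [(name, passthrough) for name in STAGE_ORDER],
)
print(result["completed"])
```

Keeping stages as plain functions over one state object makes it easy to rerun a single stage after a failure, which matters when the expensive steps are OCR and LLM calls.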

📷

OCR Pages

Extract text from scanned page images using vision AI models

  • mistral
    Extract text using Mistral vision model
  • olm
    Extract text using the olmOCR model
  • paddle
    Extract text using PaddleOCR
  • blend
    Combine OCR outputs into best-quality text
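
How the blend step chooses between providers is not described here. One simple possibility is a line-by-line majority vote, sketched below. It assumes the providers return the same number of lines in the same order, which real OCR output often will not; `blend_lines` and the sample strings are hypothetical.

```python
from collections import Counter
from itertools import zip_longest


def blend_lines(outputs: dict[str, str]) -> str:
    """Pick, for each line position, the reading most providers agree on.

    Assumption: line N from one provider corresponds to line N from the others.
    """
    per_provider = [text.splitlines() for text in outputs.values()]
    blended = []
    for candidates in zip_longest(*per_provider, fillvalue=""):
        # Collapse whitespace so spacing differences do not split the vote.
        normalized = [" ".join(candidate.split()) for candidate in candidates]
        winner, _votes = Counter(normalized).most_common(1)[0]
        blended.append(winner)
    return "\n".join(blended)


outputs = {
    "mistral": "The siege began in May.\nSupplies ran low.",
    "olm": "The siege began in May.\nSupp1ies ran low.",
    "paddle": "The siege began in Mav.\nSupplies ran low.",
}
blended = blend_lines(outputs)
print(blended)
```

Each provider makes a different mistake in the sample, so the vote recovers both lines. When every provider disagrees, `Counter.most_common` falls back to the first reading encountered, so provider order acts as a tiebreaker.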
🏷️

Label Structure

Classify content blocks as body text, headers, footnotes, or page numbers

  • mechanical
    Extract patterns using regex and heuristics
  • unified
    Classify page elements with vision LLM
  • gap_analysis
    Identify pages with missing or uncertain labels
  • agent_healing
    Fix classification gaps using LLM agent
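
The mechanical and gap_analysis steps can be pictured with a small sketch: regex and repetition heuristics label what they can, and any page left with an unlabeled line is flagged for a later pass. The patterns, label names, and function names below are assumptions for illustration, not shelf's actual rules.

```python
import re
from collections import Counter

# Heuristic patterns (assumptions): a bare arabic or lowercase roman numeral is a
# page number; a short number followed by text looks like a footnote.
PAGE_NUMBER = re.compile(r"^(\d{1,4}|[ivxlc]{1,7})$")
FOOTNOTE = re.compile(r"^\d{1,3}[.)]?\s+\S")


def find_running_headers(pages, min_pages=2):
    """Treat a first line that repeats across pages as a running header."""
    firsts = Counter(page[0].strip() for page in pages if page)
    return {text for text, count in firsts.items() if text and count >= min_pages}


def label_line(text, position, running_headers):
    text = text.strip()
    if PAGE_NUMBER.match(text):
        return "page_number"
    if position == 0 and text in running_headers:
        return "header"
    if FOOTNOTE.match(text):
        return "footnote"
    if text.isupper():
        # Could be a chapter heading or a header variant: leave it unlabeled.
        return None
    return "body"


def label_pages(pages):
    headers = find_running_headers(pages)
    return [
        [label_line(line, position, headers) for position, line in enumerate(page)]
        for page in pages
    ]


def gap_pages(labels):
    """Return indices of pages that still contain an unlabeled line."""
    return [index for index, page in enumerate(labels) if None in page]


pages = [
    ["THE LONG WAR", "The column moved at dawn.", "12 See the divisional diary.", "41"],
    ["THE LONG WAR", "AFTERMATH", "Nothing remained of the bridge.", "42"],
]
labels = label_pages(pages)
gaps = gap_pages(labels)
print(labels, gaps)
```

In this sample the second page is flagged because "AFTERMATH" is ambiguous to the heuristics. That is the kind of case a vision LLM or healing agent would then be asked to resolve.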
📑

Extract ToC

Identify and extract the table of contents from OCR output

  • find
    Locate ToC pages using vision agent
  • extract
    Parse ToC entries from identified pages
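
Once the ToC pages are located, parsing entries largely means splitting each line into a title and a page label. The sketch below does this with a single regular expression. It assumes one entry per line with the page label at the end, which multi-line or multi-column contents pages will break; `TocEntry` and `parse_toc` are hypothetical names.

```python
import re
from dataclasses import dataclass

# Title, optional dot leaders, then an arabic or lowercase roman page label.
TOC_LINE = re.compile(r"^(?P<title>.+?)[\s.·]*\s(?P<page>\d{1,4}|[ivxlc]+)\s*$")


@dataclass(frozen=True)
class TocEntry:
    title: str
    page_label: str  # kept as text because front matter often uses roman numerals


def parse_toc(text: str) -> list[TocEntry]:
    entries = []
    for line in text.splitlines():
        match = TOC_LINE.match(line.strip())
        if match:
            entries.append(TocEntry(match["title"], match["page"]))
    return entries


toc_text = """Contents
Preface ix
1. The Road to War ........ 17
2. Crossing the River . . . 58
"""
entries = parse_toc(toc_text)
for entry in entries:
    print(entry)
```

The page label here is the number printed in the book, not the index of the scanned image. Converting one to the other is the job of the linking stage.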
🔗

Link ToC

Map table of contents entries to their corresponding page numbers

  • find_entries
    Locate each ToC entry in page content
  • pattern
    Analyze heading patterns to find candidates
  • evaluation
    Evaluate candidate headings with vision LLM
  • merge
    Merge results into enriched ToC
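
Linking has to tolerate OCR noise, since a heading on the page rarely matches the ToC entry character for character. One way to sketch the idea is fuzzy string matching with Python's standard `difflib`. This stands in for the pattern and vision-LLM evaluation steps above rather than reproducing them; the threshold, function names, and sample data are assumptions.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and strip punctuation so case and stray marks do not matter."""
    return " ".join(re.sub(r"[^a-z0-9 ]", " ", text.lower()).split())


def link_entries(titles, candidates_by_page, threshold=0.8):
    """Map each ToC title to the scan page whose candidate heading matches best.

    candidates_by_page: {scan_page_index: [candidate heading lines]}
    Titles with no candidate above the threshold map to None.
    """
    links = {}
    for title in titles:
        best_page, best_score = None, 0.0
        for page_index, lines in candidates_by_page.items():
            for line in lines:
                score = SequenceMatcher(None, normalize(title), normalize(line)).ratio()
                if score > best_score:
                    best_page, best_score = page_index, score
        links[title] = best_page if best_score >= threshold else None
    return links


candidates = {
    31: ["THE ROAD TO WAR"],
    72: ["CROSSlNG THE RIVER"],  # OCR read the capital I as a lowercase l
    73: ["The river was high that spring."],
}
links = link_entries(["The Road to War", "Crossing the River", "Epilogue"], candidates)
print(links)
```

Entries left as `None` are the ones worth escalating to a more expensive check, rather than guessing a page.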
🏗️

Build Structure

Assemble unified document structure with chapter text and metadata

  • build_structure
    Build document skeleton from ToC and detected headings
  • polish_entries
    Extract and polish chapter text with LLM (parallel)
  • merge
    Merge polished entries into final structure.json
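
With every ToC entry linked to a starting page, the skeleton follows from simple arithmetic: each chapter runs until the page before the next one starts. The sketch below shows that step only. The field names and the shape of the resulting JSON are assumptions, not the real structure.json schema.

```python
import json


def build_skeleton(linked_entries, total_pages):
    """Turn (title, start_page) pairs into chapters with inclusive page ranges.

    Pages are zero-based scan indices; the last chapter runs to the final page.
    """
    ordered = sorted(linked_entries, key=lambda entry: entry[1])
    chapters = []
    for position, (title, start) in enumerate(ordered):
        if position + 1 < len(ordered):
            end = ordered[position + 1][1] - 1
        else:
            end = total_pages - 1
        chapters.append({"title": title, "start_page": start, "end_page": end})
    return {"chapters": chapters}


skeleton = build_skeleton(
    [("Crossing the River", 72), ("The Road to War", 31)],
    total_pages=120,
)
print(json.dumps(skeleton, indent=2))
```

Because each range is self-contained, the per-chapter polishing that follows can run in parallel without chapters depending on one another.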
📖

Generate Output

Create ePub files, audiobook scripts, or structured API output

  • generate_epub
    Build and validate ePub 3.0 file
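
For orientation, an ePub is a ZIP archive with a fixed layout: an uncompressed `mimetype` entry first, a `META-INF/container.xml` pointing at the package document, then the content files. The sketch below assembles a minimal ePub 3 container with the standard library. It is an illustration under assumptions (the `OEBPS/` folder, file names, and `build_epub` signature are invented here) and performs no validation, so a real build would still need a checker such as EPUBCheck.

```python
import io
import zipfile
from xml.sax.saxutils import escape

CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""


def xhtml(title, body):
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<html xmlns="http://www.w3.org/1999/xhtml" '
        'xmlns:epub="http://www.idpf.org/2007/ops">'
        f"<head><title>{escape(title)}</title></head><body>{body}</body></html>"
    )


def build_epub(title, chapters, book_id, modified):
    """chapters: list of (chapter_title, plain_text). Returns the ePub as bytes."""
    names = [f"chapter{i + 1}.xhtml" for i in range(len(chapters))]
    manifest = "".join(
        f'<item id="c{i + 1}" href="{name}" media-type="application/xhtml+xml"/>'
        for i, name in enumerate(names)
    )
    spine = "".join(f'<itemref idref="c{i + 1}"/>' for i in range(len(names)))
    nav_items = "".join(
        f'<li><a href="{name}">{escape(chapter_title)}</a></li>'
        for name, (chapter_title, _) in zip(names, chapters)
    )
    opf = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<package xmlns="http://www.idpf.org/2007/opf" version="3.0" '
        'unique-identifier="book-id">'
        '<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">'
        f'<dc:identifier id="book-id">{escape(book_id)}</dc:identifier>'
        f"<dc:title>{escape(title)}</dc:title><dc:language>en</dc:language>"
        f'<meta property="dcterms:modified">{modified}</meta></metadata>'
        '<manifest><item id="nav" href="nav.xhtml" '
        'media-type="application/xhtml+xml" properties="nav"/>'
        f"{manifest}</manifest><spine>{spine}</spine></package>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # The mimetype entry must come first and must not be compressed.
        archive.writestr("mimetype", "application/epub+zip",
                         compress_type=zipfile.ZIP_STORED)
        deflated = zipfile.ZIP_DEFLATED
        archive.writestr("META-INF/container.xml", CONTAINER_XML, compress_type=deflated)
        archive.writestr("OEBPS/content.opf", opf, compress_type=deflated)
        nav_body = f'<nav epub:type="toc"><ol>{nav_items}</ol></nav>'
        archive.writestr("OEBPS/nav.xhtml", xhtml(title, nav_body), compress_type=deflated)
        for name, (chapter_title, text) in zip(names, chapters):
            # Blank lines in the plain text become paragraph breaks.
            paragraphs = "".join(
                f"<p>{escape(part.strip())}</p>"
                for part in text.split("\n\n") if part.strip()
            )
            body = f"<h1>{escape(chapter_title)}</h1>{paragraphs}"
            archive.writestr(f"OEBPS/{name}", xhtml(chapter_title, body),
                             compress_type=deflated)
    return buffer.getvalue()


epub_bytes = build_epub(
    "Example Book",
    [("The Road to War", "The column moved at dawn.\n\nSupplies ran low.")],
    book_id="urn:uuid:00000000-0000-0000-0000-000000000000",  # placeholder identifier
    modified="2024-01-01T00:00:00Z",
)
```

The output contains only headings and paragraphs. Leaving out page numbers, running headers, and orphaned footnote markers at this point is what lets the text flow when read aloud.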