shelf - Scans to structured text

Scans to structured text

Not every book is available as an audiobook. Military history, academic texts, out-of-print nonfiction—if you want to listen to them, you have to make them yourself.

Raw OCR gets you text, but text mixed with garbage—page numbers read aloud, running headers interrupting every page, image captions scattered through the text, footnote numbers with no footnotes. The result is technically an audiobook, but not one you'd want to listen to.

Matching the quality of a professionally produced Audible audiobook, with no issues a listener would notice, turns out to be an unsolved problem. Those audiobooks have human editors. Reaching that quality with software alone, with little to no human intervention, is the challenge shelf is designed to solve.

Processing Pipeline

Once you have a book scan (from either a sheet-fed scanner after cutting the spine or a non-destructive overhead scanner), shelf runs it through a multi-stage AI pipeline. Each stage builds on the last: multiple OCR providers extract the text, LLMs classify and label the structure (body text vs. headers vs. footnotes vs. page numbers), the table of contents is extracted and linked to actual pages, the chapter structure is assembled, and finally clean ePub files are produced, with text that flows when read aloud.
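The "each stage builds on the last" idea can be sketched as a list of functions that pass a shared state along. This is only an illustration: the stage names, the `state` dictionary, and the placeholder stage bodies are assumptions, not shelf's actual code.

```python
from typing import Callable

# A stage takes the accumulated book state and returns an updated copy.
Stage = Callable[[dict], dict]


def run_pipeline(state: dict, stages: list[tuple[str, Stage]]) -> dict:
    """Run stages in order so each one sees the output of the previous one."""
    for name, stage in stages:
        state = stage(state)
        state = {**state, "completed": state.get("completed", []) + [name]}
    return state


# Placeholder stages: the names mirror the stages listed in this document,
# but the bodies are stand-ins for illustration only.
def ocr_pages(state: dict) -> dict:
    return {**state, "pages": ["CHAPTER ONE", "Body text of the first chapter."]}


def label_structure(state: dict) -> dict:
    labels = ["header" if page.isupper() else "body" for page in state["pages"]]
    return {**state, "labels": labels}


STAGES = [("ocr-pages", ocr_pages), ("label-structure", label_structure)]

result = run_pipeline({"scan_dir": "scans/my-book"}, STAGES)
print(result["completed"])
print(result["labels"])
```

Recording the completed stage names in the state is one simple way to make a run resumable after a failure.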

📷

OCR Pages

Extract text from scanned page images using vision AI models

mistral: Extract text using Mistral vision model
olm: Extract text using OLM OCR model
paddle: Extract text using PaddleOCR
blend: Combine OCR outputs into best-quality text
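How the blend step combines outputs is not specified here. As one hedged illustration, the sketch below picks, for a single page, the provider whose text agrees most with the others, using the standard library's `difflib`. The function name `pick_consensus` and the sample strings are hypothetical, and a real blend could just as well merge text line by line or use an LLM.

```python
from difflib import SequenceMatcher


def pick_consensus(outputs: dict[str, str]) -> str:
    """Return the name of the provider whose text agrees most with the rest."""
    if len(outputs) == 1:
        return next(iter(outputs))

    def agreement(name: str) -> float:
        others = [text for other, text in outputs.items() if other != name]
        scores = [SequenceMatcher(None, outputs[name], text).ratio() for text in others]
        return sum(scores) / len(scores)

    return max(outputs, key=agreement)


# Made-up OCR results for one page; the third contains typical OCR confusions.
page = {
    "mistral": "The regiment crossed the river at dawn.",
    "olm": "The regiment crossed the river at dawn.",
    "paddle": "Tbe reg1ment cr0ssed tlie river at dawn.",
}
best = pick_consensus(page)
print(best, "->", page[best])
```

Choosing the output closest to the others works because independent OCR engines rarely make the same mistake in the same place.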

🏷️

Label Structure

Classify content blocks as body text, headers, footnotes, or page numbers

mechanical: Extract patterns using regex and heuristics
unified: Classify page elements with vision LLM
gap_analysis: Identify pages with missing or uncertain labels
agent_healing: Fix classification gaps using LLM agent
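The mechanical step is described only as "regex and heuristics". A minimal sketch of what such rules might look like follows: a regex flags lines that contain nothing but a page number, and a repetition heuristic flags a first line that recurs across pages as a running header. The label names, the `min_repeats` threshold, and the sample pages are assumptions for illustration.

```python
import re
from collections import Counter

# A line holding only an Arabic number or a Roman numeral. Note that the
# Roman branch also matches a few real words (for example "mix"), which is
# the kind of case a later LLM pass would need to catch.
PAGE_NUMBER = re.compile(r"^\s*(\d{1,4}|[ivxlcdm]+)\s*$", re.IGNORECASE)


def label_lines(pages: list[list[str]], min_repeats: int = 3) -> list[list[str]]:
    """Label each line of each page as 'page_number', 'header', or 'body'."""
    # A first line that recurs on several pages is probably a running header.
    first_lines = Counter(page[0].strip() for page in pages if page)
    running = {line for line, count in first_lines.items() if count >= min_repeats}

    labeled = []
    for page in pages:
        labels = []
        for index, line in enumerate(page):
            if PAGE_NUMBER.match(line):
                labels.append("page_number")
            elif index == 0 and line.strip() in running:
                labels.append("header")
            else:
                labels.append("body")
        labeled.append(labels)
    return labeled


pages = [
    ["THE WINTER CAMPAIGN", "The army moved north.", "12"],
    ["THE WINTER CAMPAIGN", "Supplies ran short.", "13"],
    ["THE WINTER CAMPAIGN", "Snow closed the pass.", "14"],
]
labels = label_lines(pages)
print(labels[0])
```

Cheap rules like these can settle the obvious cases, leaving only ambiguous pages for the vision LLM and gap analysis.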

📑

Extract ToC

Identify and extract the table of contents from OCR output

find: Locate ToC pages using vision agent
extract: Parse ToC entries from identified pages
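To make the extract step concrete, here is a hedged sketch that parses typical "Title ..... 23" lines with a regex. It assumes the ToC pages are already available as plain text lines and handles only Arabic page numbers. The `TocEntry` type and `parse_toc` function are hypothetical names, and shelf's own extraction may rely on an LLM rather than a pattern like this.

```python
import re
from dataclasses import dataclass

# Title, then a run of dot leaders and/or spaces, then a trailing page number.
# Roman-numeral front-matter pages are deliberately left out of this sketch.
TOC_LINE = re.compile(r"^(?P<title>.+?)[\s.]+(?P<page>\d+)\s*$")


@dataclass
class TocEntry:
    title: str
    printed_page: int


def parse_toc(lines: list[str]) -> list[TocEntry]:
    """Keep the lines that look like ToC entries and skip everything else."""
    entries = []
    for line in lines:
        match = TOC_LINE.match(line.strip())
        if match:
            entries.append(TocEntry(match["title"].strip(), int(match["page"])))
    return entries


entries = parse_toc([
    "Contents",
    "Introduction . . . . . 1",
    "Chapter 1. The Road North .......... 23",
])
for entry in entries:
    print(entry)
```

The page number captured here is the printed one, which usually differs from the scan's page index. That gap is why a separate linking stage is needed.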

🔗

Link ToC

Map table of contents entries to their corresponding page numbers

find_entries: Locate each ToC entry in page content
pattern: Analyze heading patterns to find candidates
evaluation: Evaluate candidate headings with vision LLM
merge: Merge results into enriched ToC
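Linking can be pictured as matching each ToC title against candidate headings found in the scan. The sketch below uses fuzzy string matching from `difflib` so that small OCR errors do not prevent a match. The function names, the 0.8 threshold, and the sample data are assumptions, and the vision LLM evaluation mentioned above is not modeled.

```python
from difflib import SequenceMatcher
from typing import Optional


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so layout differences do not matter."""
    return " ".join(text.lower().split())


def link_entries(
    toc_titles: list[str],
    page_headings: dict[int, str],
    threshold: float = 0.8,
) -> dict[str, Optional[int]]:
    """Map each ToC title to the scan page whose heading matches it best."""
    links: dict[str, Optional[int]] = {}
    for title in toc_titles:
        best_page, best_score = None, 0.0
        for page_index, heading in page_headings.items():
            score = SequenceMatcher(None, normalize(title), normalize(heading)).ratio()
            if score > best_score:
                best_page, best_score = page_index, score
        # Leave weak matches unlinked rather than guessing.
        links[title] = best_page if best_score >= threshold else None
    return links


# Candidate headings keyed by scan page index; "R0AD" simulates an OCR error.
headings = {12: "Preface", 31: "THE R0AD NORTH", 58: "WINTER QUARTERS"}
links = link_entries(["The Road North", "Winter Quarters", "Epilogue"], headings)
print(links)
```

Entries left as `None` are the ones a later evaluation step would need to resolve.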

🏗️

Build Structure

Assemble unified document structure with chapter text and metadata

build_structure: Build document skeleton from ToC and detected headings
polish_entries: Extract and polish chapter text with LLM (parallel)
merge: Merge polished entries into final structure.json
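Once ToC entries are linked to pages, building the skeleton amounts to turning start pages into page ranges. The sketch below assumes zero-based scan page indices and one text string per page. The dictionary keys are hypothetical and do not describe the real structure.json schema, and the LLM polishing step is omitted.

```python
def build_chapters(linked_toc: list[tuple[str, int]], pages: list[str]) -> list[dict]:
    """Turn (title, start_page) pairs into chapters holding their page text."""
    ordered = sorted(linked_toc, key=lambda entry: entry[1])
    chapters = []
    for index, (title, start) in enumerate(ordered):
        # A chapter runs until the next one starts, or to the end of the book.
        end = ordered[index + 1][1] if index + 1 < len(ordered) else len(pages)
        chapters.append({
            "title": title,
            "start": start,
            "end": end,  # exclusive
            "text": "\n".join(pages[start:end]),
        })
    return chapters


pages = ["front matter", "Intro page 1", "Intro page 2", "Chapter 1 page 1"]
chapters = build_chapters([("Chapter 1", 3), ("Introduction", 1)], pages)
for chapter in chapters:
    print(chapter["title"], chapter["start"], chapter["end"])
```

Because each chapter ends up as an independent unit of text, per-chapter polishing can run in parallel, as the polish_entries step notes.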

📖

Generate Output

Create ePub files, audiobook scripts, or structured API output

generate_epub: Build and validate ePub 3.0 file
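For readers unfamiliar with the format, an ePub is a ZIP archive with a fixed layout: an uncompressed `mimetype` entry first, a `META-INF/container.xml` pointing at the package document, and XHTML content files plus a navigation document. The sketch below writes a minimal archive of that shape with the standard library. The file names under `OEBPS/` and the function `write_epub` are assumptions, the language is hard-coded to English, and validation (for example with a tool such as EPUBCheck) is not covered.

```python
import io
import uuid
import zipfile
from datetime import datetime, timezone
from html import escape

CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""


def xhtml(title: str, body: str) -> str:
    return f"""<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops">
<head><title>{escape(title)}</title></head>
<body>{body}</body>
</html>
"""


def write_epub(target, book_title: str, chapters: list[tuple[str, str]]) -> None:
    """Write chapters, given as (title, plain_text) pairs, to a path or file object."""
    names = [f"chapter{i}.xhtml" for i in range(1, len(chapters) + 1)]
    manifest = "\n    ".join(
        f'<item id="c{i}" href="{name}" media-type="application/xhtml+xml"/>'
        for i, name in enumerate(names, 1)
    )
    spine = "".join(f'<itemref idref="c{i}"/>' for i in range(1, len(names) + 1))
    nav_items = "".join(
        f'<li><a href="{name}">{escape(title)}</a></li>'
        for name, (title, _) in zip(names, chapters)
    )
    modified = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    opf = f"""<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="book-id">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="book-id">urn:uuid:{uuid.uuid4()}</dc:identifier>
    <dc:title>{escape(book_title)}</dc:title>
    <dc:language>en</dc:language>
    <meta property="dcterms:modified">{modified}</meta>
  </metadata>
  <manifest>
    <item id="nav" href="nav.xhtml" media-type="application/xhtml+xml" properties="nav"/>
    {manifest}
  </manifest>
  <spine>{spine}</spine>
</package>
"""
    with zipfile.ZipFile(target, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        # The mimetype entry must come first and must not be compressed.
        archive.writestr("mimetype", "application/epub+zip", compress_type=zipfile.ZIP_STORED)
        archive.writestr("META-INF/container.xml", CONTAINER_XML)
        archive.writestr("OEBPS/content.opf", opf)
        nav_body = f'<nav epub:type="toc"><h1>Contents</h1><ol>{nav_items}</ol></nav>'
        archive.writestr("OEBPS/nav.xhtml", xhtml("Contents", nav_body))
        for name, (title, text) in zip(names, chapters):
            paragraphs = "".join(
                f"<p>{escape(p.strip())}</p>" for p in text.split("\n\n") if p.strip()
            )
            body = f"<h1>{escape(title)}</h1>{paragraphs}"
            archive.writestr(f"OEBPS/{name}", xhtml(title, body))


buffer = io.BytesIO()
write_epub(buffer, "Demo Book", [("Introduction", "First paragraph.\n\nSecond paragraph.")])
names = zipfile.ZipFile(buffer).namelist()
print(names)
```

Passing a file path instead of the in-memory buffer writes the archive to disk.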