Building a Document AI Pipeline That Actually Scales
OCR, layout detection, structured extraction, EHR integration. How we process 14M+ docs/year for a hospital network.
Processing 14 million documents per year for a 600-bed hospital network taught us that document AI is 20% model work and 80% engineering.
The pipeline
Four stages, each with its own failure modes:
Stage 1: Ingestion
Documents arrive as scanned PDFs, photos from phones, faxes, and direct digital uploads. We normalize everything to high-res PNGs with deskewing, denoising, and adaptive contrast. This stage handles 40+ input formats.
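To make the cleanup steps concrete, here is a toy sketch of two of them on a grayscale image stored as nested lists. It is not the production code. The 3x3 median filter stands in for denoising, and a global min-max stretch is a simpler substitute for adaptive contrast. Both function names are hypothetical, and deskewing is left out.

```python
from statistics import median

def median_denoise(img):
    """3x3 median filter; border pixels use whichever neighbors exist."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [
                img[j][i]
                for j in range(max(0, y - 1), min(h, y + 2))
                for i in range(max(0, x - 1), min(w, x + 2))
            ]
            row.append(int(median(window)))
        out.append(row)
    return out

def stretch_contrast(img):
    """Global min-max stretch to 0..255 (a stand-in for adaptive contrast)."""
    lo = min(min(row) for row in img)
    hi = max(max(row) for row in img)
    if hi == lo:
        # Flat image: nothing to stretch, return a copy unchanged.
        return [row[:] for row in img]
    return [[round((p - lo) * 255 / (hi - lo)) for p in row] for row in img]

# A single bright speck on a uniform background is removed by the median filter.
noisy = [[100, 100, 100], [100, 255, 100], [100, 100, 100]]
clean = median_denoise(noisy)
```

A real implementation would more likely use an image library and a locally adaptive method; the point here is only the order of operations: denoise first, then adjust contrast.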
Stage 2: Layout detection
Before OCR, we detect the document structure. Where are the tables? The headers? Which regions are handwritten and which are printed? We use a LayoutLM-based model fine-tuned on 5,000 annotated medical forms. This step is what makes extraction reliable: without it, OCR returns a jumble of text with no indication of which field each word belongs to.
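To illustrate why layout comes first, the sketch below assigns OCR words to detected regions by bounding box. It assumes the layout model emits labeled rectangles and the OCR engine emits words with coordinates. The `Box` type, the `group_words` function, and the sample coordinates are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Box:
    label: str   # region type from layout detection, e.g. "header" or "table"
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, x, y):
        return self.x0 <= x <= self.x1 and self.y0 <= y <= self.y1

def group_words(words, regions):
    """Assign each (text, x, y) word to the first region that contains it."""
    grouped = {r.label: [] for r in regions}
    grouped["unassigned"] = []
    # Sort top-to-bottom, then left-to-right, to approximate reading order.
    for text, x, y in sorted(words, key=lambda w: (w[2], w[1])):
        for region in regions:
            if region.contains(x, y):
                grouped[region.label].append(text)
                break
        else:
            grouped["unassigned"].append(text)
    return grouped

regions = [Box("header", 0, 0, 100, 20), Box("table", 0, 30, 100, 80)]
words = [("Dose", 10, 40), ("Patient", 5, 10), ("500mg", 50, 40),
         ("Name", 40, 10), ("stray", 200, 200)]
grouped = group_words(words, regions)
```

Without the regions, the same five words would come back as one flat list with no way to tell header text from table cells.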
Stage 3: Extraction
Extraction combines three components: OCR for printed text (Tesseract with custom post-processing), a separate handwriting recognition model for handwritten fields, and the Claude API for interpreting ambiguous or poorly scanned text. The LLM step is the fallback, not the primary extractor: it handles the 15% of fields that rule-based extraction can't.
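One way to wire up a primary-then-fallback flow like this is sketched below. It assumes each extractor returns text plus a confidence score and that low confidence triggers the fallback. The threshold value, the `Region` type, and the stub extractors are hypothetical; no real OCR or LLM API is called.

```python
from dataclasses import dataclass

# Hypothetical cutoff: below this confidence, the field goes to the fallback.
FALLBACK_THRESHOLD = 0.80

@dataclass
class Region:
    kind: str       # "printed" or "handwritten", as labeled by layout detection
    image_ref: str  # placeholder for the cropped image data

def extract_field(region, primary, fallback, threshold=FALLBACK_THRESHOLD):
    """Try the primary extractor for this region kind; fall back on low confidence."""
    extractor = primary.get(region.kind)
    if extractor is not None:
        text, confidence = extractor(region)
        if confidence >= threshold:
            return text, region.kind
    # No primary extractor for this kind, or its confidence was too low.
    text, _ = fallback(region)
    return text, "llm_fallback"

# Stub extractors standing in for the OCR, handwriting, and LLM steps.
def fake_ocr(region):
    return "Metformin 500 mg", 0.97

def fake_handwriting(region):
    return "Mctform1n", 0.41

def fake_llm(region):
    return "Metformin", 0.90

PRIMARY = {"printed": fake_ocr, "handwritten": fake_handwriting}

printed_result = extract_field(Region("printed", "crop-1"), PRIMARY, fake_llm)
handwritten_result = extract_field(Region("handwritten", "crop-2"), PRIMARY, fake_llm)
```

Returning the source of each value alongside the text makes it possible to track what share of fields actually reach the fallback.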
Stage 4: Validation + Integration
Every extracted field is validated: patient IDs against the hospital's master index, medication names against RxNorm, dates for logical consistency. Validated data is written to the Epic EHR through its HL7 FHIR API. Anything that fails validation is queued for human review.
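The validate-then-route step might look something like the sketch below. In-memory sets stand in for the master patient index and the RxNorm vocabulary; a real system would query those services. The field names, the date rule, and the route labels are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ExtractedDoc:
    patient_id: str
    medication: str
    date_of_birth: date
    admission_date: date

def validate(doc, known_patient_ids, known_medications):
    """Return a list of validation errors; an empty list means the document passed."""
    errors = []
    if doc.patient_id not in known_patient_ids:
        errors.append("patient_id not in master index")
    if doc.medication.lower() not in known_medications:
        errors.append("medication not in vocabulary")
    if doc.date_of_birth > doc.admission_date:
        errors.append("date_of_birth after admission_date")
    return errors

def route(doc, known_patient_ids, known_medications):
    """Send clean documents to the EHR writer and everything else to human review."""
    errors = validate(doc, known_patient_ids, known_medications)
    return ("human_review", errors) if errors else ("write_to_ehr", errors)

# Hypothetical lookup data standing in for the real index and vocabulary.
PATIENTS = {"P-1001", "P-1002"}
MEDICATIONS = {"metformin", "lisinopril"}

good = ExtractedDoc("P-1001", "Metformin", date(1960, 5, 1), date(2024, 3, 2))
bad = ExtractedDoc("P-9999", "Metformin", date(2025, 1, 1), date(2024, 3, 2))
```

Collecting every error instead of stopping at the first gives the human reviewer the full picture for a failed document.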
Scaling considerations
- Queue-based architecture. Each stage is a separate service with its own queue. If extraction is slow, ingestion doesn't stop.
- Horizontal scaling. We run 8 extraction workers in parallel during peak hours (morning admissions) and scale down to 2 at night.
- Monitoring. Per-stage latency, accuracy, and queue depth dashboards. Alerts when accuracy drops below 98% on any field.
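The queue-per-stage idea and the time-based worker count can be sketched in a single process, with threads standing in for separate services. The function names are hypothetical, and the peak window (06:00 to 11:59) is an assumption, since the exact hours are not stated above.

```python
import queue
import threading

def workers_for_hour(hour, peak=8, off_peak=2):
    """Pick a worker count; the 6-11 peak window is an assumed value."""
    return peak if 6 <= hour < 12 else off_peak

def run_stage(in_q, out_q, work, n_workers):
    """Start n_workers threads that pull from in_q, apply work, and push to out_q."""
    def worker():
        while True:
            item = in_q.get()
            if item is None:  # sentinel: stop this worker
                break
            out_q.put(work(item))
    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    return threads

def stop_stage(in_q, threads):
    """Send one sentinel per worker, then wait for all of them to exit."""
    for _ in threads:
        in_q.put(None)
    for t in threads:
        t.join()

def run_pipeline(doc_ids, extraction_workers=2):
    # Each stage reads from its own queue, so a slow stage never blocks the one before it.
    ingest_q, extract_q, done_q = queue.Queue(), queue.Queue(), queue.Queue()
    ingest = run_stage(ingest_q, extract_q, lambda d: f"{d}:normalized", 1)
    extract = run_stage(extract_q, done_q, lambda d: f"{d}:extracted", extraction_workers)
    for doc_id in doc_ids:
        ingest_q.put(doc_id)
    stop_stage(ingest_q, ingest)
    stop_stage(extract_q, extract)
    results = []
    while not done_q.empty():
        results.append(done_q.get())
    return sorted(results)

results = run_pipeline(["a", "b", "c"], extraction_workers=workers_for_hour(8))
```

In production each queue would be a durable message broker rather than an in-memory `queue.Queue`, but the decoupling works the same way: ingestion keeps enqueueing while extraction catches up.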
The accuracy numbers
After 6 months in production: 99.4% accuracy on structured fields (patient name, DOB, ID), 96.8% on semi-structured fields (medications, diagnoses), and 94.1% end-to-end (all fields correct on a single document). The remaining 5.9% of documents go to human review.
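The difference between field-level and end-to-end accuracy is easy to see in a small calculation. The sketch below computes both from labeled evaluation records. The function name and the four-document sample are made up for illustration and are not the production evaluation data.

```python
def accuracy_report(predictions, ground_truth):
    """Return (per-field accuracy, share of documents with every field correct)."""
    fields = list(ground_truth[0].keys())
    total = len(ground_truth)
    per_field = {}
    for f in fields:
        correct = sum(p.get(f) == t[f] for p, t in zip(predictions, ground_truth))
        per_field[f] = correct / total
    docs_correct = sum(
        all(p.get(f) == t[f] for f in fields)
        for p, t in zip(predictions, ground_truth)
    )
    return per_field, docs_correct / total

# Toy data: one name error in the second document, one medication error in the third.
truth = [
    {"name": "A. Rivera", "medication": "metformin"},
    {"name": "B. Chen", "medication": "lisinopril"},
    {"name": "C. Okafor", "medication": "warfarin"},
    {"name": "D. Patel", "medication": "atorvastatin"},
]
predicted = [
    {"name": "A. Rivera", "medication": "metformin"},
    {"name": "B. Chan", "medication": "lisinopril"},
    {"name": "C. Okafor", "medication": "warfarin sodium"},
    {"name": "D. Patel", "medication": "atorvastatin"},
]
per_field, end_to_end = accuracy_report(predicted, truth)
```

In this toy set each field is 75% accurate, but only half the documents are fully correct, because the errors land on different documents. The same effect is why the end-to-end figure sits below every per-field figure.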
