Indian invoices are uniquely challenging for document AI. GSTINs, CGST/SGST/IGST splits, HSN codes, and the sheer variety of formats across states and businesses cause most off-the-shelf OCR tools to fail. We built a custom pipeline that handles this. Here is what we learned.
Why Indian Invoices Are Hard
Unlike standardized invoices in the US or EU, Indian invoices have:
- GSTINs: 15-character alphanumeric codes that follow a specific format (2-digit state code + 10-character PAN + entity number + default "Z" + check character)
- Multiple tax components: CGST + SGST for intra-state supplies, IGST for inter-state supplies, plus cess on certain items
- HSN/SAC codes: product/service classification codes (HSN codes run to 4, 6, or 8 digits) required on every line item
- Diverse formats: every business uses different invoice software, layouts, and fonts, and some invoices are handwritten
- Regional languages: invoices in Hindi, Marathi, or Tamil alongside English
- Poor scan quality: faded thermal prints, stamped receipts, photographed documents
Our Three-Stage Pipeline
Stage 1: OCR with Pre-processing
Raw OCR on Indian documents gives 70-80% accuracy. We push this to 95%+ with pre-processing:
- Deskew and rotation correction
- Contrast enhancement for faded prints
- Noise removal for scanned documents
- Text region detection before OCR (avoids processing logos, images)
We use Tesseract with custom-trained models for Hindi/Marathi, plus Google Vision API as a fallback for difficult documents.
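The fallback logic can be sketched as follows. This is a minimal illustration, not our production code: `OcrResult`, `ocr_with_fallback`, and the 0.85 confidence threshold are hypothetical, and the two engines are passed in as plain callables so the sketch stays independent of any particular OCR library.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class OcrResult:
    text: str
    mean_confidence: float  # assumed to be normalised to the range 0.0-1.0


# Any function that takes image bytes and returns an OcrResult.
OcrEngine = Callable[[bytes], OcrResult]


def ocr_with_fallback(
    image: bytes,
    primary: OcrEngine,
    fallback: OcrEngine,
    min_confidence: float = 0.85,  # hypothetical threshold, tune on real data
) -> OcrResult:
    """Run the primary engine; use the fallback only for difficult documents."""
    result = primary(image)
    if result.text.strip() and result.mean_confidence >= min_confidence:
        return result
    # Empty or low-confidence output: hand the document to the fallback engine.
    return fallback(image)
```

Keeping the engines behind a common callable signature means the paid fallback is only invoked when the primary result looks unreliable.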
Stage 2: LLM-Powered Field Extraction
This stage does most of the work. Instead of relying on regex patterns alone (which break on every new invoice format), we feed the OCR text to an LLM with a structured prompt:
- Extract seller GSTIN, buyer GSTIN
- Extract invoice number, date
- Extract each line item: description, HSN code, quantity, rate, amount
- Extract CGST, SGST, IGST amounts and rates
- Extract total, round-off, grand total
The LLM handles format variations, abbreviations, and messy OCR artifacts that tend to break rule-based systems.
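As an illustration of what a structured prompt and its parsing step might look like, here is a sketch. The prompt wording, the JSON key names, and the `REQUIRED_KEYS` set are assumptions made for this example; the actual API call to the model is left out, so `parse_extraction` simply takes the model's reply as a string.

```python
import json
from decimal import Decimal

# Hypothetical prompt wording and key names, mirroring the field list above.
PROMPT_HEADER = """Extract the following fields from the invoice text.
Reply with a single JSON object and use null for anything you cannot find.
Keys: seller_gstin, buyer_gstin, invoice_number, invoice_date,
line_items (list of objects: description, hsn_code, quantity, rate, amount),
cgst_rate, cgst_amount, sgst_rate, sgst_amount, igst_rate, igst_amount,
total, round_off, grand_total

Invoice text:
"""

# Assumption: these keys must be present for the record to be usable.
REQUIRED_KEYS = {"seller_gstin", "invoice_number", "invoice_date",
                 "line_items", "grand_total"}


def build_extraction_prompt(ocr_text: str) -> str:
    """Attach the OCR text to the fixed instruction header."""
    return PROMPT_HEADER + ocr_text.strip()


def parse_extraction(reply: str) -> dict:
    """Pull the JSON object out of a model reply and check required keys."""
    # Tolerate chatter or code fences around the JSON object.
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model reply")
    # Parse decimals exactly so later arithmetic checks avoid float error.
    data = json.loads(reply[start:end + 1], parse_float=Decimal)
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data
```

Parsing amounts as `Decimal` rather than `float` keeps the later sum checks exact to the paisa.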
Stage 3: Validation and Cross-checking
We never trust the LLM output blindly:
- GSTIN validation: check the format, verify the check digit, validate the state code
- Math validation: do line items sum to the subtotal? Does subtotal + tax = total?
- Tax rate validation: does the combined rate (CGST + SGST, or IGST on its own) match a valid GST slab (0%, 5%, 12%, 18%, 28%), with CGST and SGST each at half of it? An 18% slab, for example, appears as CGST 9% + SGST 9%
- Date validation: is the invoice date in a valid range?
- Confidence scoring: flag low-confidence extractions for human review
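The GSTIN check from the list above can be sketched like this. It assumes the regular-taxpayer layout (state code, PAN, entity number, default "Z", check character) and the usual base-36 check-character calculation with alternating weights of 1 and 2. The function names are illustrative, and the state-code set (01-38 plus 97) is an assumption that should be confirmed against the current official list.

```python
import re

# Regular-taxpayer layout: 2-digit state code, 10-character PAN,
# 1-character entity number, default 'Z', 1 check character.
_GSTIN_RE = re.compile(r"^[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z][1-9A-Z]Z[0-9A-Z]$")
_CHARSET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"

# Assumption: codes 01-38 plus 97; verify against the official list.
_STATE_CODES = {f"{n:02d}" for n in range(1, 39)} | {"97"}


def gstin_check_char(first14: str) -> str:
    """Compute the 15th character from the first 14 characters."""
    total = 0
    for index, char in enumerate(first14):
        # Weights alternate 1, 2, 1, 2, ... from the left.
        product = _CHARSET.index(char) * (2 if index % 2 else 1)
        # Add the base-36 "digits" of the product (quotient + remainder).
        total += product // 36 + product % 36
    return _CHARSET[(36 - total % 36) % 36]


def validate_gstin(gstin: str) -> list[str]:
    """Return the failed checks; an empty list means the GSTIN passed."""
    gstin = gstin.strip().upper()
    if not _GSTIN_RE.match(gstin):
        return ["format"]
    problems = []
    if gstin[:2] not in _STATE_CODES:
        problems.append("state_code")
    if gstin_check_char(gstin[:14]) != gstin[14]:
        problems.append("check_digit")
    return problems
```

Returning a list of failed checks rather than a boolean makes it easier to route a record to human review with a reason attached. A single misread character from OCR usually shows up as a `check_digit` failure.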
Handling the Edge Cases
Spaced characters in scanned PDFs
A common OCR artifact is "G S T I N : 2 7 A A A C X 1 2 3 4 E 1 Z 5" instead of "GSTIN: 27AAACX1234E1Z5". We normalize the text before extraction by removing the spaces inside runs of single characters.
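A minimal sketch of this normalization, assuming a run counts as "spaced out" once it contains at least four single characters separated by single spaces (the threshold and the function name are illustrative choices):

```python
import re

# A run of at least four single non-space characters, each separated by
# exactly one space, bounded by whitespace or the start/end of the text.
_SPACED_RUN = re.compile(r"(?<!\S)(?:\S ){3,}\S(?!\S)")


def collapse_spaced_characters(text: str) -> str:
    """Remove the spaces inside runs of single spaced-out characters."""
    return _SPACED_RUN.sub(lambda m: m.group(0).replace(" ", ""), text)
```

Note that the collapsed run also loses the space after the colon ("GSTIN:27AAACX1234E1Z5"), so downstream extraction should not depend on it. Ordinary text is left alone because its words are longer than one character.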
CGST amount vs CGST rate confusion
When the invoice says "CGST @ 9% Rs. 3,487.50", a naive regex grabs "9" as the amount. Our pipeline instead looks for the amount after the percentage sign.
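One way to express this is a single pattern that captures the rate before the percentage sign and the amount after it. The function name and the exact pattern are assumptions for illustration; real invoices will need more label variants than this sketch covers.

```python
import re
from decimal import Decimal
from typing import Optional


def extract_tax_line(text: str, tax: str = "CGST") -> Optional[tuple[Decimal, Decimal]]:
    """Return (rate, amount) for a line such as 'CGST @ 9% Rs. 3,487.50'."""
    pattern = re.compile(
        # Tax label, then the rate immediately before the percent sign.
        rf"\b{re.escape(tax)}\b[^\d\n%]*(?P<rate>\d+(?:\.\d+)?)\s*%"
        # Skip currency markers on the same line, then take the amount.
        r"[^\d\n]*(?P<amount>\d[\d,]*(?:\.\d+)?)",
        re.IGNORECASE,
    )
    match = pattern.search(text)
    if match is None:
        return None
    rate = Decimal(match["rate"])
    amount = Decimal(match["amount"].replace(",", ""))
    return rate, amount
```

If the line has no percentage at all, the function returns `None` instead of guessing, which leaves the decision to the validation stage.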
Sub Total vs Grand Total
Many invoices have "Sub Total", "Total", "Grand Total", and "Net Amount" on the same page, each meaning something different. We use negative lookbehind patterns (so that "Total" does not match inside "Sub Total") and cross-reference the candidates with the mathematical sum to identify the correct total.
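The combination of a lookbehind and an arithmetic cross-check can be sketched as follows. The label list, the one-rupee tolerance for round-off, and the function name are assumptions for this example.

```python
import re
from decimal import Decimal
from typing import Optional

# First number after the label on the same line.
_AMOUNT = r"[^\d\n]*?(\d[\d,]*(?:\.\d+)?)"
_SUBTOTAL = re.compile(r"\bSub ?Total\b" + _AMOUNT, re.IGNORECASE)
# The negative lookbehind stops "Total" from matching inside "Sub Total".
_FINAL = re.compile(
    r"(?<!Sub )\b(Grand Total|Net Amount|Total)\b" + _AMOUNT, re.IGNORECASE
)


def _to_decimal(raw: str) -> Decimal:
    return Decimal(raw.replace(",", ""))


def find_grand_total(
    text: str,
    tax_amounts: list[Decimal],
    tolerance: Decimal = Decimal("1.00"),  # assumed allowance for round-off
) -> Optional[tuple[str, Decimal]]:
    """Return (label, amount) of the total that equals subtotal + taxes."""
    subtotal = _SUBTOTAL.search(text)
    if subtotal is None:
        return None
    expected = _to_decimal(subtotal.group(1)) + sum(tax_amounts, Decimal("0"))
    candidates = [
        (m.group(1).title(), _to_decimal(m.group(2)))
        for m in _FINAL.finditer(text)
    ]
    # Keep only the candidates that agree with the arithmetic.
    matching = [c for c in candidates if abs(c[1] - expected) <= tolerance]
    # If several labels agree, prefer the one printed last on the page.
    return matching[-1] if matching else None
```

A line such as "Total Tax 6,975.00" still matches the label pattern, but it is rejected because its amount does not equal subtotal plus taxes. That is why the arithmetic check is needed on top of the lookbehind.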
Results
- 99.4% field-level accuracy on structured invoices (Tally, Zoho, Busy-generated)
- 96% accuracy on handwritten/informal invoices
- Processing time: 2-4 seconds per invoice
- 14M+ documents processed in production
Tech Stack
- OCR: Tesseract + Google Vision API
- LLM: Claude API for extraction
- Validation: Python with custom GSTIN/HSN validators
- Queue: Redis for async processing
- Storage: AWS S3 for documents, PostgreSQL for extracted data
Want This for Your Business?
If you process Indian invoices at scale — whether for accounting, GST filing, or ERP integration — we can build a custom pipeline for your specific document types. We have done this for healthcare billing, manufacturing purchase orders, and e-commerce vendor invoices.
First call is free. We will look at a sample of your invoices and tell you what accuracy to expect.
