How to Extract Data from Indian GST Invoices Using AI

Indian invoices are uniquely challenging for document AI. Between GSTINs, CGST/SGST/IGST splits, HSN codes, and the sheer variety of formats across states and businesses, most off-the-shelf OCR tools fail on them. We built a custom pipeline that handles this. Here is what we learned.

Why Indian Invoices Are Hard

Unlike standardized invoices in the US or EU, Indian invoices have:

GSTIN validation: 15-character alphanumeric codes with a fixed layout (2-digit state code + 10-character PAN + entity number + the letter Z by default + check character)
Multiple tax components: CGST + SGST for intra-state supplies, IGST for inter-state supplies, plus cess on certain items
HSN/SAC codes: product/service classification codes (HSN codes of 4, 6, or 8 digits for goods, SAC codes for services) expected on each line item
Diverse formats: every business uses different invoice software, layouts, and fonts, and some invoices are handwritten
Regional languages: invoices in Hindi, Marathi, or Tamil alongside English
Poor scan quality: faded thermal prints, stamped receipts, photographed documents
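
To make the GSTIN layout and the intra-state/inter-state split concrete, here is a minimal sketch. The names GstinParts, split_gstin, and expected_tax_components are hypothetical and not part of our pipeline. Comparing the two state codes is only a first approximation: the legally relevant comparison is supplier location against place of supply, and special cases such as SEZ supplies are ignored here.

```python
from typing import NamedTuple


class GstinParts(NamedTuple):
    state_code: str    # characters 1-2
    pan: str           # characters 3-12
    entity_code: str   # character 13
    default_char: str  # character 14, normally "Z"
    check_char: str    # character 15


def split_gstin(gstin: str) -> GstinParts:
    """Split a 15-character GSTIN into its positional parts (no validation)."""
    gstin = gstin.strip().upper()
    if len(gstin) != 15:
        raise ValueError("a GSTIN has exactly 15 characters")
    return GstinParts(gstin[:2], gstin[2:12], gstin[12], gstin[13], gstin[14])


def expected_tax_components(seller_gstin: str, buyer_gstin: str) -> set[str]:
    """Assumption: same state code means CGST + SGST, otherwise IGST."""
    if split_gstin(seller_gstin).state_code == split_gstin(buyer_gstin).state_code:
        return {"CGST", "SGST"}
    return {"IGST"}
```

A check like this is useful later as a sanity test: if both parties share a state code but the extraction reports only IGST, the invoice is worth a second look.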

Our Three-Stage Pipeline

Stage 1: OCR with Pre-processing

Raw OCR on Indian documents gives 70-80% accuracy. We push this to 95%+ with pre-processing:

Deskew and rotation correction
Contrast enhancement for faded prints
Noise removal for scanned documents
Text region detection before OCR (avoids processing logos, images)

We use Tesseract with custom-trained models for Hindi/Marathi, plus Google Vision API as a fallback for difficult documents.
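
The primary-plus-fallback arrangement can be sketched as below. This is an illustration under stated assumptions, not our production code: the two OCR engines are passed in as plain callables, OcrResult and ocr_with_fallback are hypothetical names, and the 80.0 confidence threshold is an arbitrary placeholder you would tune on your own documents.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class OcrResult:
    text: str
    confidence: float  # assumed: mean word confidence on a 0-100 scale


OcrEngine = Callable[[bytes], OcrResult]


def score(result: OcrResult) -> float:
    """Treat empty output as zero confidence so it never beats real text."""
    return result.confidence if result.text.strip() else 0.0


def ocr_with_fallback(
    image: bytes,
    primary: OcrEngine,
    fallback: OcrEngine,
    min_confidence: float = 80.0,  # hypothetical threshold
) -> OcrResult:
    """Run the primary engine; call the fallback only when confidence is low."""
    first = primary(image)
    if score(first) >= min_confidence:
        return first
    second = fallback(image)
    # Keep whichever engine was more confident about its own output.
    return second if score(second) > score(first) else first
```

Calling the fallback only on low-confidence pages keeps the paid API out of the common path, which matters at volume.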

Stage 2: LLM-Powered Field Extraction

This is where the magic happens. Instead of relying only on regex patterns (which break on every new invoice format), we feed the OCR text to an LLM with a structured prompt:

Extract seller GSTIN, buyer GSTIN
Extract invoice number, date
Extract each line item: description, HSN code, quantity, rate, amount
Extract CGST, SGST, IGST amounts and rates
Extract total, round-off, grand total

The LLM handles format variations, abbreviations, and messy OCR artifacts that would break any rule-based system.
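
A structured prompt along these lines could look like the sketch below. The JSON key names, the prompt wording, and the functions build_prompt, parse_response, and extract_fields are all assumptions made for illustration. The model call itself is injected as a callable (call_llm) so that the example does not depend on any particular client library.

```python
import json
from typing import Callable

# Hypothetical schema: these key names are illustrative, not a fixed contract.
REQUIRED_KEYS = {
    "seller_gstin", "buyer_gstin", "invoice_number", "invoice_date",
    "line_items", "taxes", "total", "round_off", "grand_total",
}

PROMPT_TEMPLATE = """You are extracting fields from the OCR text of an Indian GST invoice.
Return one JSON object with exactly these keys:
- seller_gstin, buyer_gstin
- invoice_number, invoice_date
- line_items: list of {{description, hsn_code, quantity, rate, amount}}
- taxes: {{cgst, sgst, igst}}, each with rate and amount
- total, round_off, grand_total
Use null for anything that is not present. Do not guess.

OCR text:
{ocr_text}
"""


def build_prompt(ocr_text: str) -> str:
    return PROMPT_TEMPLATE.format(ocr_text=ocr_text)


def parse_response(raw: str) -> dict:
    """Pull the JSON object out of a reply that may be wrapped in prose."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object in model response")
    data = json.loads(raw[start:end + 1])
    missing = REQUIRED_KEYS - set(data)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data


def extract_fields(ocr_text: str, call_llm: Callable[[str], str]) -> dict:
    """call_llm takes a prompt string and returns the model's text reply."""
    return parse_response(call_llm(build_prompt(ocr_text)))
```

Asking for null instead of a guess matters here: a missing field is easy to route to human review, while a plausible invented GSTIN is not.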

Stage 3: Validation and Cross-checking

We never trust the LLM output blindly:

GSTIN validation: check the format, verify the check digit, validate the state code
Math validation: do the line items sum to the subtotal? Does subtotal + tax = total, and total + round-off = grand total?
Tax rate validation: does the combined rate match a GST slab (0%, 5%, 12%, 18%, 28%)? On intra-state invoices CGST and SGST should each be half the slab (for example 9% + 9% for 18%)
Date validation: is the invoice date in a valid range?
Confidence scoring: flag low-confidence extractions for human review
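
The first three checks can be sketched as follows. This is a minimal illustration, not our validator. The check character uses the commonly described mod-36 scheme with alternating weights 1 and 2. The format pattern assumes a regular taxpayer GSTIN with Z in position 14, and a lookup of valid state codes is left out. The rate check treats CGST and SGST as two equal halves of a slab. All function names are hypothetical.

```python
import re
from decimal import Decimal

GSTIN_CHARS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
# Assumed layout: state code, PAN, entity code, "Z", check character.
GSTIN_PATTERN = re.compile(r"\d{2}[A-Z]{5}\d{4}[A-Z][1-9A-Z]Z[0-9A-Z]")
GST_SLABS = {Decimal(s) for s in ("0", "5", "12", "18", "28")}


def gstin_check_char(first_14: str) -> str:
    """Compute the 15th character from the first 14 (uppercase alphanumerics)."""
    total = 0
    for i, ch in enumerate(first_14):
        product = GSTIN_CHARS.index(ch) * (2 if i % 2 else 1)
        total += product // 36 + product % 36
    return GSTIN_CHARS[(36 - total % 36) % 36]


def is_valid_gstin(gstin: str) -> bool:
    gstin = gstin.strip().upper()
    if not GSTIN_PATTERN.fullmatch(gstin):
        return False
    return gstin_check_char(gstin[:14]) == gstin[14]


def totals_problems(line_amounts, subtotal, tax_amounts, round_off,
                    grand_total, tolerance=Decimal("0.01")):
    """Return a list of human-readable arithmetic mismatches (empty if none)."""
    problems = []
    if abs(sum(line_amounts, Decimal("0")) - subtotal) > tolerance:
        problems.append("line items do not sum to subtotal")
    expected = subtotal + sum(tax_amounts, Decimal("0")) + round_off
    if abs(expected - grand_total) > tolerance:
        problems.append("subtotal + tax + round-off does not equal grand total")
    return problems


def rates_ok(cgst, sgst, igst) -> bool:
    """Rates in percent; pass None for a component that is absent."""
    if igst is not None:
        return cgst is None and sgst is None and igst in GST_SLABS
    return cgst is not None and cgst == sgst and cgst + sgst in GST_SLABS
```

Amounts are kept as Decimal throughout so that paise-level comparisons do not pick up floating-point noise.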

Handling the Edge Cases

Spaced characters in scanned PDFs

A common OCR artifact: "G S T I N : 2 7 A A A C X 1 2 3 4 E 1 Z 5" instead of "GSTIN: 27AAACX1234E1Z5". We normalize by removing the spaces inside runs of single characters before extraction.
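
One way to do this normalization is sketched below. The function name collapse_spaced_runs and the min_run default of 5 are assumptions. A higher minimum avoids gluing together short legitimate sequences such as "2 x 5 =", at the cost of missing very short spaced words.

```python
import re


def collapse_spaced_runs(text: str, min_run: int = 5) -> str:
    """Remove the spaces inside runs of single characters separated by single spaces.

    A run is min_run or more one-character tokens in a row, bounded by
    whitespace or the ends of the string.
    """
    pattern = re.compile(r"(?<!\S)(?:\S ){%d,}\S(?!\S)" % (min_run - 1))
    return pattern.sub(lambda m: m.group(0).replace(" ", ""), text)


print(collapse_spaced_runs("G S T I N : 2 7 A A A C X 1 2 3 4 E 1 Z 5"))
# GSTIN:27AAACX1234E1Z5
```

When the OCR emits uniform single spaces, the original word boundary after the colon cannot be recovered, so the output has no space there. The label and value stay in order, which is enough for extraction.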

CGST amount vs CGST rate confusion

When the invoice says "CGST @ 9% Rs. 3,487.50", a naive regex grabs "9" as the amount. Our pipeline reads the number before the percentage sign as the rate and looks for the amount AFTER the percentage sign.
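
The "rate before the percent sign, amount after it" rule can be written as a single pattern. The sketch below is illustrative only: TAX_LINE and parse_tax_lines are hypothetical names, and lines that carry no percent sign are simply not matched.

```python
import re
from decimal import Decimal

# label, then the rate directly before "%", then the first number after "%"
TAX_LINE = re.compile(
    r"\b(CGST|SGST|IGST)\b[^\d\n]*?(\d+(?:\.\d+)?)\s*%[^\d\n]*?(\d[\d,]*(?:\.\d+)?)",
    re.IGNORECASE,
)


def parse_tax_lines(text: str) -> dict:
    """Map each tax label to a (rate_percent, amount) pair of Decimals."""
    found = {}
    for label, rate, amount in TAX_LINE.findall(text):
        # Dropping commas handles both 3,487.50 and Indian grouping like 1,23,456.00.
        found[label.upper()] = (Decimal(rate), Decimal(amount.replace(",", "")))
    return found


print(parse_tax_lines("CGST @ 9% Rs. 3,487.50"))
# {'CGST': (Decimal('9'), Decimal('3487.50'))}
```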

Sub Total vs Grand Total

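Many invoices have "Sub Total", "Total", "Grand Total", and "Net Amount" lines, each of which can mean something different. We use negative lookbehind patterns so that "Total" is not matched inside "Sub Total" or "Grand Total", and we cross-reference the candidates with the mathematical sum to identify the correct total.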
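
Both ideas, the negative lookbehind and the cross-check against the computed sum, are sketched below. The label list, the function names, and the tolerance of 1.00 rupee are assumptions chosen for the example. Python requires each lookbehind to have a fixed width, which is why "sub" and "grand" get separate lookbehinds.

```python
import re
from decimal import Decimal

AMOUNT = r"(\d[\d,]*(?:\.\d+)?)"

# "Total" that is NOT preceded by "Sub " / "Sub-" or "Grand " / "Grand-".
PLAIN_TOTAL = re.compile(
    r"(?<!sub[ -])(?<!grand[ -])\btotal\b[^\d\n]*?" + AMOUNT, re.IGNORECASE
)

# Any line that starts with one of the total-like labels (assumed label set).
TOTAL_LINE = re.compile(
    r"^[ \t]*(sub[ -]?total|grand[ -]?total|net amount|total)\b[^\d\n]*?" + AMOUNT,
    re.IGNORECASE | re.MULTILINE,
)


def to_decimal(raw: str) -> Decimal:
    return Decimal(raw.replace(",", ""))


def plain_totals(text: str) -> list:
    """Amounts labelled just 'Total', excluding Sub Total and Grand Total."""
    return [to_decimal(a) for a in PLAIN_TOTAL.findall(text)]


def pick_grand_total(text: str, expected: Decimal,
                     tolerance: Decimal = Decimal("1.00")):
    """Return the (label, amount) closest to the computed sum, or None."""
    candidates = [(label.lower(), to_decimal(amount))
                  for label, amount in TOTAL_LINE.findall(text)]
    close = [c for c in candidates if abs(c[1] - expected) <= tolerance]
    return min(close, key=lambda c: abs(c[1] - expected)) if close else None
```

Here expected would be the subtotal plus taxes plus round-off computed from the other extracted fields, so a mislabelled line cannot win on its label alone.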

Results

99.4% field-level accuracy on structured invoices (Tally, Zoho, Busy-generated)
96% accuracy on handwritten/informal invoices
Processing time: 2-4 seconds per invoice
14M+ documents processed in production

Tech Stack

OCR: Tesseract + Google Vision API
LLM: Claude API for extraction
Validation: Python with custom GSTIN/HSN validators
Queue: Redis for async processing
Storage: AWS S3 for documents, PostgreSQL for extracted data

Want This for Your Business?

If you process Indian invoices at scale — whether for accounting, GST filing, or ERP integration — we can build a custom pipeline for your specific document types. We have done this for healthcare billing, manufacturing purchase orders, and e-commerce vendor invoices.

First call is free. We will look at a sample of your invoices and tell you what accuracy to expect.
