Document AI · GST · OCR · Indian invoices · data extraction

How to Extract Data from Indian GST Invoices Using AI

April 12, 2026 · 11 min read · By Xceed Engineering

Indian invoices are uniquely challenging for document AI. Between GSTINs, CGST/SGST/IGST splits, HSN codes, and the sheer variety of formats across states and businesses, most off-the-shelf OCR tools fail. We built a custom pipeline that handles this. Here is what we learned.

Why Indian Invoices Are Hard

Unlike standardized invoices in the US or EU, Indian invoices have:

  • GSTIN validation: 15-character alphanumeric codes that follow a specific format (2-digit state code + 10-character PAN + entity number + the letter Z + check character)
  • Multiple tax components: CGST + SGST for intra-state supplies, IGST for inter-state supplies, plus cess on certain items
  • HSN/SAC codes: 4-, 6-, or 8-digit HSN codes for goods and 6-digit SAC codes for services, listed against each line item
  • Diverse formats: every business uses different invoice software, layouts, and fonts, and some invoices are handwritten
  • Regional languages: invoices in Hindi, Marathi, or Tamil alongside English
  • Poor scan quality: faded thermal prints, stamped receipts, photographed documents
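
The CGST/SGST versus IGST rule can be written down in a few lines. The following is a minimal Python sketch, not our production code. It assumes the place of supply is already known as a two-digit state code and that the full slab rate is given. The function name `expected_tax_split` and its parameters are illustrative, and real place-of-supply rules have exceptions that this ignores.

```python
from decimal import Decimal, ROUND_HALF_UP

PAISE = Decimal("0.01")


def expected_tax_split(seller_gstin: str, place_of_supply: str,
                       taxable_value: Decimal, gst_rate_pct: Decimal) -> dict:
    """Return the expected CGST/SGST/IGST amounts for one taxable value.

    Assumption: the supply is intra-state when the seller's state code (the
    first two characters of the GSTIN) equals the place-of-supply state code.
    """
    def pct(rate: Decimal) -> Decimal:
        # Round to the nearest paisa
        return (taxable_value * rate / 100).quantize(PAISE, rounding=ROUND_HALF_UP)

    zero = Decimal("0.00")
    if seller_gstin[:2] == place_of_supply:
        half = pct(gst_rate_pct / 2)  # CGST and SGST each carry half the slab rate
        return {"cgst": half, "sgst": half, "igst": zero}
    return {"cgst": zero, "sgst": zero, "igst": pct(gst_rate_pct)}
```

For a taxable value of Rs. 38,750 at the 18% slab, an intra-state supply yields Rs. 3,487.50 each of CGST and SGST, while an inter-state supply yields Rs. 6,975.00 of IGST.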

Our Three-Stage Pipeline

Stage 1: OCR with Pre-processing

Raw OCR on Indian documents gives 70-80% accuracy. We push this to 95%+ with pre-processing:

  • Deskew and rotation correction
  • Contrast enhancement for faded prints
  • Noise removal for scanned documents
  • Text region detection before OCR (avoids processing logos, images)
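
To illustrate one of these steps, here is a dependency-free Python sketch of contrast enhancement by percentile stretching. It treats a grayscale image as a list of rows of 0-255 integers, which is an assumption made to keep the example self-contained. A real pipeline would normally do this with an image library, and the function name `stretch_contrast` and its default percentiles are hypothetical.

```python
def stretch_contrast(pixels, low_pct=2.0, high_pct=98.0):
    """Linearly rescale grayscale values so the chosen percentiles map to 0 and 255.

    Values below the low percentile clip to 0 and values above the high
    percentile clip to 255, which spreads a faded print over the full range.
    """
    flat = sorted(v for row in pixels for v in row)
    if not flat:
        return []
    lo = flat[int((len(flat) - 1) * low_pct / 100)]
    hi = flat[int((len(flat) - 1) * high_pct / 100)]
    if hi <= lo:
        # Uniform image: nothing to stretch, return a copy
        return [list(row) for row in pixels]
    scale = 255.0 / (hi - lo)
    return [[min(255, max(0, round((v - lo) * scale))) for v in row]
            for row in pixels]
```

Clipping at percentiles rather than at the absolute minimum and maximum keeps a few stray dark or bright pixels (dust, stamp ink) from limiting the stretch.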

We use Tesseract with custom-trained models for Hindi/Marathi, plus Google Vision API as a fallback for difficult documents.
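
The fallback decision itself can be sketched independently of either OCR engine. In the Python sketch below the engines are passed in as plain callables, so no real Tesseract or Google Vision calls appear. `OcrResult`, `ocr_with_fallback`, and the 80-point threshold are all hypothetical, and the sketch assumes both engines report confidence on a comparable 0-100 scale.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class OcrResult:
    text: str
    confidence: float  # mean confidence, 0-100 (assumed comparable across engines)
    engine: str


OcrEngine = Callable[[bytes], OcrResult]


def ocr_with_fallback(image: bytes, primary: OcrEngine, fallback: OcrEngine,
                      min_confidence: float = 80.0) -> OcrResult:
    """Run the primary engine; call the fallback only when the result looks weak."""
    first = primary(image)
    if first.text.strip() and first.confidence >= min_confidence:
        return first
    second = fallback(image)
    # Keep whichever result scored higher, so a poor fallback cannot make things worse
    return second if second.confidence > first.confidence else first
```

Calling the fallback only on weak results keeps the paid API off the common path, at the cost of a second round trip on difficult documents.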

Stage 2: LLM-Powered Field Extraction

This stage does most of the work. Instead of relying on regex patterns alone (which break on every new invoice format), we feed the OCR text to an LLM with a structured prompt that asks it to:

  • Extract seller GSTIN, buyer GSTIN
  • Extract invoice number, date
  • Extract each line item: description, HSN code, quantity, rate, amount
  • Extract CGST, SGST, IGST amounts and rates
  • Extract total, round-off, grand total

The LLM handles format variations, abbreviations, and messy OCR artifacts that would break any rule-based system.
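
A structured prompt of this kind, and the parsing of the reply, might look like the Python sketch below. The key names, the prompt wording, and the helpers `build_prompt` and `parse_response` are assumptions made for illustration. The call to the LLM API is deliberately left out.

```python
import json
from decimal import Decimal

# Hypothetical output schema, mirroring the fields listed above
REQUIRED_KEYS = {"seller_gstin", "buyer_gstin", "invoice_number", "invoice_date",
                 "line_items", "taxes", "total", "round_off", "grand_total"}


def build_prompt(ocr_text: str) -> str:
    """Assemble an extraction prompt that asks for a single JSON object."""
    keys = ", ".join(sorted(REQUIRED_KEYS))
    return (
        "Extract the following fields from the Indian GST invoice text below.\n"
        f"Return a single JSON object with exactly these keys: {keys}.\n"
        "line_items is a list of objects with description, hsn_code, quantity, "
        "rate and amount. taxes maps CGST, SGST and IGST to objects with rate "
        "and amount. Use null for anything that is not present.\n\n"
        f"Invoice text:\n{ocr_text}"
    )


def parse_response(raw: str) -> dict:
    """Pull the JSON object out of a model reply and check that all keys exist."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model reply")
    # parse_float=Decimal keeps money values exact instead of binary floats
    data = json.loads(raw[start:end + 1], parse_float=Decimal)
    missing = REQUIRED_KEYS - set(data)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data
```

Slicing from the first `{` to the last `}` tolerates replies that wrap the JSON in a sentence or a code fence. A reply that fails to parse or lacks a key raises an error rather than passing partial data downstream.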

Stage 3: Validation and Cross-checking

We never trust the LLM output blindly:

  • GSTIN validation: check format, verify check digit, validate state code
  • Math validation: do line items sum to subtotal? Does subtotal + tax = total?
  • Tax rate validation: does the IGST rate, or the CGST and SGST rates combined, match a valid GST slab (0%, 5%, 12%, 18%, 28%)? CGST and SGST are each half of the slab, so a 9% CGST line belongs to the 18% slab
  • Date validation: is the invoice date in a valid range?
  • Confidence scoring: flag low-confidence extractions for human review
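
As an illustration of the first check, the Python sketch below validates a GSTIN using the commonly documented mod-36 check-character scheme. It covers the regular-taxpayer layout only. The helper names are hypothetical, and the set of accepted state codes (01-38 plus 97) is an assumption that should be checked against the current official list.

```python
import re

CHARSET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
# Regular-taxpayer layout: 2-digit state code, 10-character PAN, entity code,
# the letter Z, and a check character
GSTIN_RE = re.compile(r"[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z][1-9A-Z]Z[0-9A-Z]")
# Assumption: codes 01-38 plus 97; verify against the official list
STATE_CODES = {f"{n:02d}" for n in range(1, 39)} | {"97"}


def gstin_check_char(first14: str) -> str:
    """Compute the 15th character from the first 14 (weighted mod-36 checksum)."""
    total = 0
    for i, ch in enumerate(first14):
        product = CHARSET.index(ch) * (1 if i % 2 == 0 else 2)  # weights 1,2,1,2,...
        total += product // 36 + product % 36
    return CHARSET[(36 - total % 36) % 36]


def validate_gstin(gstin: str) -> list:
    """Return a list of problems; an empty list means the GSTIN looks valid."""
    gstin = gstin.strip().upper()
    if not GSTIN_RE.fullmatch(gstin):
        return ["format"]
    problems = []
    if gstin[:2] not in STATE_CODES:
        problems.append("state_code")
    if gstin_check_char(gstin[:14]) != gstin[14]:
        problems.append("check_digit")
    return problems
```

With the sample value `27AAPFU0939F1ZV`, `validate_gstin` returns an empty list. The check character is what catches single-character OCR slips such as `0` read as `O`, since the format pattern alone often still matches.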

Handling the Edge Cases

Spaced characters in scanned PDFs

A common OCR artifact: "G S T I N : 2 7 A A A C X 1 2 3 4 E 1 Z 5" instead of "GSTIN: 27AAACX1234E1Z5". We normalize by removing the spaces inside runs of space-separated single characters before extraction.
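
One way to implement this normalization is a single regex substitution, sketched below in Python. The minimum run length of four characters is an assumed threshold, chosen so that short legitimate sequences such as "2 x 5" are left alone. The name `collapse_spaced_text` is illustrative.

```python
import re

# Four or more single characters separated by single spaces, bounded by
# whitespace or the ends of the string
SPACED_RUN = re.compile(r"(?<!\S)(?:\S ){3,}\S(?!\S)")


def collapse_spaced_text(text: str) -> str:
    """Join each run of space-separated single characters into one token."""
    return SPACED_RUN.sub(lambda m: m.group(0).replace(" ", ""), text)
```

Applied to the example above, this returns "GSTIN:27AAACX1234E1Z5". The space after the colon is lost along with the others, so downstream patterns should not depend on it.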

CGST amount vs CGST rate confusion

When the invoice says "CGST @ 9% Rs. 3,487.50", a naive regex grabs "9" as the amount. Our pipeline treats the number before the percentage sign as the rate and looks for the amount after it.
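
A pattern that anchors on the percent sign makes the distinction explicit. The Python sketch below is one possible version. The set of component labels and the name `parse_tax_line` are assumptions, and it only handles lines where the rate precedes the amount.

```python
import re
from decimal import Decimal

TAX_LINE = re.compile(
    r"\b(CGST|SGST|UTGST|IGST)\b"       # tax component label
    r"\s*@?\s*(\d+(?:\.\d+)?)\s*%"       # rate: the number directly before '%'
    r"[^\d\n]*?"                         # currency marker such as Rs. or INR
    r"(\d[\d,]*(?:\.\d{1,2})?)",         # amount: the first number after '%'
    re.IGNORECASE,
)


def parse_tax_line(line: str):
    """Return component, rate and amount for a tax line, or None if no match."""
    m = TAX_LINE.search(line)
    if m is None:
        return None
    return {
        "component": m.group(1).upper(),
        "rate_pct": Decimal(m.group(2)),
        "amount": Decimal(m.group(3).replace(",", "")),  # drop grouping commas
    }
```

For "CGST @ 9% Rs. 3,487.50" this yields a rate of 9 and an amount of 3487.50. Because commas are stripped before conversion, Indian digit grouping such as "1,23,456.00" parses the same way.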

Sub Total vs Grand Total

Many invoices have "Sub Total", "Total", "Grand Total", and "Net Amount", and these labels can refer to different amounts. We use negative lookbehind patterns (so that "Total" does not match inside "Sub Total") and cross-reference with the mathematical sum to identify the correct total.
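
The cross-referencing step can be sketched as follows in Python: given every labelled amount found on the invoice, pick the one that reconciles with line items plus taxes plus round-off. The function name `pick_grand_total`, its parameters, and the one-paisa tolerance are illustrative assumptions.

```python
from decimal import Decimal


def pick_grand_total(candidates: dict, line_amounts: list, tax_amounts: list,
                     round_off: Decimal = Decimal("0"),
                     tolerance: Decimal = Decimal("0.01")):
    """Return the label whose value equals line items + taxes + round-off.

    candidates maps a label found on the invoice (e.g. "Sub Total") to its value.
    """
    expected = (sum(line_amounts, Decimal("0"))
                + sum(tax_amounts, Decimal("0"))
                + round_off)
    for label, value in candidates.items():
        if abs(value - expected) <= tolerance:
            return label
    return None  # nothing reconciles: a signal to send the invoice to human review
```

Letting the arithmetic decide means the label text only has to produce candidates. Which candidate is the payable total no longer depends on how a particular invoice template words it.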

Results

  • 99.4% field-level accuracy on structured invoices (Tally, Zoho, Busy-generated)
  • 96% accuracy on handwritten/informal invoices
  • Processing time: 2-4 seconds per invoice
  • 14M+ documents processed in production

Tech Stack

  • OCR: Tesseract + Google Vision API
  • LLM: Claude API for extraction
  • Validation: Python with custom GSTIN/HSN validators
  • Queue: Redis for async processing
  • Storage: AWS S3 for documents, PostgreSQL for extracted data

Want This for Your Business?

If you process Indian invoices at scale — whether for accounting, GST filing, or ERP integration — we can build a custom pipeline for your specific document types. We have done this for healthcare billing, manufacturing purchase orders, and e-commerce vendor invoices.

First call is free. We will look at a sample of your invoices and tell you what accuracy to expect.
