Production-grade FastAPI backend for parsing Invoice (PDF) and Bank Statement (CSV) files into structured financial data.
- Overview
- Project Documentation
- Architecture
- Database Schema
- CSV Parser Pipeline
- File Handling & Edge Cases
- Quick Start
- API Documentation
- Running Tests
- Deployment
- Known Limitations
FinParse is an internal tool for processing financial documents from multiple vendors and banks. It:
- Accepts PDF invoices and CSV bank statements (+ XLSX auto-converted)
- Runs a 5-stage parsing pipeline that handles inconsistent real-world formats
- Persists structured data in PostgreSQL with normalized schema
- Exposes RESTful APIs for upload, retrieval, filtering, and deletion
- Handles 14 categories of file errors gracefully with structured error responses
Detailed architecture, design decisions, and testing logs are located in the docs/ folder:
- Database Design & ER Diagram: Explains the relational database table design, money safety rules, constraints, and holds the visual Mermaid Entity-Relationship (ER) diagram.
- Database Schema DDL Specification: Contains the exact SQL DDL specs, indices, and database decisions.
- File Handling & Edge Cases: Technical breakdown of the 3-stage validation pipeline and how PDF/CSV edge cases are resolved.
- Manual Testing Log: Structured template to document your manual testing steps via Swagger UI and attach validation screenshots.
┌─────────────────────────────────────────────────────────────────────┐
│ FastAPI (async) │
│ POST /api/v1/documents/upload → GET /api/v1/documents/{id} │
└─────────────┬───────────────────────────────────────────────────────┘
│
┌────────▼─────────┐
│ FileValidator │ ← 3-stage: extension → MIME → content
│ (3-stage) │
└────────┬─────────┘
│
┌────────▼─────────┐
│ DocumentService │ ← orchestrates parsing and DB writes
└────────┬─────────┘
│
┌────────▼─────────┐
│ ParserFactory │ ← resolves parser type dynamically
└──────┬─────┬─────┘
│ │
┌────────▼──┐ ┌──▼────────┐
│ CSVParser │ │ PDFParser │ ← both extend BaseParser
└────────┬──┘ └──┬────────┘
│ │
┌──────▼─────▼─────┐
│ PostgreSQL │ ← via SQLAlchemy (async)
└──────────────────┘
| Layer | Technology |
|---|---|
| API Framework | FastAPI 0.111 (async) |
| ORM | SQLAlchemy 2.0 (async) |
| Database | PostgreSQL 16 |
| Background Jobs | Celery + Redis (wired, optional) |
| Logging | structlog (JSON in prod, pretty in dev) |
| Containerization | Docker Compose |
documents (1) ──────────────── (N) processing_jobs
│
├── (1) invoices
│ └── (N) line_items
│ └── (N:1) vendors
│
├── (1) bank_statements
│ └── (N) bank_transactions
│
| Decision | Rationale |
|---|---|
documents is immutable |
Raw file artifact never mutated — source of truth |
processing_jobs separate from documents |
Supports re-parsing without data loss |
amount >= 0 + explicit direction ('C'/'D') |
No sign bugs across bank formats |
NUMERIC(18,4) for all money |
Never use floats for financial data |
checksum_sha256 UNIQUE |
Deduplication at DB level (race-condition safe) |
raw_* columns on every extracted table |
Original text preserved for debugging |
status = 'partial' as first-class state |
Real-world data is messy — partial success is valid |
See schema_design.md for full DDL.
The CSV parser runs a 5-stage pipeline, each stage independently testable:
Stage 1: FileReader → Encoding detection (chardet + fallbacks)
Stage 2: FormatDetector → Delimiter, header row discovery, column mapping
Stage 3: RowFilter → Skip blanks, metadata rows, summary rows
Stage 4: RowParser → Parse date + amount + description per row
Stage 5: PostProcessor → Balance continuity, currency detection, stats
Delimiters: , ; \t |
Encodings: UTF-8, UTF-8-BOM, Latin-1, Windows-1252, UTF-16
Amount formats:
| Format | Example |
|---|---|
| Standard | 1,234.56 |
| European | 1.234,56 |
| Indian grouping | 1,23,456.78 |
| Parenthetical negative | (1,234.56) |
| DR/CR suffix | 500.00 DR |
| Currency symbol | $500, ₹1,000, €250 |
Date formats:
| Format | Example |
|---|---|
| ISO | 2026-05-29 |
| UK/EU | 29/05/2026 |
| US | 05/29/2026 |
| Month name | 29-May-2026 |
| Excel serial | 46044 |
| Timestamp | 2026-05-29 14:30:00 |
The PDF parser runs a 4-stage pipeline to extract structured invoices:
Stage 1: Text & Table Extraction → Extract raw text and grid structures page-by-page using pdfplumber
Stage 2: Heuristic Field Mapping → Pattern matches core invoice headers (dates, amounts, invoice number, currency, vendor)
Stage 3: Table Parsing & Mapping → Identifies line items tabular sections, maps headers, and extracts line details
Stage 4: Post-Processing & Reconciliation → Verifies itemized math (sum of lines vs subtotal/total), computes extraction confidence, collects warnings
- Invoice Number: Matches patterns like
Invoice #:,INV-,Bill No:using flexible regexes. - Dates: Resolves
Invoice DateandDue Dateutilizing robust locale-aware date parsing. - Amounts: Extracts
Subtotal,Tax Amount,Discount Amount, andTotal Amountusing regex patterns and validates with babel parser. - Currency: Guesses currency based on ISO symbols/codes (e.g.
$,€,£,₹,USD,EUR) found near the totals. - Vendor Registry: Resolves vendor names using surrounding text labels, with a fallback to the first non-empty lines of the document structure. Deduplicates matching vendor names case-insensitively in the database.
| Case | Handling |
|---|---|
| Password-protected | Prompt for password via pdf_password form field; never stored |
| Wrong password | 422 PDF_WRONG_PASSWORD |
| Scanned (image-only) | OCR detection flag set; ocr_needed=true in job |
| Corrupted / truncated | 422 PDF_CORRUPTED |
| Too many pages | 422 PDF_TOO_MANY_PAGES (limit: env-configurable) |
| MIME spoofing | Magic byte check, 415 FILE_TYPE_MISMATCH |
| Case | Handling |
|---|---|
| Missing critical fields | Lowers extraction confidence score (each missing field deducts 0.15) |
| No tabular structure | Gracefully continues with empty line items; records warning |
| Table header column mismatch | Flexible column mapping using heuristic keyword lists |
| Invoice total math mismatch | Performs reconciliation check; flags warning on mismatch |
| Case | Handling |
|---|---|
| Auto-encoding detection | chardet + fallback chain |
| Mixed delimiters | Sniffer → scoring fallback |
| European decimal comma | Regex detection, auto-normalize |
| Metadata rows before header | Header row discovery (first 15 rows scanned) |
| Summary rows (Total, Balance) | Regex pattern skip |
| XLSX uploaded as CSV | Auto-converted via openpyxl |
| Max rows exceeded | Truncate + warning in job |
| Code | HTTP | Trigger |
|---|---|---|
DUPLICATE_FILE |
409 | Same SHA-256, use ?allow_reprocess=true to re-parse |
PDF_PASSWORD_REQUIRED |
422 | Encrypted PDF, no password given |
PDF_WRONG_PASSWORD |
422 | Wrong password |
PDF_CORRUPTED |
422 | Malformed/truncated PDF |
CSV_MISSING_REQUIRED_COLUMNS |
422 | No date or amount column found |
FILE_TOO_LARGE |
413 | Exceeds size limit |
FILE_TYPE_MISMATCH |
415 | MIME spoofing detected |
PARTIAL_PARSE |
207 | Parsed with warnings |
- Python 3.12+
- Docker & Docker Compose
- (Optional) PostgreSQL 16 if running without Docker
git clone https://github.com/Darshan-dlr/finparse-api.git
cd finparse-api
cp .env.example .env
# Edit .env with your values# Start PostgreSQL + Redis + API + Worker
docker-compose up --build
# Or just the database (run API locally)
docker-compose up -d db redis# Create virtual environment
python -m venv .venv
.venv\Scripts\activate # Windows
source .venv/bin/activate # Linux/macOS
# Install dependencies
pip install -r requirements.txt
# Start the API
uvicorn app.main:app --reload --port 8000http://localhost:8000/docs
| Method | Path | Description |
|---|---|---|
POST |
/api/v1/documents/upload |
Upload CSV/PDF for parsing |
GET |
/api/v1/documents/ |
List & filter documents |
GET |
/api/v1/documents/{id} |
Get document + parsed data |
GET |
/api/v1/documents/{id}/job |
Get latest job status |
DELETE |
/api/v1/documents/{id} |
Soft-delete document |
GET |
/health |
Health check |
# Upload a CSV bank statement
curl -X POST http://localhost:8000/api/v1/documents/upload \
-F "file=@tests/sample_files/standard_bank_statement.csv"
# Upload with duplicate handling
curl -X POST "http://localhost:8000/api/v1/documents/upload?allow_reprocess=true" \
-F "file=@tests/sample_files/standard_bank_statement.csv"
# Upload a password-protected PDF
curl -X POST http://localhost:8000/api/v1/documents/upload \
-F "file=@invoice.pdf" \
-F "pdf_password=secret123"# Filter by document type and status
curl "http://localhost:8000/api/v1/documents/?document_type=bank_statement&status=completed"
# Filter by currency
curl "http://localhost:8000/api/v1/documents/?currency=INR&limit=50"# Install dependencies
pip install -r requirements.txt
# Run all tests
pytest tests/ -v
# Run with coverage
pytest tests/ -v --cov=app --cov-report=html
# Run just the CSV parser tests (no DB needed)
pytest tests/test_csv_parser.py -v
# Run specific test class
pytest tests/test_csv_parser.py::TestAmountParser -v| Module | Tests |
|---|---|
amount_parser.py |
17 cases — all format variants |
date_parser.py |
13 cases — ISO, UK, US, Excel serial, ambiguous |
csv_parser.py |
13 integration cases — 3 real bank statement formats |
| Variable | Default | Description |
|---|---|---|
DATABASE_URL |
required | PostgreSQL async URL |
REDIS_URL |
redis://localhost:6379/0 |
Celery broker |
STORAGE_BACKEND |
local |
local | s3 | gcs |
MAX_PDF_SIZE_MB |
50 |
PDF upload limit |
MAX_CSV_SIZE_MB |
25 |
CSV upload limit |
MAX_PDF_PAGES |
200 |
Max pages per PDF |
MAX_CSV_ROWS |
100000 |
Max rows per CSV |
LOG_LEVEL |
INFO |
DEBUG | INFO | WARNING |
ENVIRONMENT |
development |
development | production |
┌─────────────┐ ┌──────────────┐ ┌──────────────┐
│ API Gateway│────▶│ ECS Fargate │────▶│ RDS Postgres│
│ (or ALB) │ │ (FastAPI) │ │ (db.t4g.med)│
└─────────────┘ └──────┬───────┘ └──────────────┘
│
┌──────▼───────┐ ┌──────────────┐
│ Celery │────▶│ ElastiCache │
│ Workers │ │ (Redis) │
└──────────────┘ └──────────────┘
│
┌──────▼───────┐
│ S3 Bucket │ ← file storage
└──────────────┘
Quick Render deploy:
- Connect GitHub repo to Render
- Add PostgreSQL service
- Set environment variables
- Deploy with
uvicorn app.main:app --host 0.0.0.0 --port $PORT
finparse-api/
├── app/
│ ├── main.py # FastAPI app + error handlers
│ ├── config.py # Settings (pydantic-settings)
│ ├── database.py # Async SQLAlchemy engine
│ ├── dependencies.py # FastAPI DI (DB session)
│ ├── core/
│ │ ├── exceptions.py # 14 typed exception classes
│ │ └── logging.py # structlog setup
│ ├── models/ # SQLAlchemy ORM models
│ │ ├── document.py
│ │ ├── processing_job.py
│ │ ├── bank_statement.py # BankStatement + BankTransaction
│ ├── parsers/
│ │ └── csv_parser.py # 5-stage CSV pipeline (main focus)
│ ├── utils/
│ │ ├── amount_parser.py # 9 amount format variants
│ │ └── date_parser.py # 8 date format variants + Excel
│ ├── validators/
│ │ └── file_validator.py # 3-stage file validation
│ ├── services/
│ │ └── document_service.py # Upload + parse + persist
│ └── api/v1/
│ ├── router.py
│ └── endpoints/
│ └── documents.py
├── tests/
│ ├── sample_files/
│ │ ├── standard_bank_statement.csv
│ │ ├── european_semicolon_statement.csv
│ │ └── hdfc_style_statement.csv
│ └── test_csv_parser.py # 43 test cases
├── docker-compose.yml
├── Dockerfile
├── requirements.txt
└── .env.example
-
Invoice parsing (PDF) — Fully implemented using a local heuristic and table extraction pipeline (
pdfplumber). For complex/varying formats in production, see the Advanced ML/AI Suggestions below. -
OCR for scanned PDFs — Detection is in place (
ocr_neededflag), but Tesseract/cloud OCR integration is not yet connected. Scanned-only PDFs will raise anOCRFailedError. -
Vendor deduplication — Exact case-insensitive match only. Fuzzy matching via
pg_trgmor sentence embeddings is marked as TODO. -
Cross-currency normalization — Amounts stored in original currency. Base currency conversion with exchange rates is marked as TODO.
-
Authentication — No auth implemented. Placeholder
uploaded_byfield ready for JWT/API key integration. -
Background processing — Celery is wired in
docker-compose.ymlbut parsing runs synchronously in the request for now. Move_parse_csv()and_parse_pdf()to Celery tasks for production. -
Storage — Local disk storage only. S3/GCS integration hooks are in place (swap
_save_to_storage()inDocumentService). -
Multi-invoice PDFs — Schema supports
invoice_indexandpage_range_*but multi-invoice splitting is not yet implemented in the parser.
In production systems, rule-based heuristics and standard regex engines can fail if invoices change layout or use complex multi-column structures. Below are recommended modern approaches to achieve near-100% parsing accuracy:
-
Multimodal LLMs (Recommended):
- Use APIs like Gemini Flash / Pro or GPT-4o to parse document pages (as images or extracted text layouts).
- Use Pydantic schemas with LLM Structured Outputs (JSON Schema enforcement) to parse invoice totals, vendor details, and line-item tables with high accuracy, automatically handling diverse formats.
-
Fine-Tuned Layout/Visual Transformers:
- Use open-source transformer models like LayoutLM (v1/v2/v3) or Donut (OCR-free document understander) to map text and coordinates (bounding boxes) to structured tables and headers.
-
Document AI SaaS Engines:
- Integrate with pre-trained specialized models like Google Cloud Document AI (Invoice Parser) or AWS Textract (Analyze Expense). These services automatically resolve fields, line items, tax breakdowns, and currencies with built-in OCR.