FinParse API 🏦

Production-grade FastAPI backend for parsing Invoice (PDF) and Bank Statement (CSV) files into structured financial data.

📋 Table of Contents

Overview
Project Documentation
Architecture
Database Schema
CSV Parser Pipeline
File Handling & Edge Cases
Quick Start
API Documentation
Running Tests
Deployment
Known Limitations

Overview

FinParse is an internal tool for processing financial documents from multiple vendors and banks. It:

Accepts PDF invoices and CSV bank statements (+ XLSX auto-converted)
Runs a 5-stage parsing pipeline that handles inconsistent real-world formats
Persists structured data in PostgreSQL with normalized schema
Exposes RESTful APIs for upload, retrieval, filtering, and deletion
Handles 14 categories of file errors gracefully with structured error responses

📁 Project Documentation

Detailed architecture, design decisions, and testing logs are located in the docs/ folder:

Database Design & ER Diagram: Explains the relational database table design, money safety rules, constraints, and holds the visual Mermaid Entity-Relationship (ER) diagram.
Database Schema DDL Specification: Contains the exact SQL DDL specs, indices, and database decisions.
File Handling & Edge Cases: Technical breakdown of the 3-stage validation pipeline and how PDF/CSV edge cases are resolved.
Manual Testing Log: Structured template to document your manual testing steps via Swagger UI and attach validation screenshots.

Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                          FastAPI (async)                            │
│  POST /api/v1/documents/upload  →  GET /api/v1/documents/{id}      │
└─────────────┬───────────────────────────────────────────────────────┘
              │
     ┌────────▼─────────┐
     │  FileValidator   │   ← 3-stage: extension → MIME → content
     │  (3-stage)       │
     └────────┬─────────┘
              │
     ┌────────▼─────────┐
     │  DocumentService │   ← orchestrates parsing and DB writes
     └────────┬─────────┘
              │
     ┌────────▼─────────┐
     │  ParserFactory   │   ← resolves parser type dynamically
     └──────┬─────┬─────┘
            │     │
   ┌────────▼──┐ ┌──▼────────┐
   │ CSVParser │ │ PDFParser │   ← both extend BaseParser
   └────────┬──┘ └──┬────────┘
            │     │
     ┌──────▼─────▼─────┐
     │  PostgreSQL      │   ← via SQLAlchemy (async)
     └──────────────────┘

Tech Stack

Layer	Technology
API Framework	FastAPI 0.111 (async)
ORM	SQLAlchemy 2.0 (async)
Database	PostgreSQL 16
Background Jobs	Celery + Redis (wired, optional)
Logging	structlog (JSON in prod, pretty in dev)
Containerization	Docker Compose

Database Schema

Entity Overview

documents (1) ──────────────── (N) processing_jobs
    │
    ├── (1) invoices
    │         └── (N) line_items
    │         └── (N:1) vendors
    │
    ├── (1) bank_statements
    │         └── (N) bank_transactions
    │

Key Design Decisions

Decision	Rationale
`documents` is immutable	Raw file artifact never mutated — source of truth
`processing_jobs` separate from `documents`	Supports re-parsing without data loss
`amount >= 0` + explicit `direction ('C'/'D')`	No sign bugs across bank formats
`NUMERIC(18,4)` for all money	Never use floats for financial data
`checksum_sha256 UNIQUE`	Deduplication at DB level (race-condition safe)
`raw_*` columns on every extracted table	Original text preserved for debugging
`status = 'partial'` as first-class state	Real-world data is messy — partial success is valid

See schema_design.md for full DDL.

CSV Parser Pipeline

The CSV parser runs a 5-stage pipeline, each stage independently testable:

Stage 1: FileReader      → Encoding detection (chardet + fallbacks)
Stage 2: FormatDetector  → Delimiter, header row discovery, column mapping
Stage 3: RowFilter       → Skip blanks, metadata rows, summary rows
Stage 4: RowParser       → Parse date + amount + description per row
Stage 5: PostProcessor   → Balance continuity, currency detection, stats

Supported Formats

Delimiters: , ; \t |

Encodings: UTF-8, UTF-8-BOM, Latin-1, Windows-1252, UTF-16

Amount formats:

Format	Example
Standard	`1,234.56`
European	`1.234,56`
Indian grouping	`1,23,456.78`
Parenthetical negative	`(1,234.56)`
DR/CR suffix	`500.00 DR`
Currency symbol	`$500`, `₹1,000`, `€250`

Date formats:

Format	Example
ISO	`2026-05-29`
UK/EU	`29/05/2026`
US	`05/29/2026`
Month name	`29-May-2026`
Excel serial	`46044`
Timestamp	`2026-05-29 14:30:00`

PDF Parser Pipeline

The PDF parser runs a 4-stage pipeline to extract structured invoices:

Stage 1: Text & Table Extraction         → Extract raw text and grid structures page-by-page using pdfplumber
Stage 2: Heuristic Field Mapping          → Pattern matches core invoice headers (dates, amounts, invoice number, currency, vendor)
Stage 3: Table Parsing & Mapping          → Identifies line items tabular sections, maps headers, and extracts line details
Stage 4: Post-Processing & Reconciliation  → Verifies itemized math (sum of lines vs subtotal/total), computes extraction confidence, collects warnings

Core Heuristics & Extracted Fields

Invoice Number: Matches patterns like Invoice #:, INV-, Bill No: using flexible regexes.
Dates: Resolves Invoice Date and Due Date utilizing robust locale-aware date parsing.
Amounts: Extracts Subtotal, Tax Amount, Discount Amount, and Total Amount using regex patterns and validates with babel parser.
Currency: Guesses currency based on ISO symbols/codes (e.g. $, €, £, ₹, USD, EUR) found near the totals.
Vendor Registry: Resolves vendor names using surrounding text labels, with a fallback to the first non-empty lines of the document structure. Deduplicates matching vendor names case-insensitively in the database.

File Handling & Edge Cases

PDF Validations

Case	Handling
Password-protected	Prompt for password via `pdf_password` form field; never stored
Wrong password	`422 PDF_WRONG_PASSWORD`
Scanned (image-only)	OCR detection flag set; `ocr_needed=true` in job
Corrupted / truncated	`422 PDF_CORRUPTED`
Too many pages	`422 PDF_TOO_MANY_PAGES` (limit: env-configurable)
MIME spoofing	Magic byte check, `415 FILE_TYPE_MISMATCH`

PDF Parser Edge Cases

Case	Handling
Missing critical fields	Lowers extraction confidence score (each missing field deducts 0.15)
No tabular structure	Gracefully continues with empty line items; records warning
Table header column mismatch	Flexible column mapping using heuristic keyword lists
Invoice total math mismatch	Performs reconciliation check; flags warning on mismatch

CSV Edge Cases

Case	Handling
Auto-encoding detection	chardet + fallback chain
Mixed delimiters	Sniffer → scoring fallback
European decimal comma	Regex detection, auto-normalize
Metadata rows before header	Header row discovery (first 15 rows scanned)
Summary rows (Total, Balance)	Regex pattern skip
XLSX uploaded as CSV	Auto-converted via openpyxl
Max rows exceeded	Truncate + warning in job

Error Code Reference

Code	HTTP	Trigger
`DUPLICATE_FILE`	409	Same SHA-256, use `?allow_reprocess=true` to re-parse
`PDF_PASSWORD_REQUIRED`	422	Encrypted PDF, no password given
`PDF_WRONG_PASSWORD`	422	Wrong password
`PDF_CORRUPTED`	422	Malformed/truncated PDF
`CSV_MISSING_REQUIRED_COLUMNS`	422	No date or amount column found
`FILE_TOO_LARGE`	413	Exceeds size limit
`FILE_TYPE_MISMATCH`	415	MIME spoofing detected
`PARTIAL_PARSE`	207	Parsed with warnings

Quick Start

Prerequisites

Python 3.12+
Docker & Docker Compose
(Optional) PostgreSQL 16 if running without Docker

1. Clone and configure

git clone https://github.com/Darshan-dlr/finparse-api.git
cd finparse-api
cp .env.example .env
# Edit .env with your values

2. Start with Docker Compose

# Start PostgreSQL + Redis + API + Worker
docker-compose up --build

# Or just the database (run API locally)
docker-compose up -d db redis

3. Run locally (without Docker)

# Create virtual environment
python -m venv .venv
.venv\Scripts\activate          # Windows
source .venv/bin/activate       # Linux/macOS

# Install dependencies
pip install -r requirements.txt

# Start the API
uvicorn app.main:app --reload --port 8000

4. Access Swagger UI

http://localhost:8000/docs

API Documentation

Endpoints

Method	Path	Description
`POST`	`/api/v1/documents/upload`	Upload CSV/PDF for parsing
`GET`	`/api/v1/documents/`	List & filter documents
`GET`	`/api/v1/documents/{id}`	Get document + parsed data
`GET`	`/api/v1/documents/{id}/job`	Get latest job status
`DELETE`	`/api/v1/documents/{id}`	Soft-delete document
`GET`	`/health`	Health check

Upload Example

# Upload a CSV bank statement
curl -X POST http://localhost:8000/api/v1/documents/upload \
  -F "file=@tests/sample_files/standard_bank_statement.csv"

# Upload with duplicate handling
curl -X POST "http://localhost:8000/api/v1/documents/upload?allow_reprocess=true" \
  -F "file=@tests/sample_files/standard_bank_statement.csv"

# Upload a password-protected PDF
curl -X POST http://localhost:8000/api/v1/documents/upload \
  -F "file=@invoice.pdf" \
  -F "pdf_password=secret123"

Filter Example

# Filter by document type and status
curl "http://localhost:8000/api/v1/documents/?document_type=bank_statement&status=completed"

# Filter by currency
curl "http://localhost:8000/api/v1/documents/?currency=INR&limit=50"

Running Tests

# Install dependencies
pip install -r requirements.txt

# Run all tests
pytest tests/ -v

# Run with coverage
pytest tests/ -v --cov=app --cov-report=html

# Run just the CSV parser tests (no DB needed)
pytest tests/test_csv_parser.py -v

# Run specific test class
pytest tests/test_csv_parser.py::TestAmountParser -v

Test Coverage

Module	Tests
`amount_parser.py`	17 cases — all format variants
`date_parser.py`	13 cases — ISO, UK, US, Excel serial, ambiguous
`csv_parser.py`	13 integration cases — 3 real bank statement formats

Deployment

Environment Variables

Variable	Default	Description
`DATABASE_URL`	required	PostgreSQL async URL
`REDIS_URL`	`redis://localhost:6379/0`	Celery broker
`STORAGE_BACKEND`	`local`	`local` \| `s3` \| `gcs`
`MAX_PDF_SIZE_MB`	`50`	PDF upload limit
`MAX_CSV_SIZE_MB`	`25`	CSV upload limit
`MAX_PDF_PAGES`	`200`	Max pages per PDF
`MAX_CSV_ROWS`	`100000`	Max rows per CSV
`LOG_LEVEL`	`INFO`	`DEBUG` \| `INFO` \| `WARNING`
`ENVIRONMENT`	`development`	`development` \| `production`

Cloud Deployment (AWS)

┌─────────────┐     ┌──────────────┐     ┌──────────────┐
│  API Gateway│────▶│  ECS Fargate │────▶│  RDS Postgres│
│  (or ALB)   │     │  (FastAPI)   │     │  (db.t4g.med)│
└─────────────┘     └──────┬───────┘     └──────────────┘
                           │
                    ┌──────▼───────┐     ┌──────────────┐
                    │  Celery      │────▶│  ElastiCache │
                    │  Workers     │     │  (Redis)     │
                    └──────────────┘     └──────────────┘
                           │
                    ┌──────▼───────┐
                    │  S3 Bucket   │  ← file storage
                    └──────────────┘

Quick Render deploy:

Connect GitHub repo to Render
Add PostgreSQL service
Set environment variables
Deploy with uvicorn app.main:app --host 0.0.0.0 --port $PORT

Project Structure

finparse-api/
├── app/
│   ├── main.py                    # FastAPI app + error handlers
│   ├── config.py                  # Settings (pydantic-settings)
│   ├── database.py                # Async SQLAlchemy engine
│   ├── dependencies.py            # FastAPI DI (DB session)
│   ├── core/
│   │   ├── exceptions.py          # 14 typed exception classes
│   │   └── logging.py             # structlog setup
│   ├── models/                    # SQLAlchemy ORM models
│   │   ├── document.py
│   │   ├── processing_job.py
│   │   ├── bank_statement.py      # BankStatement + BankTransaction
│   ├── parsers/
│   │   └── csv_parser.py          # 5-stage CSV pipeline (main focus)
│   ├── utils/
│   │   ├── amount_parser.py       # 9 amount format variants
│   │   └── date_parser.py         # 8 date format variants + Excel
│   ├── validators/
│   │   └── file_validator.py      # 3-stage file validation
│   ├── services/
│   │   └── document_service.py    # Upload + parse + persist
│   └── api/v1/
│       ├── router.py
│       └── endpoints/
│           └── documents.py
├── tests/
│   ├── sample_files/
│   │   ├── standard_bank_statement.csv
│   │   ├── european_semicolon_statement.csv
│   │   └── hdfc_style_statement.csv
│   └── test_csv_parser.py         # 43 test cases
├── docker-compose.yml
├── Dockerfile
├── requirements.txt
└── .env.example

Known Limitations & Assumptions

Invoice parsing (PDF) — Fully implemented using a local heuristic and table extraction pipeline (pdfplumber). For complex/varying formats in production, see the Advanced ML/AI Suggestions below.
OCR for scanned PDFs — Detection is in place (ocr_needed flag), but Tesseract/cloud OCR integration is not yet connected. Scanned-only PDFs will raise an OCRFailedError.
Vendor deduplication — Exact case-insensitive match only. Fuzzy matching via pg_trgm or sentence embeddings is marked as TODO.
Cross-currency normalization — Amounts stored in original currency. Base currency conversion with exchange rates is marked as TODO.
Authentication — No auth implemented. Placeholder uploaded_by field ready for JWT/API key integration.
Background processing — Celery is wired in docker-compose.yml but parsing runs synchronously in the request for now. Move _parse_csv() and _parse_pdf() to Celery tasks for production.
Storage — Local disk storage only. S3/GCS integration hooks are in place (swap _save_to_storage() in DocumentService).
Multi-invoice PDFs — Schema supports invoice_index and page_range_* but multi-invoice splitting is not yet implemented in the parser.

Advanced ML/AI Document Parsing Suggestions

In production systems, rule-based heuristics and standard regex engines can fail if invoices change layout or use complex multi-column structures. Below are recommended modern approaches to achieve near-100% parsing accuracy:

Multimodal LLMs (Recommended):
- Use APIs like Gemini Flash / Pro or GPT-4o to parse document pages (as images or extracted text layouts).
- Use Pydantic schemas with LLM Structured Outputs (JSON Schema enforcement) to parse invoice totals, vendor details, and line-item tables with high accuracy, automatically handling diverse formats.
Fine-Tuned Layout/Visual Transformers:
- Use open-source transformer models like LayoutLM (v1/v2/v3) or Donut (OCR-free document understander) to map text and coordinates (bounding boxes) to structured tables and headers.
Document AI SaaS Engines:
- Integrate with pre-trained specialized models like Google Cloud Document AI (Invoice Parser) or AWS Textract (Analyze Expense). These services automatically resolve fields, line items, tax breakdowns, and currencies with built-in OCR.

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
.github/workflows		.github/workflows
alembic		alembic
app		app
docs		docs
tests		tests
.env.example		.env.example
.gitignore		.gitignore
Dockerfile		Dockerfile
README.md		README.md
alembic.ini		alembic.ini
docker-compose.yml		docker-compose.yml
pytest.ini		pytest.ini
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation