joel8779/ai-document-preprocessor

🗂️ AI Document Preprocessor

Convert any office document into clean, LLM-ready Markdown: locally, instantly, privately.

[Screenshot: Home screen]


Overview

AI Document Preprocessor is a standalone Windows desktop application that converts PDF, DOCX, PPTX, XLSX, HTML, and TXT files into optimised Markdown, ready to paste into ChatGPT, Claude, Gemini, or any LLM prompt.

All processing happens 100% locally. No API keys. No subscriptions. No data leaves your machine.


✨ Features

| Feature | Description |
| --- | --- |
| 📄 PDF → Markdown | Native text extraction via PyMuPDF; OCR fallback for scanned documents |
| 📝 DOCX → Markdown | Full structure preservation via python-docx + mammoth |
| 📊 PPTX → Markdown | Slide-by-slide extraction with speaker notes |
| 📈 XLSX → Markdown | Spreadsheet → formatted Markdown tables |
| 🌐 HTML → Markdown | Web page content extraction and cleaning |
| 📃 TXT → Markdown | Direct passthrough with light formatting |
| 🧹 AI Cleaning | Removes noise, normalises headers, fixes list formatting |
| 👁️ Live Preview | Raw / Clean / Rendered three-tab workspace |
| 🔢 Token Counter | Real-time token usage (tiktoken cl100k_base) |
| 📦 Smart Export | Auto-names files, remembers last folder, handles duplicates |
| 🔒 100% Local | Zero telemetry, zero network calls, zero cloud dependency |
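
The Token Counter row can be illustrated with a minimal sketch. It assumes the third-party tiktoken package is installed and uses its `get_encoding` and `encode` calls; the helper name `count_tokens` and the 4-characters-per-token fallback are assumptions for illustration, not the application's actual code.

```python
def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """Return the token count of `text` (hypothetical helper, not the app's API)."""
    if not text:
        return 0
    try:
        import tiktoken  # third-party; listed in the Tech Stack table

        encoding = tiktoken.get_encoding(encoding_name)
        return len(encoding.encode(text))
    except Exception:
        # Assumption: if tiktoken or its encoding data is unavailable,
        # fall back to a rough estimate of 4 characters per token.
        return max(1, len(text) // 4)
```

Calling `count_tokens(markdown_text)` after each cleaning pass would be one way to keep a counter up to date; the exact number depends on the encoding in use.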

Screenshots

[Screenshots: Home (upload), Converting, Preview]

📦 Installation

Option A: Installer (Recommended)

  1. Download AI-Document-Preprocessor-Setup.exe from the Releases page
  2. Double-click the installer and follow the wizard
  3. Launch AI Document Preprocessor from the Start menu or Desktop shortcut

Option B: Portable

  1. Download AI-Document-Preprocessor-Portable.zip from the Releases page
  2. Extract anywhere (USB drive, Desktop, etc.)
  3. Run AI-Document-Preprocessor.exe (no installation required)

System Requirements

  • Windows 10 / 11 (64-bit)
  • 4 GB RAM recommended
  • ~500 MB disk space

No Python installation required. Everything is bundled in the executable.


🚀 Usage

1. Upload a Document

Click Browse or drag-and-drop any supported file onto the upload panel.

Supported formats: .pdf .docx .pptx .xlsx .html .txt

2. Convert

Click Convert and watch the animated progress overlay. Conversion takes under 1 second for most documents.

3. Preview

The workspace shows three tabs:

| Tab | Contents |
| --- | --- |
| Raw | Original extracted Markdown, output unchanged |
| Clean | AI-optimised Markdown: noise removed, structure fixed |
| Rendered | Live Markdown preview with syntax highlighting |
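
To make the difference between the Raw and Clean tabs concrete, here is a minimal sketch of a rule-based cleaning pass. The three rules (dropping "Page N of M" lines, normalising heading spacing, converting "•" bullets) and the function name `clean_markdown` are assumptions chosen for illustration; the actual rules in cleaner.py are not documented here.

```python
import re


def clean_markdown(raw: str) -> str:
    """Apply a few illustrative cleaning rules to extracted Markdown."""
    lines = []
    for line in raw.splitlines():
        line = line.rstrip()
        # Noise: drop page-footer lines such as "Page 3" or "Page 3 of 10".
        if re.fullmatch(r"\s*page\s+\d+(\s+of\s+\d+)?", line, flags=re.IGNORECASE):
            continue
        # Headings: "##Title" or "##   Title" becomes "## Title".
        line = re.sub(r"^(#{1,6})\s*(?=\S)", r"\1 ", line)
        # Lists: typographic bullets become the Markdown "-" marker.
        line = re.sub(r"^(\s*)[•◦▪]\s+", r"\1- ", line)
        lines.append(line)
    # Collapse runs of blank lines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", "\n".join(lines))
    return text.strip() + "\n"
```

For example, `clean_markdown("##Intro\n\n\n\n• first\nPage 3 of 10\n")` returns `"## Intro\n\n- first\n"`.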

4. Export

| Action | Behaviour |
| --- | --- |
| Export | Opens system save dialog, defaults to ~/Downloads, pre-fills filename |
| Copy Markdown | Copies cleaned Markdown to clipboard |
| Open Folder | Opens the export folder in Windows Explorer |

Auto-generated filenames:

resume.pdf          →  resume_cleaned.md
slides.pptx         →  slides_cleaned.md
report (2).docx     →  report_cleaned (1).md   (if duplicate)
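
The naming scheme above can be sketched as follows. This is an illustration, not the application's code: the function name `export_path` is hypothetical, and stripping a trailing " (n)" from the source name is an assumption inferred from the third example.

```python
import re
from pathlib import Path


def export_path(source: Path, folder: Path) -> Path:
    """Return an unused '<stem>_cleaned.md' path inside `folder`."""
    # Assumption: a trailing " (n)" copy marker on the source name is dropped.
    stem = re.sub(r"\s*\(\d+\)$", "", source.stem)
    candidate = folder / f"{stem}_cleaned.md"
    counter = 1
    # On a name clash, append " (1)", " (2)", ... until the name is free.
    while candidate.exists():
        candidate = folder / f"{stem}_cleaned ({counter}).md"
        counter += 1
    return candidate
```

Checking `candidate.exists()` before saving avoids silently overwriting an earlier export; a real save dialog may additionally ask the user to confirm.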

🏗️ Architecture

ai-document-preprocessor/
├── app.py                            # Entry point, splash screen, crash handler
├── installer.py                      # GUI installer wizard
├── requirements.txt                  # Python dependencies
├── build_release.bat                 # Full release build pipeline
├── AI-Document-Preprocessor.spec     # PyInstaller packaging configuration
│
├── services/
│   ├── converter.py                  # Format detection + conversion pipeline
│   ├── cleaner.py                    # Markdown cleaning & LLM optimisation
│   ├── enhancer.py                   # Token counting, diagnostics
│   └── cache_manager.py              # Conversion result caching
│
├── ui/
│   ├── main_window.py                # Main orchestrator + async pipeline
│   ├── theme.py                      # Design tokens & colour system
│   └── components/
│       ├── upload_panel.py           # File drop / browse panel
│       ├── preview_panel.py          # Raw/Clean/Rendered tab workspace
│       ├── loading_overlay.py        # Animated conversion overlay
│       └── ...                       # Additional UI components
│
├── assets/                           # Application icons
└── docs/
    └── screenshots/                  # UI screenshots
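
The tree lists cache_manager.py for conversion result caching without describing how results are keyed. One plausible approach, shown here purely as an assumption, is an in-memory cache keyed by the file's extension and a SHA-256 hash of its bytes; the class name `ConversionCache` is hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Callable, Dict


class ConversionCache:
    """In-memory cache keyed by file extension and content hash (illustrative)."""

    def __init__(self) -> None:
        self._results: Dict[str, str] = {}

    @staticmethod
    def _key(path: Path) -> str:
        # Hashing the content means an edited file misses the cache,
        # while re-converting an unchanged file hits it.
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        return f"{path.suffix.lower()}:{digest}"

    def get_or_convert(self, path: Path, convert: Callable[[Path], str]) -> str:
        key = self._key(path)
        if key not in self._results:
            self._results[key] = convert(path)
        return self._results[key]
```

Keying on content rather than on path and modification time costs one full read per lookup, which is usually cheap next to a PDF or OCR conversion.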

Conversion Pipeline

File selected
    ↓
Extension-based type detection
    ↓
Format-specific converter
(PdfConverter / DocxConverter / PptxConverter / …)
    ↓
Raw Markdown extraction
    ↓
AI cleaning pass (cleaner.py)
    ↓
Token counting (tiktoken)
    ↓
Render to workspace preview
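
The first two steps of the pipeline, extension-based detection followed by a format-specific converter, can be sketched with a small registry. The README names converter classes (PdfConverter and so on); this sketch uses plain functions instead, and `CONVERTERS`, `convert_txt`, and `convert` are hypothetical names.

```python
from pathlib import Path
from typing import Callable, Dict


def convert_txt(path: Path) -> str:
    # TXT is described as a direct passthrough, so just read the text.
    return path.read_text(encoding="utf-8", errors="replace")


# Hypothetical registry: maps a lowercase extension to a converter function.
# Converters for .pdf, .docx, .pptx, .xlsx, and .html would be added here.
CONVERTERS: Dict[str, Callable[[Path], str]] = {
    ".txt": convert_txt,
}


def convert(path: Path) -> str:
    """Pick a converter from the file extension and run it."""
    suffix = path.suffix.lower()
    try:
        converter = CONVERTERS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}") from None
    return converter(path)
```

Detection by extension is fast and predictable, but it trusts the file name: a PDF renamed to .txt would be read as plain text in this sketch.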

All heavy work runs in a ThreadPoolExecutor off the UI thread. A 60-second timeout guard prevents indefinite hangs.
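
A minimal sketch of that pattern using the standard library follows. The helper name `run_with_timeout` and the pool size are assumptions; only the ThreadPoolExecutor and the 60-second limit come from the description above.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable, TypeVar

T = TypeVar("T")

# Assumption: a small shared pool; the real pool size is not documented.
_executor = ThreadPoolExecutor(max_workers=2)


def run_with_timeout(task: Callable[[], T], timeout_s: float = 60.0) -> T:
    """Run `task` on a worker thread and stop waiting after `timeout_s` seconds."""
    future = _executor.submit(task)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # cancel() only stops a task that has not started yet; a thread that
        # is already running cannot be interrupted and finishes in the background.
        future.cancel()
        raise TimeoutError(f"Conversion exceeded {timeout_s:g} s") from None
```

Note that the timeout bounds how long the caller waits, not how long the worker runs, so a stuck conversion still occupies a pool thread until it returns.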


🛠️ Tech Stack

| Layer | Technology |
| --- | --- |
| UI Framework | Flet 0.21+ (Flutter-based Python UI) |
| PDF Extraction | PyMuPDF (fitz) |
| Document Parsing | MarkItDown (Microsoft) |
| DOCX | python-docx + mammoth |
| PPTX | python-pptx |
| XLSX | openpyxl + pandas |
| Token Counting | tiktoken (OpenAI) |
| OCR (optional) | RapidOCR + ONNX Runtime |
| Packaging | PyInstaller 6.x |
| CI/CD | GitHub Actions |

⚠️ Known Limitations

  • Windows only in v1.0 (macOS / Linux planned for v1.1)
  • Scanned PDFs without embedded text rely on OCR, so accuracy varies
  • Very large files (>50 MB) may take 10–30 seconds
  • Complex XLSX merged cells may lose some structure in conversion
  • Password-protected documents are not supported
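
The merged-cell limitation is easier to see with a small example. When a spreadsheet is read with openpyxl, only the top-left cell of a merged range carries a value and the remaining cells read as None, so a row-by-row rendering leaves them blank. The sketch below is an illustration under that assumption; `rows_to_markdown` is a hypothetical helper, not the application's converter.

```python
from typing import Optional, Sequence


def rows_to_markdown(rows: Sequence[Sequence[Optional[object]]]) -> str:
    """Render rows (first row is the header) as a Markdown pipe table."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def fmt(row: Sequence[Optional[object]]) -> str:
        # None (e.g. the non-anchor cells of a merged range) becomes an empty cell.
        cells = ["" if v is None else str(v).replace("|", "\\|") for v in row]
        cells += [""] * (width - len(cells))  # pad short rows
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    lines = [fmt(header), "| " + " | ".join(["---"] * width) + " |"]
    lines += [fmt(row) for row in body]
    return "\n".join(lines)


# "North" spans two rows in the sheet; the second row reads back as None.
rows = [["Region", "Q1", "Q2"], ["North", 10, 12], [None, 7, 9]]
print(rows_to_markdown(rows))
```

The last row prints with an empty Region cell, which is the structure loss the limitation refers to; filling merged values forward before rendering is one possible mitigation.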

🔮 Roadmap

See ROADMAP.md for planned features including:

  • Cross-platform (macOS, Linux)
  • Batch / folder processing
  • Optional LLM-assisted cleaning
  • Additional format support (.eml, .epub, images)
  • REST API mode

🤝 Contributing

Contributions are welcome! Please read CONTRIBUTING.md to get started.

# Quick start for developers
git clone https://github.com/YOUR_USERNAME/ai-document-preprocessor.git
cd ai-document-preprocessor
pip install -r requirements.txt
python app.py

📄 License

This project is licensed under the MIT License; see LICENSE for details.


📋 Changelog

See CHANGELOG.md for a full history of changes.


Made with ❤️ for the AI / LLM developer community

⭐ Star this repo if you find it useful!
