Convert any office document into clean, LLM-ready Markdown: locally, instantly, privately.
AI Document Preprocessor is a standalone Windows desktop application that converts PDF, DOCX, PPTX, XLSX, HTML, and TXT files into optimised Markdown, ready to paste into ChatGPT, Claude, Gemini, or any LLM prompt.
All processing happens 100% locally. No API keys. No subscriptions. No data leaves your machine.
| Feature | Description |
|---|---|
| PDF → Markdown | Native text extraction via PyMuPDF; OCR fallback for scanned documents |
| DOCX → Markdown | Full structure preservation via python-docx + mammoth |
| PPTX → Markdown | Slide-by-slide extraction with speaker notes |
| XLSX → Markdown | Spreadsheet → formatted Markdown tables |
| HTML → Markdown | Web page content extraction and cleaning |
| TXT → Markdown | Direct passthrough with light formatting |
| AI Cleaning | Removes noise, normalises headers, fixes list formatting |
| Live Preview | Raw / Clean / Rendered three-tab workspace |
| Token Counter | Real-time token usage (tiktoken cl100k_base) |
| Smart Export | Auto-names files, remembers last folder, handles duplicates |
| 100% Local | Zero telemetry, zero network calls, zero cloud dependency |

Screenshots: Home / Upload, Converting, Preview

Installer:
- Download `AI-Document-Preprocessor-Setup.exe` from the Releases page
- Double-click the installer and follow the wizard
- Launch AI Document Preprocessor from the Start menu or Desktop shortcut

Portable:
- Download `AI-Document-Preprocessor-Portable.zip` from the Releases page
- Extract anywhere (USB drive, Desktop, etc.)
- Run `AI-Document-Preprocessor.exe`; no installation required

System requirements:
- Windows 10 / 11 (64-bit)
- 4 GB RAM recommended
- ~500 MB disk space

No Python installation required. Everything is bundled in the executable.
Click Browse, or drag and drop any supported file onto the upload panel.
Supported formats: `.pdf`, `.docx`, `.pptx`, `.xlsx`, `.html`, `.txt`
Click Convert and watch the animated progress overlay. Most documents convert in under 1 second.
The workspace shows three tabs:
| Tab | Contents |
|---|---|
| Raw | Original extracted Markdown, unmodified |
| Clean | AI-optimised Markdown: noise removed, structure fixed |
| Rendered | Live Markdown preview with syntax highlighting |
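
To illustrate the difference between the Raw and Clean tabs, here is a minimal sketch of what a rule-based cleaning pass might look like. It is not the project's `cleaner.py`; the function name `clean_markdown` and the three rules (header spacing, bullet markers, blank-line collapsing) are assumptions chosen for illustration.

```python
import re


def clean_markdown(raw: str) -> str:
    """Hypothetical cleaning pass: tidy headers, bullets and blank lines."""
    cleaned = []
    for line in raw.splitlines():
        # Drop trailing whitespace left over from extraction.
        line = line.rstrip()
        # Normalise ATX headers: "##Title" -> "## Title".
        line = re.sub(r"^(#{1,6})\s*(\S)", r"\1 \2", line)
        # Unify bullet markers: "* item" or "• item" -> "- item".
        line = re.sub(r"^(\s*)[*•]\s+", r"\1- ", line)
        cleaned.append(line)
    text = "\n".join(cleaned)
    # Collapse runs of blank lines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip() + "\n"
```

A real cleaner would need more care (for example, skipping lines inside code blocks), so treat this only as an outline of the idea.
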

| Action | Behaviour |
|---|---|
| Export | Opens system save dialog, defaults to ~/Downloads, pre-fills filename |
| Copy Markdown | Copies cleaned Markdown to clipboard |
| Open Folder | Opens the export folder in Windows Explorer |
Auto-generated filenames:

    resume.pdf → resume_cleaned.md
    slides.pptx → slides_cleaned.md
    report (2).docx → report_cleaned (1).md (if duplicate)
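
The naming rule above could be sketched roughly as follows. This is an illustration, not the application's code: the function name `export_name` and the `existing` parameter are hypothetical, and the sketch assumes that a trailing " (n)" copy marker is stripped from the source name before `_cleaned` is appended, which is one reading of the third example.

```python
import re
from pathlib import Path


def export_name(source: str, existing: set[str]) -> str:
    """Hypothetical helper: derive an export filename that avoids collisions."""
    stem = Path(source).stem
    # Assumption: drop a trailing copy marker such as " (2)".
    stem = re.sub(r"\s*\(\d+\)$", "", stem)
    candidate = f"{stem}_cleaned.md"
    counter = 1
    # Append " (1)", " (2)", ... until the name is unused.
    while candidate in existing:
        candidate = f"{stem}_cleaned ({counter}).md"
        counter += 1
    return candidate
```

Here `existing` stands in for the filenames already present in the export folder; a real implementation would more likely check the folder on disk.
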

Project structure:

    ai-document-preprocessor/
    ├── app.py                          # Entry point, splash screen, crash handler
    ├── installer.py                    # GUI installer wizard
    ├── requirements.txt                # Python dependencies
    ├── build_release.bat               # Full release build pipeline
    ├── AI-Document-Preprocessor.spec   # PyInstaller packaging configuration
    │
    ├── services/
    │   ├── converter.py                # Format detection + conversion pipeline
    │   ├── cleaner.py                  # Markdown cleaning & LLM optimisation
    │   ├── enhancer.py                 # Token counting, diagnostics
    │   └── cache_manager.py            # Conversion result caching
    │
    ├── ui/
    │   ├── main_window.py              # Main orchestrator + async pipeline
    │   ├── theme.py                    # Design tokens & colour system
    │   └── components/
    │       ├── upload_panel.py         # File drop / browse panel
    │       ├── preview_panel.py        # Raw/Clean/Rendered tab workspace
    │       ├── loading_overlay.py      # Animated conversion overlay
    │       └── ...                     # Additional UI components
    │
    ├── assets/                         # Application icons
    └── docs/
        └── screenshots/                # UI screenshots

Conversion pipeline:

    File selected
        ↓
    Extension-based type detection
        ↓
    Format-specific converter
    (PdfConverter / DocxConverter / PptxConverter / …)
        ↓
    Raw Markdown extraction
        ↓
    AI cleaning pass (cleaner.py)
        ↓
    Token counting (tiktoken)
        ↓
    Render to workspace preview

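
The extension-based detection step in the pipeline above could look something like the sketch below. The project's converters are classes such as `PdfConverter`; the registry, the `register` decorator, and the function names here are hypothetical and only show the dispatch idea, with a plain-text converter as the single registered example.

```python
from pathlib import Path
from typing import Callable, Dict

# Hypothetical registry mapping a lower-case extension to a converter function.
CONVERTERS: Dict[str, Callable[[Path], str]] = {}


def register(*extensions: str):
    """Decorator that registers a converter for one or more extensions."""
    def wrap(func: Callable[[Path], str]) -> Callable[[Path], str]:
        for ext in extensions:
            CONVERTERS[ext.lower()] = func
        return func
    return wrap


@register(".txt")
def convert_txt(path: Path) -> str:
    # Plain text passes through unchanged.
    return path.read_text(encoding="utf-8")


def detect_converter(filename: str) -> Callable[[Path], str]:
    """Pick a converter from the file extension alone (no content sniffing)."""
    ext = Path(filename).suffix.lower()
    if ext not in CONVERTERS:
        raise ValueError(f"Unsupported file type: {ext or '(no extension)'}")
    return CONVERTERS[ext]
```

Because detection relies only on the extension, a mislabelled file (say, a PDF renamed to `.txt`) would be routed to the wrong converter; that trade-off keeps the lookup simple and fast.
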
All heavy work runs in a ThreadPoolExecutor off the UI thread.
A 60-second timeout guard prevents indefinite hangs.
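
A minimal sketch of this pattern, using only the standard library, is shown below. The helper name `run_with_timeout` and the pool size are assumptions; how the application wires this into its UI event loop is not described here.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Shared worker pool; the pool size is an assumption.
_executor = ThreadPoolExecutor(max_workers=2)


def run_with_timeout(func, *args, timeout: float = 60.0):
    """Run func(*args) on a worker thread and wait at most `timeout` seconds."""
    future = _executor.submit(func, *args)
    try:
        return future.result(timeout=timeout)
    except FutureTimeout:
        # cancel() only helps if the task has not started; a running thread
        # cannot be interrupted and will finish in the background.
        future.cancel()
        raise TimeoutError(f"Conversion exceeded {timeout:g} seconds") from None


if __name__ == "__main__":
    print(run_with_timeout(str.upper, "done", timeout=1.0))
    try:
        run_with_timeout(time.sleep, 0.2, timeout=0.05)
    except TimeoutError as exc:
        print(exc)
```

Note that `future.result()` blocks its caller, so in a desktop app the wait itself would also need to happen away from the UI thread, for example inside an async task.
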
| Layer | Technology |
|---|---|
| UI Framework | Flet 0.21+ (Flutter-based Python UI) |
| PDF Extraction | PyMuPDF (fitz) |
| Document Parsing | MarkItDown (Microsoft) |
| DOCX | python-docx + mammoth |
| PPTX | python-pptx |
| XLSX | openpyxl + pandas |
| Token Counting | tiktoken (OpenAI) |
| OCR (optional) | RapidOCR + ONNX Runtime |
| Packaging | PyInstaller 6.x |
| CI/CD | GitHub Actions |
- Windows only in v1.0 (macOS / Linux planned for v1.1)
- Scanned PDFs without embedded text rely on OCR, so accuracy varies
- Very large files (>50 MB) may take 10–30 seconds
- Complex XLSX merged cells may lose some structure in conversion
- Password-protected documents are not supported
See ROADMAP.md for planned features including:
- Cross-platform (macOS, Linux)
- Batch / folder processing
- Optional LLM-assisted cleaning
- Additional format support (`.eml`, `.epub`, images)
- REST API mode
Contributions are welcome! Please read CONTRIBUTING.md to get started.

    # Quick start for developers
    git clone https://github.com/YOUR_USERNAME/ai-document-preprocessor.git
    cd ai-document-preprocessor
    pip install -r requirements.txt
    python app.py

This project is licensed under the MIT License; see LICENSE for details.
See CHANGELOG.md for a full history of changes.
Made with ❤️ for the AI / LLM developer community
⭐ Star this repo if you find it useful!


