A lightweight PDF analysis tool that splits large documents into sentence-based chunks and summarizes them with the Gemini 2.5 Flash API. It is built to handle long PDFs while keeping the final output natural and readable rather than AI-sounding.
- Sentence-Aware Chunking: Uses NLTK to split the document only at sentence boundaries, ensuring the flow of ideas stays intact.
- Chunk-Based AI Analysis: Each chunk is summarized with Gemini 2.5 Flash using a structured, human-friendly prompt.
- Merged Final Summary: All chunk summaries are combined and refined into one smooth, coherent final document.
- Clean Output: Summaries include simple explanations, bullet notes, key terms, and helpful insights.
- No Image Extraction: Pure text-based PDF analysis for clean and reliable results.
- Colab-Ready: Fully optimized to run in Google Colab with file upload support.
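
The sentence-aware chunking described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the `max_chars` size limit, and the regex splitter are assumptions made so the example runs without extra downloads. The project itself uses NLTK for the sentence split.

```python
import re
from typing import Callable, List


def regex_sentences(text: str) -> List[str]:
    """Naive stand-in splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def chunk_by_sentence(
    text: str,
    max_chars: int = 4000,  # assumed limit; the README does not state a chunk size
    split_sentences: Callable[[str], List[str]] = regex_sentences,
) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars becomes its own oversized chunk
    instead of being cut in the middle.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for sentence in split_sentences(text):
        # +1 accounts for the joining space when the chunk already has content
        added = len(sentence) + (1 if current else 0)
        if current and current_len + added > max_chars:
            chunks.append(" ".join(current))
            current, current_len = [], 0
            added = len(sentence)
        current.append(sentence)
        current_len += added
    if current:
        chunks.append(" ".join(current))
    return chunks
```

To use NLTK instead of the regex stand-in, pass `split_sentences=nltk.sent_tokenize` after downloading NLTK's Punkt tokenizer data.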
- Google Colab (recommended)
- Gemini API Key
- Python libraries:
  - NLTK
  - PyPDF2
  - google-genai
- Upload a PDF file.
- Extract all text from the PDF.
- Automatically break the text into semantic chunks.
- Send each chunk to Gemini 2.5 Flash for analysis.
- Collect and merge all summaries into a final, polished output.
- Display the full summary in a human-friendly format.
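
The steps above might be wired together along these lines. This is a hedged sketch rather than the notebook's real code: the helper names and both prompt strings are hypothetical, and the library imports are done lazily inside the functions so the summarize-then-merge logic can be exercised without PyPDF2 or google-genai installed.

```python
from typing import Callable, List


def extract_pdf_text(path: str) -> str:
    """Concatenate the text of every page (assumes PyPDF2 3.x is installed)."""
    from PyPDF2 import PdfReader  # lazy import: only needed when reading a PDF

    reader = PdfReader(path)
    # extract_text() may return nothing for image-only pages
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def make_gemini_summarizer(api_key: str, model: str = "gemini-2.5-flash") -> Callable[[str], str]:
    """Return a prompt -> text function backed by the google-genai client."""
    from google import genai  # lazy import: requires the google-genai package

    client = genai.Client(api_key=api_key)

    def ask(prompt: str) -> str:
        response = client.models.generate_content(model=model, contents=prompt)
        return response.text or ""

    return ask


# Hypothetical prompts; the project's real wording is not shown in this README.
CHUNK_PROMPT = "Summarize this passage in plain language with bullet notes and key terms:\n\n{chunk}"
MERGE_PROMPT = "Combine these partial summaries into one coherent summary:\n\n{summaries}"


def summarize_chunks(chunks: List[str], ask: Callable[[str], str]) -> str:
    """Summarize each chunk, then merge the partial summaries in one final call."""
    if not chunks:
        return ""
    partial = [ask(CHUNK_PROMPT.format(chunk=c)) for c in chunks]
    if len(partial) == 1:
        return partial[0]  # nothing to merge
    return ask(MERGE_PROMPT.format(summaries="\n\n".join(partial)))
```

In a notebook this would be called as `summarize_chunks(chunks, make_gemini_summarizer(api_key))`, where `chunks` comes from the sentence-based splitter. Passing the model call in as a plain function keeps the merge logic easy to test with a stub.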
- Handles long PDFs beyond LLM token limits
- Prevents mid-sentence cuts using smart chunking
- Produces structured, easy-to-read summaries
- Ensures natural writing instead of robotic AI tone
- Academic notes
- Research paper analysis
- Business document summaries
- Book breakdowns
- Report simplification
Your API key is not stored in the notebook when you share it; you enter it manually at the start of every new session.
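
One way to prompt for the key at runtime, so it never appears in a saved cell, is shown below. The function name and the injectable `prompt_fn` parameter are illustrative assumptions, and the notebook may handle this differently.

```python
import getpass
from typing import Callable


def read_api_key(prompt_fn: Callable[[str], str] = getpass.getpass) -> str:
    """Ask for the key interactively; getpass hides the typed value."""
    key = prompt_fn("Gemini API key: ").strip()
    if not key:
        raise ValueError("An API key is required for this session.")
    return key
```

Calling `api_key = read_api_key()` at the top of the notebook asks for the key once per session and keeps it only in memory.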
MIT License — free to use and modify.