AdRAGSearch is a lightweight single-agent, tool-augmented RAG application for asking questions over a mixed knowledge base of web pages and local documents. The current MVP combines Streamlit for the user interface, LangChain for document processing and tool integration, FAISS for semantic retrieval, LangGraph for workflow orchestration, and OpenAI models for embeddings and answer generation.
This repository is a starting point for document Q&A assistants, internal knowledge search, research copilots, and retrieval-based AI product experiments.
The app loads content from:
- configured web URLs
- local PDF files in the data/ directory
- local .txt files through the ingestion layer when you add them as sources
That content is split into chunks, embedded with OpenAI embeddings, stored in a FAISS vector index, and queried through a LangGraph workflow. The answer generation step uses an agent that can choose between:
- a local retriever tool for indexed project documents
- a Wikipedia tool for broader public knowledge
The system is agentic because the responder can choose tools, but it is still a single-agent design rather than a multi-agent system.
The result is a simple but useful hybrid RAG experience: grounded answers from your indexed corpus, with the ability to reach beyond it when the question needs general context.
RAG improves LLM answers by retrieving relevant source content first, then using that context during generation. In this repo, the retrieved context comes from a FAISS vector store built from your document corpus.
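The retrieve-then-generate idea can be illustrated with a minimal, dependency-free sketch. This is not the repo's code: word-count vectors and cosine similarity stand in for OpenAI embeddings and the FAISS index, and the names `embed`, `retrieve`, and `build_prompt` are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query (stand-in for a vector store search)."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context: list[str]) -> str:
    """Place the retrieved context ahead of the question, as a RAG prompt would."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"


chunks = [
    "attention lets a model weigh tokens",
    "faiss stores vectors for similarity search",
    "streamlit renders the user interface",
]
top = retrieve("how does faiss search vectors", chunks, k=1)
print(build_prompt("how does faiss search vectors", top))
```

In the actual app, the embedding and search steps are handled by OpenAI embeddings and FAISS; only the overall shape (embed, rank, then prompt with context) carries over from this sketch.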
This repo implements single-agent agentic RAG rather than multi-agent orchestration. The answering step is tool-enabled. The model can decide whether to use:
- the internal retriever for indexed documents
- Wikipedia for general background knowledge
That makes the system more flexible than a basic retrieve-then-answer pipeline.
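The tool choice described above can be sketched as a small dispatch step. This is an illustration under assumptions, not the repo's agent: in the real app the LLM decides which tool to call, whereas here a hypothetical `choose_tool` callable stands in for that decision, and both tool functions are placeholders that return labeled strings.

```python
from typing import Callable

Tool = Callable[[str], str]


def retriever_tool(query: str) -> str:
    """Placeholder for a search over the indexed documents."""
    return f"[indexed docs] results for: {query}"


def wikipedia_tool(query: str) -> str:
    """Placeholder for a Wikipedia lookup."""
    return f"[wikipedia] summary for: {query}"


# Registry of tools the single agent is allowed to call.
TOOLS: dict[str, Tool] = {
    "retriever": retriever_tool,
    "wikipedia": wikipedia_tool,
}


def run_agent_step(question: str, choose_tool: Callable[[str, list[str]], str]) -> str:
    """Ask the decision function for a tool name, then run that tool."""
    name = choose_tool(question, sorted(TOOLS))
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](question)


# A fixed choice stands in for the model's decision in this example.
print(run_agent_step("Who proposed the Transformer?", lambda q, names: "wikipedia"))
```

Keeping the tools in one registry means a further tool can be added without touching the dispatch logic.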
LangGraph is used to define the execution flow as explicit nodes:
- retrieve relevant documents
- generate the final answer
This makes the application easier to extend later with grading, query rewriting, routing, memory, or human-in-the-loop steps.
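To show why explicit nodes are easy to extend, here is a plain-Python stand-in for the two-node flow. It does not use LangGraph itself (there, `StateGraph` with `add_node` and `add_edge` plays this role), and the state fields `question`, `retrieved_docs`, and `answer` are assumed names rather than the ones in src/state/rag_state.py.

```python
from typing import Callable, TypedDict


class RAGState(TypedDict, total=False):
    """Shared state passed from node to node (field names are assumptions)."""
    question: str
    retrieved_docs: list[str]
    answer: str


Node = Callable[[RAGState], dict]


def retriever_node(state: RAGState) -> dict:
    """Return a partial state update holding the retrieved documents."""
    question = state["question"]
    return {"retrieved_docs": [f"chunk relevant to: {question}"]}


def responder_node(state: RAGState) -> dict:
    """Return a partial state update holding the final answer."""
    question = state["question"]
    context = " | ".join(state["retrieved_docs"])
    return {"answer": f"Answer to {question!r} using: {context}"}


def run_graph(nodes: list[tuple[str, Node]], state: RAGState) -> RAGState:
    """Run nodes in order, merging each partial update into the shared state."""
    for _name, node in nodes:
        state = {**state, **node(state)}
    return state


PIPELINE = [("retriever", retriever_node), ("responder", responder_node)]
result = run_graph(PIPELINE, {"question": "What is FAISS?"})
print(result["answer"])
```

Under this design, adding a grading or query-rewriting step amounts to defining one more node function and placing it in the sequence.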
- Mixed-source ingestion from web URLs, PDF directories, single PDFs, and text files
- Recursive chunking for document preprocessing
- OpenAI embeddings with FAISS-backed semantic retrieval
- LangGraph-based orchestration for a clean RAG execution flow
- Single-agent answer generation with tool access to both the retriever and Wikipedia
- Streamlit UI with question input, answer display, source previews, and recent search history
- Cached system initialization to avoid rebuilding the full pipeline on every interaction
- Better grounded answers than plain chat because responses can use indexed source material
- More flexible than basic RAG because the single answering agent can use both indexed documents and Wikipedia when needed
- Easy to understand and demo because the architecture is small and the UI is simple
- Easy to extend into a more advanced AI product because ingestion, retrieval, state, graph, and UI are already separated into modules
- Useful as an MVP foundation for internal search assistants, document copilots, and research exploration tools
flowchart TD
A[Configured URLs + local data directory] --> B[DocumentProcessor]
B --> C[Chunked documents]
C --> D[OpenAI Embeddings]
D --> E[FAISS Vector Store]
E --> F[Retriever]
F --> G[LangGraph: retriever node]
G --> H[LangGraph: responder node]
H --> I[Tool-enabled agent]
I --> J[Retriever tool]
I --> K[Wikipedia tool]
H --> L[Final answer + source references]
L --> M[Streamlit UI]
- streamlit_app.py initializes the system on first load.
- DocumentProcessor loads configured URLs and the local data/ directory.
- Documents are split into chunks using RecursiveCharacterTextSplitter.
- VectorStore embeds the chunks with OpenAIEmbeddings and stores them in FAISS.
- GraphBuilder creates a two-step LangGraph workflow:
  - retriever
  - responder
- The responder node builds a tool-using agent with:
  - a retriever tool over the FAISS index
  - a Wikipedia lookup tool
- The Streamlit UI sends a user question into the graph and displays:
  - the final answer
  - indexed document chunks used for retrieval
  - external references captured during tool use
  - recent search history
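The chunking step in the walkthrough above relies on two parameters, chunk size and chunk overlap. The sketch below shows how they interact using a simple fixed-size sliding window. It is a deliberate simplification: RecursiveCharacterTextSplitter also tries to split on separators such as paragraph and line breaks, which this hypothetical `chunk_text` helper does not attempt.

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Split text into fixed-size windows where neighbours share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text,
        # so no trailing chunk is fully contained in the previous one.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


pieces = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2)
print(pieces)
```

A larger overlap keeps more shared context between neighbouring chunks at the cost of more chunks to embed and store.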
The current UI is intentionally minimal and demo-friendly:
- A centered Streamlit page for quick question answering
- Automatic startup and document indexing on first load
- A single search box and submit button
- Answer output shown immediately after processing
- A Sources Used expander showing indexed document chunks and external references
- A recent-search section showing the last few queries and answers
This makes the project easy to demo without introducing extra UI complexity.
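Indexing on first load only pays off if later interactions reuse the result. The sketch below shows that pattern with the standard library's `functools.lru_cache`; a Streamlit app would more typically use the `st.cache_resource` decorator for this, and whether the repo does so is an assumption. The `initialize_system` function and its return value are hypothetical.

```python
from functools import lru_cache

BUILD_COUNT = 0  # Counts how often the expensive setup actually runs.


@lru_cache(maxsize=1)
def initialize_system() -> dict:
    """Build the pipeline once; later calls return the same cached object."""
    global BUILD_COUNT
    BUILD_COUNT += 1
    # Placeholder for loading documents, chunking, embedding, and indexing.
    return {"index": ["chunk-1", "chunk-2"], "ready": True}


system = initialize_system()
same_system = initialize_system()  # Served from the cache, no rebuild.
print(BUILD_COUNT, system is same_system)
```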
Out of the box, the app uses:
- two configured blog URLs in src/config/config.py
- the local data/ directory as the default PDF source
The repository currently includes sample content such as:
data/Attentionisallyouneed.pdf
To change the default knowledge base, update:
- Config.DEFAULT_URLS
- Config.DEFAULT_PDF_DIR
AdRAGSearch/
|-- streamlit_app.py # Main Streamlit app and UI flow
|-- main.py # Minimal placeholder entry point
|-- data/ # Sample local documents
|-- src/
| |-- config/config.py # Model and source configuration
| |-- document_ingestion/ # Source loading and chunking
| |-- vectorstore/ # Embeddings and FAISS retrieval
| |-- state/rag_state.py # Shared graph state
| |-- nodes/reactnode.py # Retrieval + agent answer nodes
| |-- graph_builder/graph_builder.py # LangGraph workflow assembly
|-- pyproject.toml # Dependencies and package metadata
- Python 3.13+
- Streamlit
- LangChain
- LangGraph
- OpenAI API
- FAISS
- PyPDF / PyMuPDF
- BeautifulSoup / WebBaseLoader
- Wikipedia tool integration
git clone <your-repo-url>
cd AdRAGSearch
This repository is currently set up around uv:
uv sync
Create a .env file in the project root:
OPENAI_API_KEY=your_openai_api_key
Start the app:
streamlit run streamlit_app.py
Then open the local Streamlit URL in your browser.
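A missing API key is a common first-run failure, so a small preflight check can help. The sketch below assumes the .env file has already been loaded into the process environment (for example by python-dotenv); the `require_api_key` helper is hypothetical and not part of the repo.

```python
import os
from typing import Mapping, Optional


def require_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the OpenAI API key, or fail early with a readable message."""
    source = os.environ if env is None else env
    key = source.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; add it to .env or export it in your shell"
        )
    return key


# Example with an explicit mapping instead of the real environment.
print(require_api_key({"OPENAI_API_KEY": "your_openai_api_key"}))
```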
Most important settings live in src/config/config.py:
- LLM_MODEL controls the chat model used for answer generation
- CHUNK_SIZE controls document chunk size
- CHUNK_OVERLAP controls overlap between chunks
- DEFAULT_URLS controls which web sources are indexed at startup
- DEFAULT_PDF_DIR controls which local directory is scanned for PDFs
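For orientation, a config module with these settings might look like the sketch below. Only the setting names come from this README; every value is a placeholder, and the `validate_config` helper is a hypothetical addition that does not exist in the repo.

```python
class Config:
    """Illustrative settings holder; all values below are placeholders."""

    LLM_MODEL = "your-chat-model-name"
    CHUNK_SIZE = 500
    CHUNK_OVERLAP = 50
    DEFAULT_URLS = [
        "https://example.com/first-post",
        "https://example.com/second-post",
    ]
    DEFAULT_PDF_DIR = "data"


def validate_config(cfg: type) -> None:
    """Catch settings that would break chunking or leave nothing to index."""
    if cfg.CHUNK_SIZE <= 0:
        raise ValueError("CHUNK_SIZE must be positive")
    if not 0 <= cfg.CHUNK_OVERLAP < cfg.CHUNK_SIZE:
        raise ValueError("CHUNK_OVERLAP must be smaller than CHUNK_SIZE")
    if not cfg.DEFAULT_URLS and not cfg.DEFAULT_PDF_DIR:
        raise ValueError("configure at least one source to index")


validate_config(Config)
print(Config.CHUNK_SIZE, Config.CHUNK_OVERLAP)
```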
- Search across research papers and related web content
- Build a lightweight internal knowledge assistant
- Prototype agentic document Q&A workflows
- Demonstrate LangGraph plus Streamlit in a small end-to-end application
- Use as a starter repo for more advanced retrieval systems
The current implementation is intentionally small and focused.
- The vector store is built at startup and kept in memory rather than persisted
- Source selection is configuration-driven, not user-upload driven
- External source capture currently covers Wikipedia lookups, not general live internet search
- There is no authentication, database, background job system, or observability layer
- There are no automated tests in the repo yet
- streamlit_app.py is the primary entry point
As a compact demonstration of practical AI product building, this repo already covers several ideas:
- document ingestion from multiple source types
- semantic retrieval with embeddings and FAISS
- explicit workflow orchestration with LangGraph
- single-agent tool use layered on top of RAG
- a working UI that makes the system immediately demoable
That combination makes it a good showcase project for single-agent agentic AI, applied RAG, and end-to-end product prototyping.