AdRAGSearch

AdRAGSearch is a lightweight single-agent, tool-augmented RAG application for asking questions over a mixed knowledge base of web pages and local documents. The current MVP combines Streamlit for the user interface, LangChain for document processing and tool integration, FAISS for semantic retrieval, LangGraph for workflow orchestration, and OpenAI models for embeddings and answer generation.

This repository is meant as a starting point for document Q&A assistants, internal knowledge search, research copilots, and other retrieval-based AI product experiments.

Overview

The app loads content from:

configured web URLs
local PDF files in the data/ directory
local .txt files, which the ingestion layer supports once you add them as sources

That content is split into chunks, embedded with OpenAI embeddings, stored in a FAISS vector index, and queried through a LangGraph workflow. The answer generation step uses an agent that can choose between:

a local retriever tool for indexed project documents
a Wikipedia tool for broader public knowledge

The system is agentic because the responder can choose tools, but it is still a single-agent design rather than a multi-agent system.

The result is a simple but useful hybrid RAG experience: grounded answers from your indexed corpus, with the ability to reach beyond it when the question needs general context.

Concepts Behind The Repo

Retrieval-Augmented Generation (RAG)

RAG improves LLM answers by retrieving relevant source content first, then using that context during generation. In this repo, the retrieved context comes from a FAISS vector store built from your document corpus.
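
The retrieve-then-generate idea can be sketched without any external services. The example below is a toy illustration, not this repo's code: word-count vectors stand in for OpenAI embeddings, a sorted list stands in for FAISS, and the function names (embed, retrieve, build_prompt) are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for OpenAI embeddings)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question (stand-in for a FAISS search)."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context: list[str]) -> str:
    """Place the retrieved context ahead of the question for the generator."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"


chunks = [
    "The Transformer relies entirely on attention mechanisms.",
    "Streamlit renders the search box and the answer.",
    "FAISS stores dense vectors for similarity search.",
]
question = "What does the Transformer rely on?"
top = retrieve(question, chunks, k=1)
print(build_prompt(question, top))
```

In the real pipeline the prompt would be sent to the chat model; here it is only printed to show where the retrieved context ends up.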

Agentic RAG

This repo implements single-agent agentic RAG rather than multi-agent orchestration. The answering step is tool-enabled. The model can decide whether to use:

the internal retriever for indexed documents
Wikipedia for general background knowledge

That makes the system more flexible than a basic retrieve-then-answer pipeline.
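
As a rough sketch of what tool selection means, the example below dispatches a question to one of two tools. Everything here is assumed for illustration: the tool functions return canned strings, and keyword_policy is a crude placeholder for the decision that the language model makes in the actual app.

```python
from typing import Callable


# Hypothetical stand-ins; the real tools query the FAISS index and Wikipedia.
def retriever_tool(query: str) -> str:
    return f"[indexed documents] passages matching: {query}"


def wikipedia_tool(query: str) -> str:
    return f"[wikipedia] summary for: {query}"


TOOLS: dict[str, Callable[[str], str]] = {
    "retriever": retriever_tool,
    "wikipedia": wikipedia_tool,
}


def run_agent_step(question: str, choose_tool: Callable[[str], str]) -> tuple[str, str]:
    """Ask the policy which tool to use, call it, and return (tool name, observation)."""
    name = choose_tool(question)
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return name, TOOLS[name](question)


def keyword_policy(question: str) -> str:
    """Placeholder for the model's decision: a keyword rule, for illustration only."""
    return "retriever" if "paper" in question.lower() else "wikipedia"


print(run_agent_step("What does the paper say about attention?", keyword_policy))
print(run_agent_step("Who founded OpenAI?", keyword_policy))
```

The point of the design is that the dispatch code stays the same regardless of how the choice is made, so swapping the keyword rule for a model-driven decision does not change the tool interface.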

Graph-Orchestrated Workflow

LangGraph is used to define the execution flow as explicit nodes:

retrieve relevant documents
generate the final answer

This makes the application easier to extend later with grading, query rewriting, routing, memory, or human-in-the-loop steps.
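
The node-based flow can be pictured as functions that each take a shared state and return an updated one. The sketch below uses plain Python rather than LangGraph so it runs without dependencies; the state fields (question, retrieved_docs, answer) and the run_graph helper are assumptions, not names taken from this repo.

```python
from typing import Callable, TypedDict


class RAGState(TypedDict, total=False):
    question: str
    retrieved_docs: list[str]
    answer: str


Node = Callable[[RAGState], RAGState]


def retriever_node(state: RAGState) -> RAGState:
    """Attach documents to the state (a fixed list here instead of a FAISS lookup)."""
    docs = ["chunk about attention", "chunk about positional encoding"]
    return {**state, "retrieved_docs": docs}


def responder_node(state: RAGState) -> RAGState:
    """Produce an answer from the state (string formatting here instead of an LLM call)."""
    n = len(state.get("retrieved_docs", []))
    return {**state, "answer": f"Answer to '{state['question']}' using {n} chunks"}


def run_graph(nodes: list[Node], state: RAGState) -> RAGState:
    """Run the nodes in order, passing each node's output state to the next."""
    for node in nodes:
        state = node(state)
    return state


result = run_graph([retriever_node, responder_node], {"question": "What is attention?"})
print(result["answer"])
```

Because every node has the same signature, a later step such as grading or query rewriting would be one more function inserted into the list.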

Key Features

Mixed-source ingestion from web URLs, PDF directories, single PDFs, and text files
Recursive chunking for document preprocessing
OpenAI embeddings with FAISS-backed semantic retrieval
LangGraph-based orchestration for a clean RAG execution flow
Single-agent answer generation with tool access to both the retriever and Wikipedia
Streamlit UI with question input, answer display, source previews, and recent search history
Cached system initialization to avoid rebuilding the full pipeline on every interaction
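
The effect of cached initialization can be shown with the standard library. This is a sketch only: functools.lru_cache stands in for whatever caching the app uses (Streamlit offers st.cache_resource for this purpose, though whether the repo uses that exact decorator is an assumption), and initialize_system is a hypothetical name.

```python
from functools import lru_cache

build_count = 0  # counts how many times the expensive setup actually runs


@lru_cache(maxsize=1)
def initialize_system() -> dict:
    """Stand-in for loading documents, embedding chunks, and building the index."""
    global build_count
    build_count += 1
    return {"status": "ready"}


first = initialize_system()
second = initialize_system()  # served from the cache, no rebuild
print(build_count, first is second)
```

Both calls return the same object, and the setup body runs once.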

Benefits

Better-grounded answers than plain chat, because responses can draw on indexed source material
More flexible than basic RAG because the single answering agent can use both indexed documents and Wikipedia when needed
Easy to understand and demo because the architecture is small and the UI is simple
Easy to extend into a more advanced AI product because ingestion, retrieval, state, graph, and UI are already separated into modules
Useful as an MVP foundation for internal search assistants, document copilots, and research exploration tools

How It Works

flowchart TD
    A[Configured URLs + local data directory] --> B[DocumentProcessor]
    B --> C[Chunked documents]
    C --> D[OpenAI Embeddings]
    D --> E[FAISS Vector Store]
    E --> F[Retriever]
    F --> G[LangGraph: retriever node]
    G --> H[LangGraph: responder node]
    H --> I[Tool-enabled agent]
    I --> J[Retriever tool]
    I --> K[Wikipedia tool]
    H --> L[Final answer + source references]
    L --> M[Streamlit UI]

Runtime Flow

streamlit_app.py initializes the system on first load.
DocumentProcessor loads configured URLs and the local data/ directory.
Documents are split into chunks using RecursiveCharacterTextSplitter.
VectorStore embeds the chunks with OpenAIEmbeddings and stores them in FAISS.
GraphBuilder creates a two-step LangGraph workflow:
- retriever
- responder
The responder node builds a tool-using agent with:
- a retriever tool over the FAISS index
- a Wikipedia lookup tool
The Streamlit UI sends a user question into the graph and displays:
- the final answer
- indexed document chunks used for retrieval
- external references captured during tool use
- recent search history

UI Summary

The current UI is intentionally minimal and demo-friendly:

A centered Streamlit page for quick question answering
Automatic startup and document indexing on first load
A single search box and submit button
Answer output shown immediately after processing
A Sources Used expander showing indexed document chunks and external references
A recent-search section showing the last few queries and answers

This makes the project easy to demo without introducing extra UI complexity.

Default Data Sources

Out of the box, the app uses:

two configured blog URLs in src/config/config.py
the local data/ directory as the default PDF source

The repository currently includes sample content such as:

data/Attentionisallyouneed.pdf

To change the default knowledge base, update:

Config.DEFAULT_URLS
Config.DEFAULT_PDF_DIR
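
A minimal sketch of how these two settings could combine into one source list is shown below. The URLs are placeholders rather than the repository's actual defaults, list_sources is a hypothetical helper, and scanning only the top level of the directory for *.pdf files is an assumption about the ingestion behavior.

```python
import tempfile
from pathlib import Path


class Config:
    # Placeholder values for illustration; not the repository's actual defaults.
    DEFAULT_URLS = ["https://example.com/post-one", "https://example.com/post-two"]
    DEFAULT_PDF_DIR = "data"


def list_sources(urls: list[str], pdf_dir: str) -> list[str]:
    """Combine the configured URLs with every PDF found directly in pdf_dir."""
    pdfs = sorted(str(p) for p in Path(pdf_dir).glob("*.pdf"))
    return list(urls) + pdfs


# Demo against a temporary directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "paper.pdf").touch()
    (Path(tmp) / "notes.txt").touch()
    sources = list_sources(Config.DEFAULT_URLS, tmp)
    print(sources)
```

In this sketch the .txt file is ignored, matching the note above that text files are only picked up when added as explicit sources.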

Project Structure

AdRAGSearch/
|-- streamlit_app.py                  # Main Streamlit app and UI flow
|-- main.py                           # Minimal placeholder entry point
|-- data/                             # Sample local documents
|-- src/
|   |-- config/config.py              # Model and source configuration
|   |-- document_ingestion/           # Source loading and chunking
|   |-- vectorstore/                  # Embeddings and FAISS retrieval
|   |-- state/rag_state.py            # Shared graph state
|   |-- nodes/reactnode.py            # Retrieval + agent answer nodes
|   |-- graph_builder/graph_builder.py # LangGraph workflow assembly
|-- pyproject.toml                    # Dependencies and package metadata

Tech Stack

Python 3.13+
Streamlit
LangChain
LangGraph
OpenAI API
FAISS
PyPDF / PyMuPDF
BeautifulSoup / WebBaseLoader
Wikipedia tool integration

Setup

1. Clone The Repository

git clone <your-repo-url>
cd AdRAGSearch

2. Install Dependencies

This repository is currently set up around uv:

uv sync

3. Add Environment Variables

Create a .env file in the project root:

OPENAI_API_KEY=your_openai_api_key
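
To check the file before launching, a small standalone script along these lines may help. It is a sketch with a hand-written parser and hypothetical function names; the app itself may load .env through a different mechanism.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def require_api_key(env: dict[str, str]) -> str:
    """Return the key, or fail early if it is missing or still the placeholder."""
    key = env.get("OPENAI_API_KEY", "")
    if not key or key == "your_openai_api_key":
        raise RuntimeError("Set OPENAI_API_KEY in .env before starting the app")
    return key


env = parse_env("# local secrets\nOPENAI_API_KEY=sk-example\n")
print(require_api_key(env))
```

To validate a real file, pass its contents instead of the inline string, for example parse_env(open(".env").read()).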

4. Run The App

streamlit run streamlit_app.py

Then open the local Streamlit URL in your browser.

Configuration

The most important settings live in src/config/config.py:

LLM_MODEL controls the chat model used for answer generation
CHUNK_SIZE controls document chunk size
CHUNK_OVERLAP controls overlap between chunks
DEFAULT_URLS controls which web sources are indexed at startup
DEFAULT_PDF_DIR controls which local directory is scanned for PDFs
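
To see how CHUNK_SIZE and CHUNK_OVERLAP interact, the sketch below splits text with a fixed sliding window. This is a simplification: RecursiveCharacterTextSplitter also prefers to break on separators such as paragraphs and sentences, which this example does not attempt. The function name and the sample values are illustrative, not the repo's settings.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters, so each new
    window starts chunk_size - chunk_overlap characters after the last.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2))
# prints ['abcd', 'cdef', 'efgh', 'ghij']
```

A larger overlap preserves more context across chunk boundaries but produces more chunks to embed; an overlap equal to or larger than the chunk size would never advance, which is why the sketch rejects it.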

Example Use Cases

Search across research papers and related web content
Build a lightweight internal knowledge assistant
Prototype agentic document Q&A workflows
Demonstrate LangGraph plus Streamlit in a small end-to-end application
Use as a starter repo for more advanced retrieval systems

Current Scope And Limitations

The current implementation is intentionally small and focused.

The vector store is built at startup and kept in memory rather than persisted
Source selection is configuration-driven, not user-upload driven
External source capture currently covers Wikipedia lookups, not general live internet search
There is no authentication, database, background job system, or observability layer
There are no automated tests in the repo yet
streamlit_app.py is the only real entry point; main.py is a minimal placeholder

Why This Repo Is Useful

For anyone who wants a compact GitHub project that demonstrates practical AI product building, this repo already covers several useful ideas:

document ingestion from multiple source types
semantic retrieval with embeddings and FAISS
explicit workflow orchestration with LangGraph
single-agent tool use layered on top of RAG
a working UI that makes the system immediately demoable

That combination makes it a good showcase project for single-agent agentic AI, applied RAG, and end-to-end product prototyping.
