deepan8545/LLM-Observability

Project 2 — LLM Monitoring & Observability

Adds full observability to the Project 1 RAG pipeline:

  • Langfuse tracing — one trace per query with spans, token counts, cost
  • Latency tracking — p50/p95/p99 per pipeline stage (retrieval, reranking, generation)
  • Cost-per-request — dollar cost calculated from token counts, logged to Langfuse
  • LLM-as-judge — async quality scoring (1-5) after every response
  • CI regression gate — GitHub Actions benchmark blocks PRs if p95 or quality drops
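The LLM-as-judge step in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in app/observability/judge.py: the prompt text, `parse_score`, `judge_response`, `answer_then_score`, and the `fake_llm` stand-in are all hypothetical names, and a real setup would call the Anthropic API where `fake_llm` is used.

```python
import asyncio
import re
from typing import Awaitable, Callable, Optional, Tuple

# Hypothetical prompt; the project's actual judge prompt is not shown in this README.
JUDGE_PROMPT = (
    "Rate the answer to the question on a scale of 1 to 5.\n"
    "Question: {question}\nAnswer: {answer}\n"
    "Reply with a single integer."
)

LLMCall = Callable[[str], Awaitable[str]]


def parse_score(reply: str) -> Optional[int]:
    """Return the first standalone digit 1-5 in the judge's reply, or None."""
    match = re.search(r"\b([1-5])\b", reply)
    return int(match.group(1)) if match else None


async def judge_response(question: str, answer: str, call_llm: LLMCall) -> Optional[int]:
    """Ask the judge model for a score and parse it."""
    reply = await call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    return parse_score(reply)


async def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call (assumption, for illustration only)."""
    return "Score: 4"


async def answer_then_score(
    question: str, answer: str, call_llm: LLMCall
) -> Tuple[str, "asyncio.Task[Optional[int]]"]:
    """Return the answer immediately; the score arrives later via the task."""
    task = asyncio.create_task(judge_response(question, answer, call_llm))
    return answer, task
```

Scheduling the judge with `asyncio.create_task` keeps scoring off the request path, so a slow judge call would not add to the latency the user sees.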

Setup

Step 1 — Install dependencies

py -3.11 -m venv venv
source venv/Scripts/activate    # Git Bash on Windows; on macOS/Linux use venv/bin/activate
pip install -r requirements.txt

Step 2 — Start Langfuse locally (Docker required)

Install Docker Desktop from https://docker.com/products/docker-desktop

Then:

docker compose -f docker-compose.langfuse.yml up -d

Open http://localhost:3000 in your browser. Create an account, then go to Settings → API Keys and create a new key pair. Copy the public key and the secret key; they go into .env in the next step.

Step 3 — Configure .env

cp .env.example .env

Fill in your keys:

ANTHROPIC_API_KEY=your_key
NEO4J_URI=bolt://localhost:7687
NEO4J_PASSWORD=your_password
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_HOST=http://localhost:3000
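
As an illustration of how these variables might be consumed, here is a minimal sketch that validates them at startup. It is not the project's app/config.py: the `Settings` class and `load_settings` function are hypothetical, and the sketch assumes the values from .env have already been loaded into the process environment.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Variable names taken from the .env listing above.
REQUIRED = (
    "ANTHROPIC_API_KEY",
    "NEO4J_URI",
    "NEO4J_PASSWORD",
    "LANGFUSE_PUBLIC_KEY",
    "LANGFUSE_SECRET_KEY",
    "LANGFUSE_HOST",
)


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings container; field names mirror the env variables."""
    anthropic_api_key: str
    neo4j_uri: str
    neo4j_password: str
    langfuse_public_key: str
    langfuse_secret_key: str
    langfuse_host: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from env, failing fast if any variable is missing or empty."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return Settings(**{name.lower(): env[name] for name in REQUIRED})
```

Failing at startup with the full list of missing names tends to be easier to debug than a connection error on the first query.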

Step 4 — Copy Project 1 modules

Copy these folders from Project 1 into this project:

app/ingestion/
app/retrieval/
app/generation/
app/evaluation/
data/documents/
eval/golden_set/

Step 5 — Run ingestion

python -m app.ingestion.ingest --docs-dir data/documents

Step 6 — Start the API

uvicorn app.main:app --reload --port 8000

Step 7 — Test a query

curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"question": "What is Reciprocal Rank Fusion?"}'

The response now includes latency_ms and cost_usd fields.
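
For context, a dollar cost like cost_usd can be derived from token counts along these lines. This is a sketch under stated assumptions, not the code in app/observability/cost.py: `ModelPrice`, `EXAMPLE_PRICE`, and the `cost_usd` function are hypothetical, and the prices are placeholders that should be replaced with the current rates for the model you use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    """Price in US dollars per one million tokens."""
    input_per_million: float
    output_per_million: float


# Placeholder prices for illustration only (assumption, not real pricing data).
EXAMPLE_PRICE = ModelPrice(input_per_million=3.0, output_per_million=15.0)


def cost_usd(input_tokens: int, output_tokens: int, price: ModelPrice) -> float:
    """Cost of one request: tokens times per-token price, summed over both directions."""
    total = (
        input_tokens * price.input_per_million
        + output_tokens * price.output_per_million
    )
    return total / 1_000_000
```

Input and output tokens are priced separately because providers typically charge more for generated tokens than for prompt tokens.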

Step 8 — View your traces

Open http://localhost:3000 → Traces. You should see one trace per query with full span breakdown.

Step 9 — Check live metrics

curl http://localhost:8000/metrics

Returns p50/p95/p99 latency per stage and cost statistics.
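
Per-stage percentiles of this kind can be computed with the standard library alone. The sketch below is one possible approach, not the implementation in app/observability/latency.py: the `LatencyTracker` class and its method names are hypothetical, and it keeps all samples in memory, which a long-running service would likely replace with a bounded window.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from statistics import quantiles
from typing import Dict, Iterator, List


class LatencyTracker:
    """Collects latency samples per stage and reports p50/p95/p99 in milliseconds."""

    def __init__(self) -> None:
        self._samples: Dict[str, List[float]] = defaultdict(list)

    def record(self, stage: str, ms: float) -> None:
        self._samples[stage].append(ms)

    @contextmanager
    def track(self, stage: str) -> Iterator[None]:
        """Time the enclosed block and record it under the given stage name."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.record(stage, (time.perf_counter() - start) * 1000.0)

    def summary(self) -> Dict[str, Dict[str, float]]:
        out: Dict[str, Dict[str, float]] = {}
        for stage, samples in self._samples.items():
            if len(samples) < 2:
                continue  # statistics.quantiles needs at least two data points
            # n=100 yields 99 cut points; index k-1 is the k-th percentile.
            cuts = quantiles(samples, n=100, method="inclusive")
            out[stage] = {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
        return out


tracker = LatencyTracker()
with tracker.track("retrieval"):
    time.sleep(0.001)  # stand-in for the real retrieval call
```

Wrapping each of the retrieval, reranking, and generation stages in `tracker.track(...)` would give the per-stage breakdown described above.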

Step 10 — Run the benchmark

python scripts/benchmark.py

Exits 0 (pass) or 1 (fail) based on p95 latency and quality thresholds.
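
The pass/fail decision can be pictured as a small threshold check like the one below. This is a sketch, not the contents of scripts/benchmark.py: `Thresholds`, `gate`, and the example limits are hypothetical, and the real script's thresholds are not stated in this README.

```python
import sys
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Thresholds:
    max_p95_ms: float   # fail if measured p95 latency exceeds this
    min_quality: float  # fail if mean judge score (1-5) falls below this


def gate(p95_ms: float, mean_quality: float, thresholds: Thresholds) -> int:
    """Return a process exit code: 0 if both checks pass, 1 otherwise."""
    failures: List[str] = []
    if p95_ms > thresholds.max_p95_ms:
        failures.append(f"p95 {p95_ms:.0f} ms exceeds {thresholds.max_p95_ms:.0f} ms")
    if mean_quality < thresholds.min_quality:
        failures.append(f"quality {mean_quality:.2f} below {thresholds.min_quality:.2f}")
    for failure in failures:
        print("FAIL:", failure, file=sys.stderr)
    return 1 if failures else 0


# Example limits for illustration only (assumption, not the project's values).
EXAMPLE_THRESHOLDS = Thresholds(max_p95_ms=5000.0, min_quality=4.0)
```

A script would end with something like `sys.exit(gate(measured_p95, measured_quality, EXAMPLE_THRESHOLDS))`; a non-zero exit code is what causes a GitHub Actions step to fail and block the PR.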


GitHub Actions CI

Add these secrets to your repo (Settings → Secrets and variables → Actions):

  • ANTHROPIC_API_KEY
  • NEO4J_URI
  • NEO4J_PASSWORD
  • LANGFUSE_PUBLIC_KEY
  • LANGFUSE_SECRET_KEY
  • LANGFUSE_HOST (use your cloud Langfuse URL if not self-hosted)

Every PR now runs both the RAGAS eval (Project 1) and the latency/quality benchmark (Project 2).


New API Endpoints

Endpoint    Method    Description
/query      POST      Full RAG pipeline with tracing + cost tracking
/metrics    GET       Live p50/p95/p99 latency + cost stats
/health     GET       Liveness + Langfuse enabled status

Project Structure

observability/
├── app/
│   ├── config.py                        # Extended config with Langfuse settings
│   ├── main.py                          # FastAPI with observability instrumentation
│   ├── observability/
│   │   ├── tracer.py                    # Langfuse trace context manager
│   │   ├── latency.py                   # p50/p95/p99 decorator + tracker
│   │   ├── cost.py                      # Cost-per-request calculator
│   │   └── judge.py                     # LLM-as-judge async quality scorer
│   ├── ingestion/                       # (copied from Project 1)
│   ├── retrieval/                       # (copied from Project 1)
│   ├── generation/                      # (copied from Project 1)
│   └── evaluation/                      # (copied from Project 1)
├── scripts/
│   └── benchmark.py                     # CI regression gate
├── .github/workflows/
│   └── benchmark.yml                    # GitHub Actions CI
├── docker-compose.langfuse.yml          # Local Langfuse setup
├── requirements.txt
└── .env.example

About

Production LLM observability layer — Langfuse tracing, p50/p95/p99 latency, cost-per-request, LLM-as-judge quality scoring (4.4/5), and CI benchmark gate on every PR.
