WikiPulse V3 is a pragmatic exploration of real-time streaming system behavior under simulated load.
This repository centers on:
- streaming pipeline exploration
- backpressure and consumer scaling behavior
- PostgreSQL serving both OLTP and analytical endpoints through targeted indexing and materialized-view style rollups
The reliability posture is loss-minimized, at-least-once processing, not absolute delivery guarantees.
Architecture walkthrough placeholder:

- Full walkthrough and scaling demo: INSERT_LINK_LATER
- What This Project Explores
- System Architecture Deep Dive
- Reliability Model
- React Dashboard
- Observability Stack
- Local Docker Fast-Start
- Kubernetes Deployment and Live Demo
- Testing Methodology
- Roadmap
## What This Project Explores

Wikimedia's `recentchange` stream is bursty by nature. Traffic can spike quickly, and streaming systems need to absorb those bursts without stalling consumers or collapsing write latency.
WikiPulse is intentionally built to explore those trade-offs:
- Kafka buffering and partition-level backpressure behavior
- Manual-ack consumer flow with deduplication and dead-letter safety
- PostgreSQL dual role:
  - durable write path for processed edits
  - analytical serving path for `/api/analytics/*` and `/api/v2/analytics/*`
- React + WebSocket user experience under continuous inbound traffic
## System Architecture Deep Dive

- `WikipediaSseClient` consumes the Wikimedia SSE stream and publishes normalized `WikiEditEvent` records to Kafka (`wiki-edits`).
- `WikiEditConsumer` processes records with manual acknowledgments.
- Redis handles (dedup sketched after this list):
  - dedup lock (`edit:processed:<id>`, 24h TTL)
  - velocity signal (`bot:velocity:<user>`, 60s TTL)
- `ProcessedEditService` persists records to PostgreSQL and broadcasts live updates.
- `AnalyticsAggregationService` rolls recent windows into `analytics_rollups` for fast grouped reads.
- Analytics APIs read from PostgreSQL:
  - rollup-driven endpoints (`/api/analytics/*`)
  - indexed direct queries for deep analytics (`/api/v2/analytics/*`)
- Kafka Streams anomaly topology emits `wiki-anomalies`, which the UI receives through `/topic/anomalies`.
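The dedup lock is what makes at-least-once delivery safe downstream. A minimal sketch of that gate, assuming Spring Data Redis; the class and method names here are illustrative, not the repository's actual code:

```java
import java.time.Duration;

import org.springframework.data.redis.core.StringRedisTemplate;

// Hypothetical sketch of the Redis dedup gate described above: a
// SET-if-absent lock keyed by edit id, expiring after 24 hours.
public class EditDedupGate {

    private final StringRedisTemplate redis;

    public EditDedupGate(StringRedisTemplate redis) {
        this.redis = redis;
    }

    /**
     * Returns true exactly once per edit id within the 24h window.
     * SET NX with a TTL is a single atomic Redis command, so two
     * concurrent consumers cannot both win the lock.
     */
    public boolean tryAcquire(String editId) {
        Boolean firstSeen = redis.opsForValue()
                .setIfAbsent("edit:processed:" + editId, "1", Duration.ofHours(24));
        return Boolean.TRUE.equals(firstSeen);
    }
}
```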
```mermaid
flowchart LR
    A[Wikimedia SSE Stream] --> B[Spring WebFlux SSE Client]
    B --> C[Kafka topic: wiki-edits]
    C --> D[Worker Consumer Group]
    D --> E[Redis Dedup and Velocity]
    E --> F[PostgreSQL table: processed_edits]
    F --> G[PostgreSQL table: analytics_rollups]
    G --> H[REST /api/analytics/*]
    F --> I[REST /api/v2/analytics/*]
    C --> J[Kafka Streams Anomaly Topology]
    J --> K[Kafka topic: wiki-anomalies]
    K --> L[Anomaly Consumer]
    F --> M[WebSocket /topic/edits]
    L --> N[WebSocket /topic/anomalies]
    H --> O[React Dashboard]
    I --> O
    M --> O
    N --> O
    P[Actuator and Micrometer] --> Q[Prometheus]
    Q --> R[Grafana]
```
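The anomaly topology in the diagram is, at its core, windowed counting. A hedged sketch of one plausible shape, assuming string serdes, a per-wiki record key, and an illustrative burst threshold; the repository's actual rules may differ:

```java
import java.time.Duration;

import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.TimeWindows;

public class AnomalyTopologySketch {

    // Count edits per key (assumed: wiki name) in 30s tumbling windows and
    // emit a record to wiki-anomalies when a window crosses the threshold.
    static void build(StreamsBuilder builder) {
        KStream<String, String> edits = builder.stream("wiki-edits");

        edits.groupByKey()
             .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofSeconds(30)))
             .count()
             .toStream()
             .filter((windowedKey, count) -> count > 100) // illustrative threshold
             .map((windowedKey, count) ->
                     KeyValue.pair(windowedKey.key(), "edit-burst:" + count))
             .to("wiki-anomalies");
    }
}
```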
PostgreSQL currently handles transactional writes and analytics reads by combining:

- write-optimized event persistence in `processed_edits`
- targeted indexes for filter-heavy analytical endpoints
- materialized-view style rollups (`analytics_rollups`) refreshed on a schedule (see the sketch below)
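As an illustration of the rollup refresh pattern (not the repository's actual `AnalyticsAggregationService`; table and column names, the window size, and the unique constraint are assumptions), a scheduled job can re-aggregate the recent window with a single upsert:

```java
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class RollupRefresher {

    private final JdbcTemplate jdbc;

    public RollupRefresher(JdbcTemplate jdbc) {
        this.jdbc = jdbc;
    }

    // Re-aggregate the last few minutes every minute; grouped reads then hit
    // the small rollup table instead of scanning processed_edits. Assumes a
    // unique constraint on (bucket, wiki) for the upsert.
    @Scheduled(fixedDelay = 60_000)
    public void refreshRecentWindow() {
        jdbc.update("""
            INSERT INTO analytics_rollups (bucket, wiki, edit_count)
            SELECT date_trunc('minute', created_at), wiki, count(*)
            FROM processed_edits
            WHERE created_at >= now() - interval '5 minutes'
            GROUP BY 1, 2
            ON CONFLICT (bucket, wiki)
            DO UPDATE SET edit_count = EXCLUDED.edit_count
            """);
    }
}
```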
## Reliability Model

The worker follows a save-before-ack contract:

1. Read from Kafka
2. Validate and deduplicate
3. Enrich (complexity and bot velocity)
4. Persist to PostgreSQL
5. Publish WebSocket update
6. Acknowledge Kafka offset

This yields loss-minimized, at-least-once processing semantics with replay tolerance. Malformed payloads are isolated through retry plus dead-letter routing (`wiki-edits-dlt`) instead of freezing partitions, as sketched below.
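A minimal Spring Kafka sketch of this contract, assuming manual ack mode is enabled on the listener container factory; the class name and elided helpers are illustrative, not the repository's actual `WikiEditConsumer`:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.TopicPartition;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.listener.DeadLetterPublishingRecoverer;
import org.springframework.kafka.listener.DefaultErrorHandler;
import org.springframework.kafka.support.Acknowledgment;
import org.springframework.util.backoff.FixedBackOff;

public class SaveBeforeAckConsumer {

    // Steps 1-6 of the contract: the offset is committed only after the
    // record is durably persisted, so a crash between persist and ack
    // re-delivers the record, and the Redis dedup lock absorbs the
    // resulting duplicate (at-least-once semantics).
    @KafkaListener(topics = "wiki-edits", groupId = "wikipulse-worker")
    public void onEdit(ConsumerRecord<String, String> record, Acknowledgment ack) {
        // validate, deduplicate, enrich, persist, broadcast (elided)
        ack.acknowledge(); // commit offset last
    }

    // Retry a failing record a few times, then route it to wiki-edits-dlt
    // instead of blocking the partition.
    static DefaultErrorHandler dltErrorHandler(KafkaTemplate<Object, Object> template) {
        return new DefaultErrorHandler(
                new DeadLetterPublishingRecoverer(template,
                        (rec, ex) -> new TopicPartition("wiki-edits-dlt", rec.partition())),
                new FixedBackOff(1000L, 3L));
    }
}
```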
## React Dashboard

The frontend is split into two operator views:

**AnalyticsOverviewTab**
- KPI snapshots, language/namespace/bot distributions, trend chart
- Poll-driven reads from `/api/analytics/*` and `/api/v2/analytics/*`
- Live anomaly ticker from `/topic/anomalies`

**LiveFirehoseTab**
- Initial hydration from `/api/edits/recent`
- Continuous stream updates from `/topic/edits`
- Fixed-size rolling window to cap memory/render overhead
## Observability Stack

Telemetry path:

- Spring Actuator exposes `/actuator/prometheus`
- Prometheus scrapes worker + container telemetry
- Grafana visualizes throughput, lag, CPU, memory, and service health

Representative metrics (registration sketched below):

- `wikipulse_edits_processed_total`
- `wikipulse_bots_detected_total`
- `wikipulse_errors_total`
- `wikipulse_processing_latency`
- `kafka_consumer_fetch_manager_records_lag`
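The `wikipulse_*` series are custom Micrometer meters; `kafka_consumer_fetch_manager_records_lag` comes from the Kafka client metrics binding. A hedged sketch of how the custom meters could be registered; the repository's actual wiring may differ:

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

public class WorkerMetrics {

    private final Counter editsProcessed;
    private final Counter botsDetected;
    private final Counter errors;
    private final Timer processingLatency;

    public WorkerMetrics(MeterRegistry registry) {
        // Micrometer's dotted names are rendered by the Prometheus registry
        // with underscores; counters additionally gain a _total suffix
        // (e.g. wikipulse.edits.processed -> wikipulse_edits_processed_total).
        this.editsProcessed = Counter.builder("wikipulse.edits.processed").register(registry);
        this.botsDetected = Counter.builder("wikipulse.bots.detected").register(registry);
        this.errors = Counter.builder("wikipulse.errors").register(registry);
        this.processingLatency = Timer.builder("wikipulse.processing.latency").register(registry);
    }

    // Times one unit of work and bumps the processed counter on success.
    public void recordEdit(Runnable processing) {
        try {
            processingLatency.record(processing);
            editsProcessed.increment();
        } catch (RuntimeException e) {
            errors.increment();
            throw e;
        }
    }
}
```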
## Local Docker Fast-Start

Requirements:

- Docker Desktop with Compose v2
- Recommended minimum 8 GB RAM
- Ports available: `3000`, `3001`, `5432`, `6379`, `8080`, `9090`, `9092`

Build and start the stack, then verify health:

```bash
docker compose up -d --build
docker compose ps
curl http://localhost:8080/actuator/health
```

Endpoints:

- React Dashboard: http://localhost:3000
- Grafana: http://localhost:3001
- Prometheus: http://localhost:9090
- Worker metrics: http://localhost:8080/actuator/prometheus
## Kubernetes Deployment and Live Demo

Tear down the Compose stack, then start minikube with the metrics server enabled:

```bash
docker compose down
minikube start --cpus=4 --memory=8192
minikube addons enable metrics-server
```

Point Docker at minikube's daemon (PowerShell shown) and build the images:

```bash
minikube -p minikube docker-env --shell powershell | Invoke-Expression
docker build -t wikipulse-ingestor:latest .
docker build -t wikipulse-frontend:latest ./frontend
```

Apply the manifests:

```bash
kubectl apply -f k8s/infrastructure.yaml
kubectl apply -f k8s/configmap.yaml
kubectl apply -f k8s/secret.yaml
kubectl apply -f k8s/deployment.yaml
kubectl apply -f k8s/service.yaml
kubectl apply -f k8s/frontend-deployment.yaml
kubectl apply -f k8s/hpa.yaml
```

Wait for the rollout, then inspect pods and the autoscaler:

```bash
kubectl rollout status deployment/wikipulse-worker
kubectl get pods -o wide
kubectl get hpa wikipulse-worker-hpa
kubectl top pods -l app=wikipulse-worker
```

Open the dashboard and run the load test:

```bash
minikube service wikipulse-frontend
bash scripts/k8s_load_test.sh
```

Watchers:

```bash
kubectl get hpa wikipulse-worker-hpa -w
kubectl get pods -l app=wikipulse-worker -w
kubectl describe hpa wikipulse-worker-hpa
```

Teardown:

```bash
kubectl delete -f k8s/hpa.yaml --ignore-not-found
kubectl delete -f k8s/frontend-deployment.yaml --ignore-not-found
kubectl delete -f k8s/service.yaml --ignore-not-found
kubectl delete -f k8s/deployment.yaml --ignore-not-found
kubectl delete -f k8s/secret.yaml --ignore-not-found
kubectl delete -f k8s/configmap.yaml --ignore-not-found
kubectl delete -f k8s/infrastructure.yaml --ignore-not-found
minikube stop
```

## Testing Methodology

Testing focuses on behavior under realistic infrastructure, not mocked-only happy paths.
- Unit tests for deterministic utility and rule logic
- Integration tests with Testcontainers for Kafka, PostgreSQL, and Redis (setup sketched after this list)
- API and metrics assertions for end-to-end confidence
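A hedged sketch of the Testcontainers setup style, assuming JUnit 5 and Spring Boot 3; the class name, image tags, and property keys are illustrative rather than copied from the repository:

```java
import org.junit.jupiter.api.Test;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.test.context.DynamicPropertyRegistry;
import org.springframework.test.context.DynamicPropertySource;
import org.testcontainers.containers.GenericContainer;
import org.testcontainers.containers.KafkaContainer;
import org.testcontainers.containers.PostgreSQLContainer;
import org.testcontainers.junit.jupiter.Container;
import org.testcontainers.junit.jupiter.Testcontainers;
import org.testcontainers.utility.DockerImageName;

@Testcontainers
@SpringBootTest
class PipelineIntegrationTest {

    @Container
    static KafkaContainer kafka =
            new KafkaContainer(DockerImageName.parse("confluentinc/cp-kafka:7.6.0"));

    @Container
    static PostgreSQLContainer<?> postgres =
            new PostgreSQLContainer<>("postgres:16-alpine");

    @Container
    static GenericContainer<?> redis =
            new GenericContainer<>("redis:7-alpine").withExposedPorts(6379);

    // Point Spring at the throwaway containers instead of local services.
    @DynamicPropertySource
    static void props(DynamicPropertyRegistry registry) {
        registry.add("spring.kafka.bootstrap-servers", kafka::getBootstrapServers);
        registry.add("spring.datasource.url", postgres::getJdbcUrl);
        registry.add("spring.datasource.username", postgres::getUsername);
        registry.add("spring.datasource.password", postgres::getPassword);
        registry.add("spring.data.redis.host", redis::getHost);
        registry.add("spring.data.redis.port", () -> redis.getMappedPort(6379));
    }

    @Test
    void contextLoadsAgainstRealInfrastructure() {
        // End-to-end assertions (publish to wiki-edits, await the row in
        // processed_edits, check metrics) would go here.
    }
}
```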
Run:

```bash
.\mvnw clean compile
.\mvnw test
```

## Roadmap

- Add schema governance and evolution checks for event contracts
- Expand anomaly rules with multi-window baselines and confidence scoring
- Add deterministic replay/load harness for repeatable scaling experiments
- Package Kubernetes deployment as Helm chart overlays
WikiPulse V3 is a working sandbox for streaming architecture trade-offs, not a claim of perfect reliability. It is designed to make operational behavior visible: queue pressure, consumer lag, scaling response, and analytics query cost.
For full architecture rationale and explicit trade-off notes, see ARCHITECTURE.md.