A backend that turns public GitHub event data into analytical insights about repositories, organizations, and contributors.
A practice build in public: production-grade backend patterns — ingestion, analytics storage, minimal API — exercised on real GitHub event data. Early stage; the Status section is the source of truth for what works.
Events are pulled from GH Archive hourly JSON files, parsed, and stored in ClickHouse for analytical queries. A small HTTP API exposes aggregates, with a React dashboard on top.
GH Archive hourly files
│
▼
ingestor ──► ClickHouse ──► api ──► web dashboard
Early stage. What works today:
- GH Archive download with atomic temp-file-then-rename, retries, and GOAWAY handling.
- Event parsing with legacy-format tolerance (old slash-format `created_at`, numeric `public`).
- Backfill: sequential historical ingest with a resumable cursor; hourly mode polls for the latest published file.
- Raw events inserted into ClickHouse.
- `GET /summary`: global aggregate over all ingested events (row count, unique events/repos/actors, time range).
- Minimal React/Vite dashboard.
Not yet built: typed per-event-type parsing, scoped per-repo/org endpoints, queue + workers, Postgres metadata, observability, GitHub API enrichment. See Roadmap.
- Go — ingestor and HTTP API
- ClickHouse — analytics storage (migrations via golang-migrate)
- React + TypeScript + Vite — dashboard (`web/`)
- Go (see `go.mod` for the version)
- A running ClickHouse instance
- Node.js (for the web UI)
Copy the example env and adjust:
cp .env.example .env

| Variable | Default | Purpose |
|---|---|---|
| `LISTEN_ADDR` | `:8800` | API listen address |
| `CLICKHOUSE_HOST` | `localhost` | ClickHouse host |
| `CLICKHOUSE_PORT` | `9000` | Native protocol port |
| `CLICKHOUSE_USER` | `default` | ClickHouse user |
| `CLICKHOUSE_PASSWORD` | (empty) | ClickHouse password |
| `CLICKHOUSE_DATABASE` | `github_intel` | Database for reads/writes |
| `CLICKHOUSE_MIGRATIONS_DATABASE` | `default` | Where `schema_migrations` is stored |
make run-migrate-clickhouse # apply all (up)
make run-migrate-clickhouse MIGRATE_ARGS=down

# Ingest a specific local file
make run-ingestor INGESTOR_ARGS=data/2015-01-01-0.json.gz
# Poll and ingest the latest published hourly file
make run-ingestor-hourly
# Sequential historical backfill (resumable)
make run-ingestor-backfill BACKFILL_ARGS='-backfill-from=2015-01-01 -backfill-until=2015-01-02'

make run-api # http://localhost:8800
make dev-api # with file-watch reload

Endpoints:

- `GET /summary`: aggregate statistics
- `GET /healthz`: health check
make web-install
make web-dev # Vite dev server, proxies /summary and /healthz to :8800
make web-build # production build into web/dist

make build # compile binaries into bin/
make test # go test -short ./...
make test-race # go test -race ./...
make vet
make lint # requires golangci-lint
make fmt
make tidy

Design notes live in `architecture/`.
- Typed parsing for `PushEvent` / `PullRequestEvent` / `IssuesEvent` / etc.
- Scoped endpoints: `/repos/:owner/:repo/summary`, timeseries, contributors, PR latency; `/orgs/:org/velocity`; `/trending/repos`; `/languages/trends`.
- Queue + worker pools, batched inserts, import-job tracking in Postgres, idempotency, graceful shutdown.
- Observability: structured logs, Prometheus metrics, OpenTelemetry traces, pprof, health/readiness.
- GitHub REST API enrichment (languages, topics, stars) with rate limiting.
- Performance work: profiling, schema/query tuning, throughput measurement.
Built on GH Archive, which publishes the public GitHub event stream as hourly JSON files. No scraping.