xForge V2 — Provider-Agnostic Football Analytics Pipeline

End-to-end football analytics pipeline engineered for production. Provider-agnostic ingestion via a Dart microservice (StatsBomb + Opta adapter pattern), a four-layer dbt medallion architecture with a universal 105×68 m coordinate system, calibrated XGBoost models for xG/xP, value-iteration xT, K-Means set-piece clustering, and a fully automated BI visualisation layer — all validated through GitHub Actions CI on 147 World Cup 2022 matches (524,457 events).

Analytics Visualisations

Five portfolio-grade charts are generated automatically on every CI run and uploaded as a 90-day artifact. The images below are sourced from the latest successful pipeline run.

xT Surface Heatmap	Shot Map with xG Bubbles

Team xG vs Goals Scored	Set-Piece Delivery Clusters

Player xP Ranking (Top 20)

Charts are generated by scripts/visualise.py using mplsoccer and matplotlib on live pipeline data. The latest PNG artifacts are available under Actions → xForge V2 Pipeline → analytics-visualisations-{run_id}.

Architecture

StatsBomb Open Data / Opta F24
         │
         ▼
 Dart Ingestion Service          HTTP :8090 · Adapter Pattern
 ├── StatsBombAdapter            120×80 → UnifiedEvent
 └── OptaAdapter                 100×100 → UnifiedEvent
         │
         ▼
  PostgreSQL 15 · fact_events    LIST-partitioned by competition_id
         │
         ▼
 dbt Medallion Pipeline
 ├── Bronze                      Type-cast pass-through, provider tests
 ├── Silver                      105×68 m normalisation, spatial range tests
 ├── Gold                        Player / team aggregations with ML values
 └── Marts                       BI-ready tables — player metrics, leaderboards
         │
         ├── ML Models (run before Gold/Marts — write back to fact_events)
         │   ├── xG Model        XGBoost · calibrated goal probability per shot
         │   ├── xT Model        Value iteration · 16×12 grid · 192 cells
         │   └── xP Model        XGBoost · pass completion probability (AUC 0.897)
         │
         ├── Materialized Views  mv_team_xg · mv_shot_locations (zero-downtime refresh)
         ├── BI Visualisations   5 PNG charts · mplsoccer · 90-day CI artifacts
         ├── PDF Match Report    5-page mplsoccer report per match
         └── SportsCode XML      Top-25 xT events · Hudl-compatible

flowchart TD
    SB["StatsBomb Open Data\n147 matches · 524,457 events"]
    OA["Opta F24 JSON\nvia OptaAdapter"]

    subgraph DART["Dart Ingestion Service — :8090"]
        SA["StatsBombAdapter"]
        OPT["OptaAdapter"]
        PW["PostgresWriter\nON CONFLICT DO NOTHING"]
    end

    subgraph PG["PostgreSQL 15"]
        FE["fact_events\nLIST-partitioned · competition_id\nxt_value · xp_value · xg_value"]
        DIM["dim_matches · dim_players\ndim_teams · dim_competitions"]
        AUX["xt_surface 192 cells\nset_piece_clusters\nmodel_registry\nmv_team_xg"]
    end

    subgraph DBT["dbt Medallion"]
        BR["Bronze\ntype-cast pass-through"]
        SL["Silver\n105x68 m normalisation\nspatial range tests"]
        GD["Gold\nplayer + team aggregations"]
        MT["Marts\nBI-ready tables"]
    end

    subgraph ML["ML Pipeline"]
        XG["xG Model\nXGBoost · calibrated"]
        XT["xT Model\nValue iteration · 16x12"]
        XP["xP Model\nXGBoost · AUC 0.897"]
        KM["K-Means\nSet-piece clusters k=6"]
    end

    subgraph SERVE["Serving Layer"]
        VIZ["BI Visualisations\n5 PNG charts · CI artifacts"]
        PDF["PDF Match Report\n5-page mplsoccer"]
        XML["SportsCode XML\n25 top-xT events · Hudl"]
        SUP["Apache Superset\n7 charts · dashboard"]
    end

    SB --> SA
    OA --> OPT
    SA & OPT --> PW --> FE & DIM
    DIM & FE --> BR --> SL
    SL --> XG & XT & XP & KM
    XG & XT & XP & KM --> FE
    FE --> GD --> MT
    MT & AUX --> VIZ & PDF & XML & SUP

What V2 Adds Over V1

Capability	V1	V2
Data providers	StatsBomb only	StatsBomb + Opta (adapter pattern)
Ingestion language	Python	Dart microservice, HTTP API
Coordinate system	StatsBomb 120×80 (raw)	Universal 105×68 m — single source of truth
Data layer	Single staging schema	Bronze / Silver / Gold / Marts medallion
Coordinate tests	None	dbt spatial range tests on every Silver run
xG model	Post-hoc rescaled	CalibratedClassifierCV (Platt scaling)
BI output	Superset (local only)	5 static PNGs via CI — shareable artifacts
CI coverage	Lint + unit tests	Full end-to-end pipeline on 524,457 events
Data volume	Single match	147 WC 2022 matches, bulk incremental loader

ML Models

Model	Algorithm	Input	Result
xT Surface	Value iteration (15 passes)	Silver events · 16×12 grid	192 cells, max xT = 0.298
xG Classifier	XGBoost + Platt scaling	Silver shots — distance, angle, pressure	Calibrated goal probability
xP Classifier	XGBoost	Silver passes — start/end coords, distance, pressure	AUC 0.897 · log-loss 0.293 · 118,187 training passes
Set-piece Clustering	K-Means k=6	Corner + shot locations (105×68 m)	12 cluster centroids (6 corner zones, 6 shot zones)
Press Trigger Detector	Rule-based sequence	Ball recovery + 3 defensive actions / 5 s	165 press triggers detected (WC 2022)
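
The xT Surface row above refers to value iteration over a grid of pitch zones. As a rough illustration only, the sketch below assumes the commonly used formulation xT(z) = s_z · g_z + m_z · Σ T(z→k) · xT(k), where s, g and m are the shoot, goal and move probabilities of a zone. The function name, the three-zone toy pitch and every probability are invented for the example and are not taken from scripts/xt_model.py.

```python
def xt_value_iteration(shoot_prob, goal_prob, move_prob, transition, passes=15):
    """Iterate xT(z) = s_z * g_z + m_z * sum_k T[z][k] * xT(k) for a fixed number of passes."""
    n = len(shoot_prob)
    xt = [0.0] * n  # start from zero threat everywhere
    for _ in range(passes):
        xt = [
            shoot_prob[z] * goal_prob[z]
            + move_prob[z] * sum(transition[z][k] * xt[k] for k in range(n))
            for z in range(n)
        ]
    return xt


# Toy 3-zone pitch (defence, midfield, attack); all values are made up.
shoot = [0.0, 0.1, 0.4]
goal = [0.0, 0.05, 0.2]
move = [1.0, 0.9, 0.6]
# Rows may sum to less than 1; the remainder stands for lost possession.
T = [
    [0.5, 0.3, 0.0],
    [0.2, 0.5, 0.3],
    [0.0, 0.4, 0.6],
]
surface = xt_value_iteration(shoot, goal, move, T)
```

On the real 16×12 grid the same update would run over 192 zones, with the probabilities estimated from Silver events rather than hard-coded.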

xP engineering note: the xG, xT and xP models all read from the Silver layer and write xg_value / xt_value / xp_value back to fact_events before dbt Gold and Marts materialise, so mart_player_metrics.avg_xp is populated on every run.

xG calibration: CalibratedClassifierCV(XGBClassifier(), cv=5, method='sigmoid'). Platt scaling maps the raw classifier scores to probabilities, so a predicted xG of 0.30 should correspond to roughly a 30% conversion rate rather than serving only as a ranking score. Every training run is gated on two checks: a Brier score and an expected-vs-actual goals comparison.
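
The two gating checks can be illustrated in a few lines of plain Python. This is a minimal sketch, not the logic in scripts/xg_model.py: the function names and both thresholds (max_brier, max_goal_ratio_error) are hypothetical values chosen for the example.

```python
def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if not probs:
        raise ValueError("at least one shot is required")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def calibration_gate(probs, outcomes, max_brier=0.10, max_goal_ratio_error=0.10):
    """Return True when both checks pass. The thresholds are hypothetical."""
    expected_goals = sum(probs)    # total xG over the evaluation shots
    actual_goals = sum(outcomes)   # goals actually scored
    ratio_error = abs(expected_goals - actual_goals) / max(actual_goals, 1)
    return brier_score(probs, outcomes) <= max_brier and ratio_error <= max_goal_ratio_error


# Ten shots, one goal. A well-calibrated model predicts about one goal in total.
outcomes = [1] + [0] * 9
well_calibrated = calibration_gate([0.1] * 10, outcomes)
overconfident = calibration_gate([0.3] * 10, outcomes)
```

In this toy case the first call passes and the second fails, because the overconfident model expects three goals where one was scored.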

Coordinate normalisation: the dbt macro normalise_x(col, coord_system) and its counterpart normalise_y, both defined in coord_normalise.sql, convert any provider's raw coordinates to the 105×68 m universal pitch. Adding a third provider requires one new adapter file and one macro branch, with no changes to downstream models.
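
The rescaling itself is simple arithmetic. The sketch below mirrors it in Python, assuming a plain linear scale from each provider's raw pitch (120×80 for StatsBomb, 100×100 for Opta) to 105×68 m. The function normalise_xy and the PROVIDER_DIMS table are hypothetical names for this example, and any difference in axis orientation between providers is deliberately ignored here.

```python
PITCH_LENGTH_M = 105.0
PITCH_WIDTH_M = 68.0

# Raw pitch dimensions per provider, as listed in the architecture section.
PROVIDER_DIMS = {
    "statsbomb": (120.0, 80.0),
    "opta": (100.0, 100.0),
}


def normalise_xy(x, y, coord_system):
    """Linearly rescale provider coordinates to the 105x68 m pitch."""
    try:
        raw_length, raw_width = PROVIDER_DIMS[coord_system]
    except KeyError:
        raise ValueError(f"unknown coord_system: {coord_system!r}") from None
    return x / raw_length * PITCH_LENGTH_M, y / raw_width * PITCH_WIDTH_M


# The centre spot of either provider maps to the same point in metres.
centre_statsbomb = normalise_xy(60, 40, "statsbomb")
centre_opta = normalise_xy(50, 50, "opta")
```

Supporting a third provider in this sketch would mean adding one entry to PROVIDER_DIMS, which is the Python analogue of the extra macro branch.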

Business Use Cases

Pass quality scouting: xP isolates pass difficulty from completion rate. A midfielder completing high-difficulty passes (low xP) in high-threat zones (high xT) appears on no traditional completion-rate leaderboard — but surfaces immediately in mart_player_metrics.

Opponent set-piece analysis: K-Means clustering of 1,384 corners and 4,904 shots across 147 WC matches reveals six repeatable delivery zones per event type. Coaching staff receive cluster centroids and member counts without manual video tagging.

Video integration: the pipeline generates a SportsCode/Hudl-compatible XML file containing the 25 highest-xT events per match. Analysts open the file directly in Hudl Sportscode — no manual timestamp entry.
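
As an illustration of the export step, the sketch below picks the highest-xT events and writes them as XML instances. The element names (file, ALL_INSTANCES, instance, ID, start, end, code) follow a commonly seen Sportscode XML layout but are an assumption here, as are the event keys (seconds, xt_value, label), the function name and the 5-second clip padding. The project's scripts/xml_generator.py may differ on all of these.

```python
import heapq
import xml.etree.ElementTree as ET


def build_top_xt_xml(events, top_n=25, padding_s=5.0):
    """Return an XML string holding the top_n events ranked by xt_value."""
    top = heapq.nlargest(top_n, events, key=lambda e: e["xt_value"])
    root = ET.Element("file")
    instances = ET.SubElement(root, "ALL_INSTANCES")
    for idx, event in enumerate(top, start=1):
        inst = ET.SubElement(instances, "instance")
        ET.SubElement(inst, "ID").text = str(idx)
        # Pad the clip around the event and never start before kick-off.
        ET.SubElement(inst, "start").text = f"{max(event['seconds'] - padding_s, 0.0):.2f}"
        ET.SubElement(inst, "end").text = f"{event['seconds'] + padding_s:.2f}"
        ET.SubElement(inst, "code").text = event["label"]
    return ET.tostring(root, encoding="unicode")


# Made-up events for demonstration.
events = [
    {"seconds": 12.0, "xt_value": 0.02, "label": "Pass"},
    {"seconds": 3.0, "xt_value": 0.21, "label": "Carry"},
    {"seconds": 95.5, "xt_value": 0.11, "label": "Pass"},
]
xml_text = build_top_xt_xml(events, top_n=2)
```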

Striker recruitment: finishing_quality = goals − total_xG separates clinical finishers from shot-volume players. Available in mart_player_metrics; directly queryable in Superset without custom SQL.
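
The finishing_quality formula can be worked through on a handful of shots. This is a minimal sketch with invented data; the shot keys (player, xg_value, is_goal) and the function name are assumptions for the example, not the mart's actual column names.

```python
from collections import defaultdict


def finishing_quality(shots):
    """Sum goals and xG per player and return goals minus total xG."""
    goals = defaultdict(int)
    total_xg = defaultdict(float)
    for shot in shots:
        goals[shot["player"]] += int(shot["is_goal"])
        total_xg[shot["player"]] += shot["xg_value"]
    return {player: goals[player] - total_xg[player] for player in total_xg}


# Made-up shots: player A scores 2 from 1.0 xG, player B scores 1 from 1.5 xG.
shots = [
    {"player": "A", "xg_value": 0.25, "is_goal": True},
    {"player": "A", "xg_value": 0.25, "is_goal": True},
    {"player": "A", "xg_value": 0.5, "is_goal": False},
    {"player": "B", "xg_value": 0.5, "is_goal": False},
    {"player": "B", "xg_value": 0.5, "is_goal": False},
    {"player": "B", "xg_value": 0.25, "is_goal": True},
    {"player": "B", "xg_value": 0.25, "is_goal": False},
]
quality = finishing_quality(shots)
```

Player A finishes one goal above expectation while player B, despite taking more shots, finishes half a goal below it.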

Data at Scale — WC 2022 (Competition 43)

Metric	Value
Matches ingested	147
Total events	524,457
xP training passes	118,187
xP model AUC	0.897
Set-piece cluster centroids	12
Press triggers detected	165
xT grid cells	192 (16×12, 105×68 m)
CI pipeline duration	~11 min (full end-to-end)
BI chart artifacts	5 PNGs · 90-day retention

CI/CD Pipeline

Every push to main runs the full pipeline against a live PostgreSQL instance:

checkout
  → install Python 3.11 deps
  → init schema (01_schema.sql + 02_v2_migration.sql)
  → bulk ingest 147 matches (ingest_season=true)
  → build + start Dart ingestion service
  → ingest match events via Dart HTTP API
  → dbt bronze  (run + test)
  → dbt silver  (run + test — spatial range gate)
  → xG model    (XGBoost · calibrated)
  → xT model    (value iteration)
  → xP model    (XGBoost · writes xp_value to fact_events)
  → dbt gold    (run + test)
  → dbt marts   (run + test — avg_xp now populated)
  → refresh materialized views
  → export SportsCode XML
  → tactical models (K-Means · press trigger)
  → generate PDF match report
  → generate analytics visualisations (5 PNG charts)
  → upload artifacts (XML · PDF · 5 PNGs · dbt logs)
  → pipeline summary (GitHub Step Summary)

The Silver spatial range tests (location_x ∈ [0, 105], location_y ∈ [0, 68]) act as a hard gate — if any event falls outside the universal pitch boundary after coordinate normalisation, the pipeline fails before ML models train.
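
The same bounds check can be expressed outside dbt. The sketch below is a Python analogue of the gate, assuming events are dictionaries with non-null location_x and location_y values already normalised to metres. Both function names are hypothetical.

```python
def out_of_bounds(events, max_x=105.0, max_y=68.0):
    """Return the events whose normalised location falls outside the pitch."""
    return [
        e for e in events
        if not (0.0 <= e["location_x"] <= max_x and 0.0 <= e["location_y"] <= max_y)
    ]


def spatial_gate(events):
    """Raise if any event is off the pitch, so later steps never run on bad data."""
    bad = out_of_bounds(events)
    if bad:
        raise ValueError(f"{len(bad)} event(s) outside the 105x68 m pitch")


# One valid event and one that overshoots the goal line by 10 cm.
sample = [
    {"location_x": 52.5, "location_y": 34.0},
    {"location_x": 105.1, "location_y": 10.0},
]
violations = out_of_bounds(sample)
```

Because the comparison is written as a negated range check, a NaN coordinate is also reported as a violation.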

Project Structure

xforge/
├── .github/
│   └── workflows/
│       └── pipeline_v2.yml         # Full end-to-end CI pipeline
├── config/
│   └── superset_config.py          # Superset secret key, DB URI, feature flags
├── dart_ingestion/
│   ├── Dockerfile                  # Multi-stage AOT compile — ~10 MB image
│   ├── pubspec.yaml
│   └── lib/
│       ├── main.dart               # Shelf HTTP server — POST /ingest, GET /health
│       ├── models/unified_event.dart
│       ├── adapters/
│       │   ├── adapter_interface.dart
│       │   ├── statsbomb_adapter.dart   # 120×80 → UnifiedEvent
│       │   └── opta_adapter.dart        # 100×100 → UnifiedEvent
│       └── db/postgres_writer.dart      # Bulk INSERT ON CONFLICT DO NOTHING
├── dbt_project/
│   ├── macros/
│   │   └── coord_normalise.sql     # normalise_x / normalise_y — single normalisation point
│   └── models/
│       ├── bronze/                 # Type-cast pass-through + provider tests
│       ├── silver/                 # 105×68 m normalisation + spatial range tests
│       ├── gold/                   # Player / team aggregations
│       └── marts/                  # mart_player_metrics · mart_team_summary
│                                   # mart_match_summary · mart_competition_leaderboard
├── docs/
│   └── screenshots/                # CI-generated PNGs committed from latest artifact
├── scripts/
│   ├── init/
│   │   ├── 01_schema.sql           # Tables, partitions, indexes
│   │   └── 02_v2_migration.sql     # V2 columns, materialized views
│   ├── massive_ingestion.py        # Incremental bulk StatsBomb loader
│   ├── xg_model.py                 # XGBoost xG + Platt calibration
│   ├── xt_model.py                 # Value-iteration xT surface
│   ├── predictive_models.py        # XGBoost xP + chunked prediction write-back
│   ├── tactical_models.py          # K-Means clustering + press trigger detection
│   ├── visualise.py                # 5 PNG BI charts — mplsoccer + matplotlib
│   ├── report_generator.py         # 5-page PDF per match
│   ├── xml_generator.py            # SportsCode/Hudl XML — top-25 xT events
│   ├── refresh_materialized_views.py
│   └── setup_superset.py           # Autonomous Superset bootstrap — 7 charts + dashboard
├── docker-compose.yml              # Local: Postgres + Superset + pgAdmin
├── requirements.txt
└── LICENSE

Getting Started

Local environment

Prerequisites: Docker >= 24, Docker Compose v2, 4 GB RAM minimum.

git clone https://github.com/bbasaranemir/xforge.git
cd xforge
docker compose up -d postgres
psql -h localhost -U analytics -d football_db -f scripts/init/01_schema.sql
psql -h localhost -U analytics -d football_db -f scripts/init/02_v2_migration.sql

Run the Dart ingestion service:

docker compose up -d dart_ingestion
curl http://localhost:8090/health
# → {"status":"ok"}

curl -X POST http://localhost:8090/ingest \
  -H "Content-Type: application/json" \
  -d '{"provider":"statsbomb","match_id":3869685,"competition_id":43}'
# → {"written":3401}
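
The same ingest call can be scripted from Python with only the standard library. This is a sketch under the assumption that the service accepts exactly the JSON body shown in the curl example; the helper names build_ingest_request and ingest_match are hypothetical.

```python
import json
import urllib.request


def build_ingest_request(match_id, competition_id, provider="statsbomb",
                         base_url="http://localhost:8090"):
    """Build the POST /ingest request without sending it."""
    payload = {
        "provider": provider,
        "match_id": match_id,
        "competition_id": competition_id,
    }
    return urllib.request.Request(
        f"{base_url}/ingest",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ingest_match(match_id, competition_id, timeout=60):
    """Send the request and return the decoded JSON response."""
    request = build_ingest_request(match_id, competition_id)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)


request = build_ingest_request(3869685, 43)
```

Calling ingest_match(3869685, 43) requires the Dart service to be running on port 8090.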

Run the dbt medallion:

cd dbt_project
dbt deps
dbt run  --select bronze silver gold marts
dbt test --select silver   # spatial range gate

Launch Superset with pre-built dashboards:

docker compose up -d superset
python scripts/setup_superset.py
# → 7 charts and Matchday Analytics dashboard bootstrapped at http://localhost:8088

CI trigger

Navigate to Actions → xForge V2 Pipeline → Run workflow. Set ingest_season: true to load all 147 WC 2022 matches before running models. The full pipeline completes in approximately 11 minutes; five PNG charts are uploaded as a 90-day artifact under analytics-visualisations-{run_id}.

To update the screenshots in this README after a successful run:

# Download analytics-visualisations-{run_id}.zip from the Actions artifact panel,
# extract to docs/screenshots/, then:
git add docs/screenshots/*.png
git commit -m "docs: update BI visualisation screenshots from CI run {run_id}"
git push

Database Schema

dim_competitions ─┐
dim_seasons       ├──► fact_events   (LIST-partitioned by competition_id)
dim_matches       │         │
dim_players       │         ├── xt_value    (value-iteration xT)
dim_teams ────────┘         ├── xp_value    (XGBoost pass completion probability)
                            └── xg_value    (XGBoost calibrated goal probability)
                                 │
                    ┌────────────┼────────────────────┐
                    ▼            ▼                    ▼
              xt_surface    set_piece_clusters    model_registry
              (192 cells)   (12 centroids)        (AUC, log-loss)

                            mv_team_xg            (REFRESH CONCURRENTLY)
                            mv_shot_locations

                            analytics_analytics_marts.*
                            ├── mart_player_metrics        (avg_xp · finishing_quality)
                            ├── mart_team_summary          (xG · shot counts)
                            ├── mart_match_summary         (per-match aggregates)
                            └── mart_competition_leaderboard (xT per match rank)

Partitioning: fact_events uses PostgreSQL LIST partitioning on competition_id. The CI environment adds --shm-size 256m to the Postgres service container to support large cross-partition joins during mart materialisation.

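Materialised views: REFRESH MATERIALIZED VIEW CONCURRENTLY is used throughout, so Superset and other BI tools can keep reading the views while they refresh. PostgreSQL requires a unique index on a materialised view before it can be refreshed concurrently.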

Data Source

StatsBomb Open Data — used under the StatsBomb Open Data Licence. This project is not affiliated with or endorsed by StatsBomb.

License

MIT
