A self-directed curriculum for research in Topological Data Analysis (TDA) and Persistent Homology, built project-first with comprehension checkpoints after each stage.
Shape is a language. This is how I'm learning to read it.
TDA extracts shape-based features from data using tools from algebraic topology. Instead of asking "how close are points?" it asks "what loops, voids, and connected components appear and persist as we grow a neighborhood radius?"
This repo documents a five-stage learning roadmap — from pipeline basics through real-data experiments and ML integration.
| Stage | Topic | Status |
|---|---|---|
| 01 | Pipeline Fluency — Noisy circle, Ripser, H₀/H₁, barcodes, persistence diagrams | ✅ Complete |
| 02 | Complex Comparison — Vietoris-Rips vs Alpha complexes | ✅ Complete |
| 03 | Geometric Interpretation — Torus, MNIST digits (cubical persistence), Betti numbers | ⌛ In Progress |
| 04 | Real Data — RCSB PDB proteins, OpenTopography terrain | 🔲 Upcoming |
| 05 | ML Integration — Persistence images as classifier features | 🔲 Upcoming |
tda-learning/
├── README.md
├── requirements.txt
├── tda/ # Shared TDA utilities package
│ ├── __init__.py # Re-exports full public API
│ ├── tda_data.py # Pure computation layer (no matplotlib)
│ ├── tda_viz.py # Atomic ax-level renderers
│ └── tda_figures.py # GridSpec figure orchestrators
├── portfolio/
│ └── tda-portfolio.html # Visual project tracker
├── stage-01-pipeline-fluency/
│ └── Noisy_Circle_Comparison.ipynb
├── stage-02-complex-comparison/
├── stage-03-geometric-interpretation/
├── stage-04-real-data/
└── stage-05-ml-integration/
The tda/ package is a three-layer library extracted from the notebooks. All stage notebooks import from it rather than re-implementing computation or plotting inline. See tda/README.md for a full API reference.
Core question: Can persistent homology recover the circular structure of a point cloud under increasing noise?
What it does:
- Generates noisy samples from $S^1$ (a circle in 2D)
- Runs Vietoris-Rips filtration via Ripser
- Computes $H_0$ (connected components) and $H_1$ (loops)
- Visualizes persistence diagrams and barcodes
- Sweeps noise levels to find the persistence threshold at which the $H_1$ signal degrades
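The steps above can be sketched without any TDA library. This is a minimal pure-Python illustration, not the notebook's code: `noisy_circle` and `rips_h1` are hypothetical helper names, and the textbook column reduction over Z/2 used here is far slower than Ripser, so it only suits a few dozen points.

```python
import itertools
import math
import random


def noisy_circle(n, noise, seed=0):
    """Sample n points from the unit circle and add Gaussian noise (hypothetical helper)."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        t = rng.uniform(0.0, 2.0 * math.pi)
        points.append((math.cos(t) + rng.gauss(0.0, noise),
                       math.sin(t) + rng.gauss(0.0, noise)))
    return points


def rips_h1(points, tol=1e-9):
    """H1 (birth, death) pairs of the Rips filtration, most persistent first.

    Filtration values are edge lengths; only simplices up to triangles are built.
    """
    n = len(points)
    dist = {(i, j): math.dist(points[i], points[j])
            for i, j in itertools.combinations(range(n), 2)}

    # Each entry is (filtration value, dimension, vertices).
    # Sorting this way always places a face before its cofaces.
    simplices = [(0.0, 0, (i,)) for i in range(n)]
    simplices += [(dist[e], 1, e) for e in dist]
    simplices += [(max(dist[i, j], dist[i, k], dist[j, k]), 2, (i, j, k))
                  for i, j, k in itertools.combinations(range(n), 3)]
    simplices.sort()
    index = {verts: idx for idx, (_, _, verts) in enumerate(simplices)}

    low_to_column = {}  # lowest row index -> reduced column owning it
    pairs = []
    for value, dim, verts in simplices:
        if dim == 0:
            continue
        # Boundary over Z/2: the set of codimension-1 faces.
        column = {index[face] for face in itertools.combinations(verts, dim)}
        while column and max(column) in low_to_column:
            column ^= low_to_column[max(column)]
        if column:
            low = max(column)
            low_to_column[low] = column
            birth = simplices[low][0]
            # A triangle paired with an edge kills a loop born at that edge.
            if dim == 2 and value - birth > tol:
                pairs.append((birth, value))
    return sorted(pairs, key=lambda bar: bar[0] - bar[1])


if __name__ == "__main__":
    bars = rips_h1(noisy_circle(n=20, noise=0.05))
    for birth, death in bars[:3]:
        print(f"H1 bar: birth={birth:.3f} death={death:.3f} "
              f"persistence={death - birth:.3f}")
```

On a circle-like cloud the first bar printed should be much longer than the rest, which is the "one long bar" signature described under the key insights.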
Key insights:
- One long bar in $H_1$ = one persistent loop = the circle. Everything else is noise.
- Hierarchy of Variables:
  - Sample Size determines the foundation
    - As sample size increases, the gaps between points become smaller (higher density):
      - denser grid picks up signal sooner → shifts the birth value left
      - triangles which fill in the loop appear sooner → shifts death value left
      - death drops more slowly than birth → persistence increases with sample size
  - Noise determines the signal quality
    - As noise level increases, more points scatter inward/outward of the circle → pairwise gaps become irregular:
      - some "shortcut edge" appears at a smaller $\epsilon$ than it would on a clean circle → kills cycle earlier → lower death value
      - scattered points mean loops form later → birth rises slightly
      - more noise → lower persistence
  - Thresholding permanently discards data and is used for visual interpretation ONLY
- Birth/Death are decoupled: birth does NOT determine death (they happen at independent filtration values)
  - BIRTH reflects SAMPLING
  - DEATH reflects GEOMETRY
    - if circle radius doubles → birth AND death values double
  - low sample size can mean the true loop never closes cleanly
  - small sample size INCREASES the birth value (death barely moves), because sparse point clouds require a larger radius before the loop is connected
  - noise creates many short-spanned spurious loops
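The sampling-versus-geometry split above can be checked against an idealised case. The sketch below assumes evenly spaced, noise-free points on a circle (the notebook uses random noisy samples, so its values will differ) and uses the known closed form for that case: the loop is born at the nearest-neighbour chord and dies once chords span a third of the circle. `clean_circle_h1` is a hypothetical name.

```python
import math


def clean_circle_h1(n, radius=1.0):
    """(birth, death) of the H1 bar for n evenly spaced points on a circle.

    Values are edge lengths, matching Ripser's convention. Assumes n >= 4
    and a noise-free, evenly spaced sample (an idealisation).
    """
    def chord(steps):
        return 2.0 * radius * math.sin(math.pi * steps / n)

    birth = chord(1)                 # loop closes once neighbours connect
    death = chord(math.ceil(n / 3))  # loop fills once chords span a third of the circle
    return birth, death


for n in (50, 100, 200):
    birth, death = clean_circle_h1(n)
    print(f"n={n:3d} birth={birth:.4f} death={death:.4f} "
          f"persistence={death - birth:.4f}")

# Doubling the radius doubles both values.
print(clean_circle_h1(12, radius=1.0), clean_circle_h1(12, radius=2.0))
```

In this idealised setting birth shrinks roughly like $1/n$ while death stays close to $\sqrt{3}$ times the radius, which is why denser samples give longer bars.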
Core question: Do complex choice and simplex count matter if the persistence diagrams are (nearly) the same?
What it does:
- Generates shared point clouds (noisy circle, torus) used by both complexes
- Computes Rips PH via Ripser; Alpha PH via Gudhi
- Compares simplex counts, runtimes, and filtration values
- Inspects the Delaunay triangulation underlying Alpha using scipy.spatial
- Quantifies diagram similarity via bottleneck and Wasserstein distances
Vietoris-Rips adds an edge between two points whenever their distance ≤ ε, and fills in higher simplices whenever all pairwise edges exist. It has no awareness of geometry beyond pairwise distances. This causes simplex counts to grow as $O(n^k)$ in the number of points $n$, for simplices on up to $k$ vertices.
Alpha is constrained to the Delaunay triangulation. A simplex is only added at the filtration value at which its circumsphere has been reached.
Analogy: Rips inflates a balloon uniformly around every point with no awareness of neighbours. Alpha is shrink-wrap — it only grows where the geometry says points are genuinely adjacent.
Alpha cannot add any simplex that doesn't already exist in the Delaunay triangulation. At each filtration step, a Delaunay simplex is admitted only if its circumsphere has been reached. This means:
- Alpha is always a subcomplex of the Delaunay triangulation (a proper subset until ε is large enough to admit every Delaunay simplex)
- Rips can produce "geometrically impossible" simplices — triangles whose circumcircles contain other points — because it applies no such constraint
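The empty-circumcircle condition is easy to test directly in 2D. The following sketch uses hypothetical helper names, assumes points in general position (no three collinear), and lists the Rips triangles at a given ε that Delaunay, and therefore Alpha, would reject.

```python
import itertools
import math


def circumcircle(a, b, c):
    """Centre and squared radius of the circle through 2D points a, b, c."""
    (ax, ay), (bx, by), (cx, cy) = a, b, c
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if d == 0.0:
        raise ValueError("collinear points have no circumcircle")
    a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    return (ux, uy), (ax - ux) ** 2 + (ay - uy) ** 2


def is_delaunay(tri, points, tol=1e-12):
    """True if no other point lies strictly inside the triangle's circumcircle."""
    (ux, uy), r2 = circumcircle(*(points[i] for i in tri))
    return all((px - ux) ** 2 + (py - uy) ** 2 >= r2 - tol
               for idx, (px, py) in enumerate(points) if idx not in tri)


def non_delaunay_rips_triangles(points, eps):
    """Rips triangles at scale eps whose circumcircle contains another point."""
    rejected = []
    for tri in itertools.combinations(range(len(points)), 3):
        in_rips = all(math.dist(points[i], points[j]) <= eps
                      for i, j in itertools.combinations(tri, 2))
        if in_rips and not is_delaunay(tri, points):
            rejected.append(tri)
    return rejected


# Three outer points with a fourth point inside the outer triangle.
cloud = [(0.0, 0.0), (2.0, 0.0), (1.0, 2.0), (1.0, 0.5)]
print(non_delaunay_rips_triangles(cloud, eps=3.0))  # → [(0, 1, 2)]
```

The outer triangle is admitted by Rips as soon as its three edges exist, even though the interior point sits inside its circumcircle.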
Rips counts grew roughly quadratically with sample size (about 3.8× when n doubled from 100 to 200); Alpha grew near-linearly:
| Dataset | n | Rips edges | Alpha simplices | Ratio |
|---|---|---|---|---|
| Noisy circle | 100 | 4462 | 555 | ~8× |
| Noisy circle | 200 | 17154 | 1149 | ~15× |
| Noisy circle | 300 | 39710 | 1747 | ~23× |
| Torus | 300 | 94044 | 12783 | ~7.4× |
The downstream consequence: more simplices → larger boundary matrices → slower matrix reduction (the core PH algorithm) → higher RAM usage. Rips becomes computationally impractical before Alpha does on the same dataset.
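For a sense of scale, the worst case is simple to compute: once ε exceeds the diameter of the cloud, Rips contains every possible simplex. The sketch below only evaluates those binomial upper bounds (the helper name is hypothetical); the counts in the table sit below them because the filtration was not run out to the full diameter.

```python
from math import comb


def max_rips_simplices(n, vertices_per_simplex):
    """Upper bound on Rips simplices with the given number of vertices."""
    return comb(n, vertices_per_simplex)


for n in (100, 200, 300):
    edges = max_rips_simplices(n, 2)
    triangles = max_rips_simplices(n, 3)
    print(f"n={n}: at most {edges} edges, {triangles} triangles")
# n=100: at most 4950 edges, 161700 triangles
# n=200: at most 19900 edges, 1313400 triangles
# n=300: at most 44850 edges, 4455100 triangles
```

Each simplex is one column of the boundary matrix, so these bounds translate directly into memory and reduction time.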
Rips filtration values are raw pairwise distances. Gudhi's Alpha filtration values are squared circumradii, so the two diagrams sit on different scales. Even so, they record the same features.
The Nerve Theorem explains the agreement: the Alpha complex is homotopy equivalent to the union of balls at radius ε, and Rips is sandwiched between Čech complexes (nerves of that same cover) at nearby scales. Despite different simplex sets, both are triangulating the same shape.
This was verified quantitatively using:
- Bottleneck distance — smallest possible worst-case matched pair displacement between diagrams
- Wasserstein distance — total displacement across all matched pairs (not just the worst one)
A bottleneck distance much smaller than the signal bar's persistence confirms the diagrams are functionally equivalent.
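To make the two definitions concrete, here is a brute-force sketch for very small diagrams. It is not how a library such as persim computes these distances; it tries every matching, assumes the $L_\infty$ ground metric for both distances (library conventions may differ), and the function names are hypothetical.

```python
import itertools

DIAGONAL = None  # stands for "matched to the diagonal"


def _cost(p, q):
    """L-infinity cost of matching p to q, where either may be the diagonal."""
    if p is DIAGONAL and q is DIAGONAL:
        return 0.0
    if p is DIAGONAL:
        return (q[1] - q[0]) / 2.0
    if q is DIAGONAL:
        return (p[1] - p[0]) / 2.0
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))


def diagram_distances(d1, d2):
    """Return (bottleneck, total displacement) by checking every matching.

    Only practical for a handful of points: the search is factorial in size.
    """
    # Pad each side so any point may be matched to the diagonal instead.
    left = list(d1) + [DIAGONAL] * len(d2)
    right = list(d2) + [DIAGONAL] * len(d1)
    best_worst = best_total = float("inf")
    for perm in itertools.permutations(range(len(right))):
        costs = [_cost(left[i], right[j]) for i, j in enumerate(perm)]
        best_worst = min(best_worst, max(costs, default=0.0))
        best_total = min(best_total, sum(costs))
    return best_worst, best_total


# One signal bar in each diagram, plus a short noise bar in the second.
rips_like = [(0.5, 1.7)]
alpha_like = [(0.55, 1.6), (0.2, 0.3)]
bottleneck, total = diagram_distances(rips_like, alpha_like)
print(f"bottleneck={bottleneck:.3f} total={total:.3f}")
```

In this toy case the signal bars are matched to each other and the noise bar is matched to the diagonal, so the bottleneck distance is set by the signal bars alone and is small next to their persistence.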
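NOTE: Alpha complex PDs have a different scale than Rips because Gudhi reports squared circumradii while Ripser reports pairwise distances (diameters). Converting an Alpha value $a$ to $2\sqrt{a}$ puts the two on a comparable scale.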
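A small conversion helper makes the scale difference explicit. This sketch assumes the diagram is a list of `(birth, death)` pairs in Gudhi's squared-radius convention; `alpha_to_rips_scale` is a hypothetical name. The conversion aligns the units only, so the two diagrams can still differ modestly after rescaling.

```python
import math


def alpha_to_rips_scale(diagram):
    """Map squared-radius values to diameters: a -> 2 * sqrt(a).

    Infinite deaths stay infinite because math.sqrt(inf) is inf.
    """
    return [(2.0 * math.sqrt(birth), 2.0 * math.sqrt(death))
            for birth, death in diagram]


print(alpha_to_rips_scale([(0.25, 1.0)]))  # → [(1.0, 2.0)]
```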
The torus has known Betti numbers:
| Feature | Meaning |
|---|---|
| $\beta_0 = 1$ | One connected component |
| $\beta_1 = 2$ | Two independent loops (short way and long way around the donut) |
| $\beta_2 = 1$ | One enclosed void (the interior of the tube surface) |
Both complexes recovered this signature. Note: the torus Rips/Alpha ratio (~7.4×) is lower than the circle's at the same sample size (~23×). The torus is a 2D surface in 3D, so its Delaunay triangulation is still sparse relative to what Rips constructs, but less dramatically so than for a 1D curve.
| Criterion | Use Alpha | Use Rips |
|---|---|---|
| Dimensionality | 2D or 3D data | Any dimension |
| Data type | Geometric / spatial | Abstract / high-dimensional |
| Sample size | Large | Small–medium |
| Filtration values | Need geometric interpretability | Distance-based is sufficient |
| Implementation | Gudhi | Ripser |
Limit of Alpha: Delaunay triangulation in high dimensions is computationally intractable. For data beyond ~3D, Rips is the default.
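The dimensionality and data-type rows of the table can be written as a rule of thumb. The function below is a hypothetical sketch, not part of the tda/ package; it encodes only the criteria stated above and leaves sample size out because the table gives no numeric cutoff for it.

```python
def suggest_complex(ambient_dim, has_coordinates=True):
    """Suggest 'alpha' or 'rips' from the criteria in the table above.

    ambient_dim: dimension of the space the points live in.
    has_coordinates: False when only a distance matrix is available,
    in which case no Delaunay triangulation can be built.
    """
    if has_coordinates and ambient_dim <= 3:
        return "alpha"
    return "rips"


print(suggest_complex(2))                          # → alpha
print(suggest_complex(50))                         # → rips
print(suggest_complex(3, has_coordinates=False))   # → rips
```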
- Same topology, different structure — Rips and Alpha are two roads to the same destination. The Nerve Theorem ensures they agree on what matters.
- The simplex count gap is large and grows — Alpha's geometric constraint keeps it near $O(n)$ in low dimensions; Rips blows up as $O(n^k)$.
- Filtration values are not interchangeable — Rips uses distances, Alpha uses squared circumradii.
- Alpha is Delaunay-constrained — it cannot add geometrically unjustified simplices. Rips can and does.
- The persistence diagram is a record, not a live object — features are born and die during filtration; the diagram captures when, not what's currently active.
- High-dimensional data breaks Alpha — Rips is the universal fallback when geometry can't be leveraged.
Requirements: Python 3.8+, WSL or Linux/macOS recommended
git clone https://github.com/declanecr/tda-learning.git
cd tda-learning
pip install -r requirements.txt
jupyter lab
Core libraries:
- ripser
- persim
- gudhi
- scikit-learn
- scipy
- numpy
- matplotlib
- Simplicial complexes — Vietoris-Rips, Alpha, Čech
- Filtration — growing the neighborhood radius ε from 0 → ∞
- Persistent homology — tracking when topological features are born and die
- Betti numbers — β₀ (components), β₁ (loops), β₂ (voids)
- Persistence diagrams & barcodes — the standard TDA output
- Cycle representatives — which actual data points form a loop
- Persistence images — vectorised PH output for ML pipelines
- Nerve Theorem — why Čech and Alpha complexes capture the topology of the union of balls, and (via interleaving with Čech) why Vietoris-Rips approximates it
Built with curiosity and persistent homology.

