TDA-Learning

A self-directed curriculum for research in Topological Data Analysis (TDA) and Persistent Homology, built project-first with comprehension checkpoints after each stage.

Shape is a language. This is how I'm learning to read it.

What this is

TDA extracts shape-based features from data using tools from algebraic topology. Instead of asking "how close are points?" it asks "what loops, voids, and connected components appear and disappear as we grow a neighborhood radius?"

This repo documents a five-stage learning roadmap — from pipeline basics through real-data experiments and ML integration.

Roadmap

Stage	Topic	Status
01	Pipeline Fluency — Noisy circle, Ripser, H₀/H₁, barcodes, persistence diagrams	✅ Complete
02	Complex Comparison — Vietoris-Rips vs Alpha complexes	✅ Complete
03	Geometric Interpretation — Torus, MNIST digits (cubical persistence), Betti numbers	⌛ In Progress
04	Real Data — RCSB PDB proteins, OpenTopography terrain	🔲 Upcoming
05	ML Integration — Persistence images as classifier features	🔲 Upcoming

Project Structure

tda-learning/
├── README.md
├── requirements.txt
├── tda/                            # Shared TDA utilities package
│   ├── __init__.py                 # Re-exports full public API
│   ├── tda_data.py                 # Pure computation layer (no matplotlib)
│   ├── tda_viz.py                  # Atomic ax-level renderers
│   └── tda_figures.py              # GridSpec figure orchestrators
├── portfolio/
│   └── tda-portfolio.html          # Visual project tracker
├── stage-01-pipeline-fluency/
│   └── Noisy_Circle_Comparison.ipynb
├── stage-02-complex-comparison/
├── stage-03-geometric-interpretation/
├── stage-04-real-data/
└── stage-05-ml-integration/

The `tda/` package is a three-layer library extracted from the notebooks. All stage notebooks import from it rather than re-implementing computation or plotting inline. See `tda/README.md` for a full API reference.

Stage 01 — Noisy Circle $H_1$ Recovery

Core question: Can persistent homology recover the circular structure of a point cloud under increasing noise?

What it does:

Generates noisy samples from $S^1$ (a circle in 2D)
Runs Vietoris-Rips filtration via Ripser
Computes $H_0$ (connected components) and $H_1$ (loops)
Visualizes persistence diagrams and barcodes
Sweeps noise levels to find the persistence threshold at which the $H_1$ signal degrades
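
The first steps of this pipeline can be sketched without any TDA library. The notebook itself uses Ripser; the snippet below is a standard-library illustration only, and the names `noisy_circle` and `h0_deaths` are hypothetical (they are not taken from the `tda/` package). It covers $H_0$ only, using the fact that in a Rips filtration every component is born at 0 and dies at the length of the edge that first merges it into another component.

```python
import math
import random
from itertools import combinations

def noisy_circle(n, radius=1.0, noise=0.05, seed=0):
    """Sample n points near a circle; Gaussian noise is added to each coordinate."""
    rng = random.Random(seed)
    pts = []
    for _ in range(n):
        theta = rng.uniform(0.0, 2.0 * math.pi)
        pts.append((radius * math.cos(theta) + rng.gauss(0.0, noise),
                    radius * math.sin(theta) + rng.gauss(0.0, noise)))
    return pts

def h0_deaths(points):
    """Finite H0 death values of the Rips filtration (pairwise-distance scale).

    Edges are processed from shortest to longest; an edge that joins two
    different components kills one of them, so its length is a death value.
    """
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    edges = sorted((math.dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(len(points)), 2))
    deaths = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:            # this edge merges two components
            parent[ri] = rj
            deaths.append(length)
    return deaths

cloud = noisy_circle(n=60, noise=0.05, seed=1)
deaths = h0_deaths(cloud)
print("finite H0 bars:", len(deaths))
print("longest finite H0 bar:", round(max(deaths), 3))
```

With `n` points there are always `n - 1` finite $H_0$ bars plus one bar that never dies. Recovering $H_1$ needs triangles as well as edges, which is the part Ripser handles.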

Key insights:

One long bar in $H_1$ = one persistent loop = the circle. Everything else is noise.
Hierarchy of Variables:
- Sample Size determines the foundation
  - As sample size increases, the gaps between points become smaller (higher density):
    - denser grid picks up signal sooner → shifts the birth value left
    - triangles which fill in the loop appear sooner → shifts death value left
      - death drops more slowly than birth → persistence increases with sample size
- Noise determines the signal quality
  - As noise level increases, more points scatter inward/outward of the circle → pairwise gaps become irregular:
    - some "shortcut edge" appears at a smaller $\epsilon$ than it would on a clean circle → kills cycle earlier → lower death value
    - scattered points mean loops form later → birth rises slightly
    - more noise → lower persistence
- Thresholding permanently discards information, so it is used for visual interpretation ONLY
Birth/Death are decoupled — birth does NOT determine death (they happen at independent filtration values)
- BIRTH reflects SAMPLING
- DEATH reflects GEOMETRY
- if circle radius doubles → birth AND death values double
- low sample size can mean true loop never closes cleanly
  - small sample size INCREASES the birth value (death is affected much less): sparse point clouds need a larger radius before the loop's edges connect
- noise creates many short-lived spurious loops
  - High Noise + Low n → noisy features can OUTLIVE the signal feature. ONLY FIX IS MORE DATA
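
The sampling and scaling observations above can be checked against an idealised case: `n` evenly spaced, noise-free points on a circle. This is an assumption for illustration, not the noisy data used in the notebook. In that case the loop is born at the side length of the regular n-gon and, by a known result on Rips complexes of evenly spaced circle points, dies once every point is connected to the points $\lceil n/3 \rceil$ steps away. The function name `ideal_h1_bar` is hypothetical, and values are on the pairwise-distance scale.

```python
import math

def ideal_h1_bar(n, radius=1.0):
    """Birth and death of the H1 bar for n evenly spaced, noise-free points.

    Idealised model only: real samples are neither evenly spaced nor noise-free.
    """
    if n < 4:
        raise ValueError("need at least 4 points for a loop with nonzero persistence")
    birth = 2.0 * radius * math.sin(math.pi / n)        # side of the regular n-gon
    k = (n + 2) // 3                                    # ceil(n / 3), integer arithmetic
    death = 2.0 * radius * math.sin(math.pi * k / n)    # chord spanning k steps
    return birth, death

for n in (10, 30, 100):
    b, d = ideal_h1_bar(n)
    print(f"n={n:3d}  birth={b:.3f}  death={d:.3f}  persistence={d - b:.3f}")

# Doubling the radius doubles both endpoints of the bar.
print(ideal_h1_bar(30, radius=1.0), ideal_h1_bar(30, radius=2.0))
```

In this model birth shrinks towards 0 as `n` grows, while death stays between $\sqrt{3}R$ and $2R$ and settles near $\sqrt{3}R$. That matches the observation that birth tracks sampling density and death tracks the geometry of the circle.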

Stage 02 — Vietoris-Rips vs Alpha Complex Comparison

Core question: Do complex choice and simplex count matter if the persistence diagrams are (nearly) the same?

What it does:

Generates shared point clouds (noisy circle, torus) used by both complexes
Computes Rips PH via Ripser; Alpha PH via Gudhi
Compares simplex counts, runtimes, and filtration values
Inspects the Delaunay triangulation underlying Alpha using scipy.spatial
Quantifies diagram similarity via bottleneck and Wasserstein distances

How they're built

Vietoris-Rips adds an edge between two points whenever their distance ≤ ε, and fills in higher simplices whenever all pairwise edges exist. It has no awareness of geometry beyond pairwise distances, so any set of mutually close points becomes a simplex. As a result the number of $k$-simplices can grow as $O(n^{k+1})$: up to $\binom{n}{2}$ edges, $\binom{n}{3}$ triangles, and so on.
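
The Rips rule is short enough to write out by brute force. The sketch below is a standard-library illustration, not how Ripser builds or stores the complex, and `rips_simplices` is a hypothetical name. It enumerates every candidate vertex set, so it is only suitable for very small inputs.

```python
import math
from itertools import combinations

def rips_simplices(points, eps, max_dim=2):
    """List the simplices of the Vietoris-Rips complex at scale eps.

    A set of vertices forms a simplex exactly when every pair in it is
    within distance eps (pairwise-distance convention).
    """
    n = len(points)
    close = [[math.dist(points[i], points[j]) <= eps for j in range(n)]
             for i in range(n)]
    simplices = {0: [(i,) for i in range(n)]}
    for dim in range(1, max_dim + 1):
        simplices[dim] = [s for s in combinations(range(n), dim + 1)
                          if all(close[a][b] for a, b in combinations(s, 2))]
    return simplices

# Unit square: sides have length 1, diagonals have length sqrt(2).
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
for eps in (1.0, 1.5):
    counts = {dim: len(s) for dim, s in rips_simplices(square, eps).items()}
    print(eps, counts)
```

At ε = 1.0 only the four sides are present, so the square is an open loop. At ε = 1.5 both diagonals appear and all four triangles are filled in at once, which closes the loop.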

Alpha is constrained to the Delaunay triangulation. A simplex is only added at filtration value $\alpha$ if its squared circumsphere radius is ≤ $\alpha$ and the circumsphere contains no other points (the empty circumsphere condition). This geometric grounding keeps simplex counts at $O(n)$ in 2D and close to linear on typical 3D data (the 3D worst case is $O(n^2)$).

Analogy: Rips inflates a balloon uniformly around every point with no awareness of neighbours. Alpha is shrink-wrap — it only grows where the geometry says points are genuinely adjacent.

The Delaunay triangulation is Alpha's backbone

Alpha cannot add any simplex that doesn't already exist in the Delaunay triangulation. At each filtration step, a Delaunay simplex is admitted only if its circumsphere has been reached. This means:

Alpha is a subcomplex of the Delaunay triangulation at every filtration value, and equals it once $\alpha$ is large enough
Rips can produce "geometrically impossible" simplices — triangles whose circumcircles contain other points — because it applies no such constraint
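
A small 2D example makes the empty circumcircle condition concrete. This is a sketch with hypothetical helper names (`circumcircle`, `is_delaunay_triangle`) and a hand-picked point set. The notebook inspects the real triangulation with scipy.spatial, which is not used here.

```python
import math
from itertools import combinations

def circumcircle(a, b, c):
    """Circumcentre and circumradius of a non-degenerate 2D triangle."""
    ax, ay = a
    bx, by = b
    cx, cy = c
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if d == 0.0:
        raise ValueError("points are collinear")
    ux = ((ax**2 + ay**2) * (by - cy) + (bx**2 + by**2) * (cy - ay)
          + (cx**2 + cy**2) * (ay - by)) / d
    uy = ((ax**2 + ay**2) * (cx - bx) + (bx**2 + by**2) * (ax - cx)
          + (cx**2 + cy**2) * (bx - ax)) / d
    return (ux, uy), math.dist((ux, uy), a)

def is_delaunay_triangle(tri, points, tol=1e-12):
    """True if no other point lies strictly inside the triangle's circumcircle."""
    centre, r = circumcircle(*tri)
    return all(math.dist(centre, p) >= r - tol for p in points if p not in tri)

# Three outer points plus one point inside the big triangle.
points = [(0.0, 0.0), (2.0, 0.0), (1.0, 2.0), (1.0, 0.5)]
big = (points[0], points[1], points[2])
small = (points[0], points[1], points[3])

# Rips admits the big triangle as soon as eps reaches its longest edge.
longest_edge = max(math.dist(p, q) for p, q in combinations(big, 2))
print("Rips admits big triangle at eps =", round(longest_edge, 3))
print("big triangle passes Delaunay test:  ", is_delaunay_triangle(big, points))
print("small triangle passes Delaunay test:", is_delaunay_triangle(small, points))
```

The big triangle's circumcircle contains the fourth point, so it can never appear in the Alpha complex, while Rips adds it based on edge lengths alone.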

Simplex count comparison

Rips counts grew roughly quadratically with sample size on the noisy circle; Alpha grew near-linearly:

Dataset	n	Rips edges	Alpha simplices	Ratio
Noisy circle	100	4462	555	~8×
Noisy circle	200	17154	1149	~15×
Noisy circle	300	39710	1747	~23×
Torus	300	94044	12783	~7.4×

The downstream consequence: more simplices → larger boundary matrices → slower matrix reduction (the core PH algorithm) → higher RAM usage. Rips becomes computationally impractical before Alpha does on the same dataset.
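
The growth rates can be read off the noisy circle rows of the table directly. The sketch below uses only the numbers reported there; `growth_exponent` is a hypothetical helper that estimates the slope of log(count) against log(n).

```python
import math

# (n, Rips column, Alpha column) for the noisy circle rows of the table above
rows = [(100, 4462, 555), (200, 17154, 1149), (300, 39710, 1747)]

def growth_exponent(n0, y0, n1, y1):
    """Slope of log(count) against log(n): about 1 is linear, about 2 is quadratic."""
    return math.log(y1 / y0) / math.log(n1 / n0)

(n0, r0, a0), (n1, r1, a1) = rows[0], rows[-1]
print("Rips exponent :", round(growth_exponent(n0, r0, n1, r1), 2))
print("Alpha exponent:", round(growth_exponent(n0, a0, n1, a1), 2))

# For comparison: the largest possible number of edges on n points is C(n, 2).
for n, rips_count, _ in rows:
    print(n, "Rips count / C(n, 2) =", round(rips_count / math.comb(n, 2), 2))
```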

Filtration values are not directly comparable

Rips filtration values are raw pairwise distances. Gudhi's Alpha filtration values are squared circumradii ($\alpha = r^2$), which is why they appear much smaller. Don't compare them numerically without accounting for this scaling.
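
One way to account for the scaling is to map Alpha values onto the distance scale before comparing. The sketch below assumes, as stated above, that Alpha values are squared radii and Rips values are pairwise distances (diameters), which gives the conversion $2\sqrt{\alpha}$. The function names are hypothetical.

```python
import math

def alpha_to_rips_scale(alpha_value):
    """Map a squared-radius Alpha value onto the pairwise-distance scale."""
    return 2.0 * math.sqrt(alpha_value)

def convert_diagram(alpha_dgm):
    """Apply the conversion to every (birth, death) pair of a diagram."""
    return [(alpha_to_rips_scale(b), alpha_to_rips_scale(d)) for b, d in alpha_dgm]

# An edge of length 1 has circumradius 0.5, so alpha = 0.25 maps back to 1.0.
print(alpha_to_rips_scale(0.25))

# An equilateral triangle with side 1 enters Rips at 1.0, but its circumradius
# is 1/sqrt(3), so the rescaled Alpha value is larger than 1.
print(alpha_to_rips_scale(1.0 / 3.0))
```

The conversion is exact for edges only. Higher simplices enter the two filtrations by different rules (longest edge versus circumradius), so rescaled values are roughly comparable rather than identical.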

Despite different structures, diagrams are nearly identical

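The Nerve Theorem explains most of this. The Alpha complex, like the Čech complex, is homotopy equivalent to the union of balls at the corresponding radius, because it is the nerve of a good cover of that union. Vietoris-Rips is not itself a nerve, but it is sandwiched between Čech complexes at nearby scales, so its persistent homology approximates that of the same space. Despite different simplex sets, both complexes are triangulating the same shape, which is why the prominent features agree even when exact birth and death values differ.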

This was verified quantitatively using:

Bottleneck distance — smallest possible worst-case matched pair displacement between diagrams
Wasserstein distance — total displacement across all matched pairs (not just the worst one)

A bottleneck distance much smaller than the signal bar's persistence confirms the diagrams are functionally equivalent.

NOTE: Alpha persistence diagrams are on a different scale than Rips diagrams because Gudhi reports $\alpha = r^2$. Bottleneck and Wasserstein distances computed on the raw values will therefore suggest that the diagrams are very different, even though they convey almost identical information about the same point cloud. Rescale the Alpha values before comparing.
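
The effect of the scale mismatch on the bottleneck distance can be shown on a toy example. The brute-force routine below is a sketch for tiny diagrams only, with a hypothetical name. In practice a library routine would be used instead, and the diagram values here are made up for illustration rather than taken from the notebooks.

```python
import math
from itertools import permutations

def bottleneck_bruteforce(dgm_a, dgm_b):
    """Exact bottleneck distance for tiny diagrams of finite (birth, death) pairs.

    Tries every matching, so it is only usable for a handful of points.
    """
    def project(p):
        mid = (p[0] + p[1]) / 2.0
        return (mid, mid)

    # Pad each side with the diagonal projections of the other side, so any
    # point may be matched to the diagonal instead of to a real point.
    left = [(p, False) for p in dgm_a] + [(project(q), True) for q in dgm_b]
    right = [(q, False) for q in dgm_b] + [(project(p), True) for p in dgm_a]
    if not left:
        return 0.0

    def cost(x, y):
        (p, p_on_diag), (q, q_on_diag) = x, y
        if p_on_diag and q_on_diag:
            return 0.0  # diagonal-to-diagonal matches are free
        return max(abs(p[0] - q[0]), abs(p[1] - q[1]))  # L-infinity distance

    return min(max(cost(left[i], right[j]) for i, j in enumerate(perm))
               for perm in permutations(range(len(right))))

rips_h1 = [(0.20, 1.70)]        # made-up values on the distance scale
alpha_h1_raw = [(0.01, 0.80)]   # made-up values on the squared-radius scale
alpha_h1_rescaled = [(2 * math.sqrt(b), 2 * math.sqrt(d)) for b, d in alpha_h1_raw]

raw = bottleneck_bruteforce(rips_h1, alpha_h1_raw)
rescaled = bottleneck_bruteforce(rips_h1, alpha_h1_rescaled)
print("raw:", round(raw, 3), " rescaled:", round(rescaled, 3))
```

On these toy numbers the raw distance is comparable to the bar's own persistence, while the rescaled distance is a small fraction of it.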

Circle

Torus

Torus topology recovery

The torus has known Betti numbers: $\beta_0 = 1$, $\beta_1 = 2$, $\beta_2 = 1$.

Feature	Meaning
$\beta_0 = 1$	One connected component
$\beta_1 = 2$	Two independent loops (short way and long way around the donut)
$\beta_2 = 1$	One enclosed void (the interior of the tube surface)

Both complexes recovered this signature. Note: the torus Rips/Alpha ratio (~7.4×) is lower than the circle's at the same sample size (~23×). The torus cloud lives in 3D, where the Delaunay triangulation also contains tetrahedra and carries far more simplices per point (about 43 here, versus about 6 for the 2D circle), so Alpha is still much sparser than Rips but by a smaller margin.
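
Reading Betti numbers off a set of diagrams amounts to counting the bars that are clearly longer than the noise. The sketch below uses hand-written, hypothetical diagrams (not output from the notebooks) and a hypothetical helper `betti_numbers`. The persistence cutoff is an assumption that has to be chosen per dataset.

```python
import math

def betti_numbers(diagrams, min_persistence):
    """Count bars per dimension whose persistence exceeds the cutoff."""
    return [sum(1 for birth, death in dgm if death - birth > min_persistence)
            for dgm in diagrams]

# Hypothetical diagrams for a torus-like cloud: index = homology dimension.
toy_diagrams = [
    [(0.0, math.inf), (0.0, 0.08), (0.0, 0.05)],   # H0: one bar never dies
    [(0.30, 1.10), (0.35, 2.40), (0.40, 0.45)],    # H1: two long bars, one short
    [(1.20, 2.60), (0.90, 0.95)],                  # H2: one long bar, one short
]

betti = betti_numbers(toy_diagrams, min_persistence=0.5)
euler = sum((-1) ** k * b for k, b in enumerate(betti))
print("Betti numbers:", betti)
print("Euler characteristic:", euler)
```

The alternating sum $\beta_0 - \beta_1 + \beta_2$ gives the Euler characteristic, which is 0 for a torus, so it works as a quick consistency check on the recovered signature.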

When to use which complex

Criterion	Use Alpha	Use Rips
Dimensionality	2D or 3D data	Any dimension
Data type	Geometric / spatial	Abstract / high-dimensional
Sample size	Large $n$ (efficiency matters)	Small–medium $n$
Filtration values	Need geometric interpretability	Distance-based is sufficient
Implementation	Gudhi	Ripser

Limit of Alpha: the Delaunay triangulation becomes computationally impractical in high dimensions, since its worst-case size grows as $O(n^{\lceil d/2 \rceil})$ in dimension $d$. For data beyond ~3D, Rips is the default.

Key takeaways

Same topology, different structure — Rips and Alpha are two roads to the same destination. The Nerve Theorem (for Alpha) and the interleaving of Rips with Čech complexes ensure they agree on what matters.
The simplex count gap is large and grows — Alpha's geometric constraint keeps it $O(n)$ in 2D; Rips can reach $O(n^{k+1})$ simplices of dimension $k$.
Filtration values are not interchangeable — Rips uses distances, Alpha uses squared circumradii.
Alpha is Delaunay-constrained — it cannot add geometrically unjustified simplices. Rips can and does.
The persistence diagram is a record, not a live object — features are born and die during filtration; the diagram captures when, not what's currently active.
High-dimensional data breaks Alpha — Rips is the universal fallback when geometry can't be leveraged.
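
The "record, not a live object" point can be made concrete: the diagram stores when each feature was born and died, and the set of features alive at any particular scale is derived from it. A minimal sketch, with a hypothetical helper name and made-up bars:

```python
def alive_at(diagram, eps):
    """Number of features alive at scale eps: born at or before eps, not yet dead."""
    return sum(1 for birth, death in diagram if birth <= eps < death)

# Made-up H1 diagram: one long bar and two short ones.
h1 = [(0.20, 1.70), (0.30, 0.35), (0.50, 0.52)]

for eps in (0.10, 0.32, 1.00, 2.00):
    print(eps, alive_at(h1, eps))
```

Evaluating this count over a range of scales gives the Betti number as a function of ε, which is the "live" view that the diagram summarises.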

Setup

Requirements: Python 3.8+, WSL or Linux/macOS recommended

git clone https://github.com/declanecr/tda-learning.git
cd tda-learning
pip install -r requirements.txt
jupyter lab

Core libraries:

ripser
persim
gudhi
scikit-learn
scipy
numpy
matplotlib

Key concepts in scope

Simplicial complexes — Vietoris-Rips, Alpha, Čech
Filtration — growing the neighborhood radius ε from 0 → ∞
Persistent homology — tracking when topological features are born and die
Betti numbers — β₀ (components), β₁ (loops), β₂ (voids)
Persistence diagrams & barcodes — the standard TDA output
Cycle representatives — which actual data points form a loop
Persistence images — vectorised PH output for ML pipelines
Nerve Theorem — why Čech and Alpha complexes capture the topology of the union of balls (and, via interleaving with Čech, why Vietoris-Rips approximates it)
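
As a preview of the persistence image idea listed above, the sketch below turns a diagram into a fixed-length vector. It is a simplified, standard-library illustration under stated assumptions: Gaussians are evaluated at pixel centres rather than integrated over pixels, each point is weighted linearly by its persistence, and infinite bars are skipped. It is not the implementation from any library, and all names and default parameters are hypothetical.

```python
import math

def persistence_image(diagram, resolution=10, sigma=0.1,
                      birth_range=(0.0, 1.0), pers_range=(0.0, 1.0)):
    """Rasterise (birth, death) pairs onto a resolution x resolution grid.

    Rows index persistence, columns index birth. Each finite point adds a
    Gaussian bump centred at (birth, persistence), scaled by its persistence.
    """
    b_lo, b_hi = birth_range
    p_lo, p_hi = pers_range
    image = [[0.0] * resolution for _ in range(resolution)]
    for birth, death in diagram:
        if math.isinf(death):
            continue  # infinite bars have no finite persistence
        pers = death - birth
        for row in range(resolution):
            py = p_lo + (row + 0.5) * (p_hi - p_lo) / resolution
            for col in range(resolution):
                bx = b_lo + (col + 0.5) * (b_hi - b_lo) / resolution
                sq_dist = (bx - birth) ** 2 + (py - pers) ** 2
                image[row][col] += pers * math.exp(-sq_dist / (2.0 * sigma ** 2))
    return image

def flatten(image):
    """Row-major feature vector, ready to pass to a classifier."""
    return [value for row in image for value in row]

toy_h1 = [(0.25, 0.90), (0.40, 0.45)]   # made-up bars: one long, one short
features = flatten(persistence_image(toy_h1))
print(len(features), round(max(features), 3))
```

The persistence weighting means short bars contribute little to the vector, so the long bar dominates the features without any explicit thresholding.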

Resources

Built with curiosity and persistent homology.
