Skip to content

Add Gaussian Process Vine Copulas (GPVINE) for multivariate dependence modelling#496

Open
usptact wants to merge 9 commits into
dotnet:mainfrom
usptact:feature/gpvine
Open

Add Gaussian Process Vine Copulas (GPVINE) for multivariate dependence modelling#496
usptact wants to merge 9 commits into
dotnet:mainfrom
usptact:feature/gpvine

Conversation

@usptact

@usptact usptact commented Jun 27, 2026

Copy link
Copy Markdown

Summary

This PR adds Gaussian Process Vine Copulas (GPVINE) to Infer.NET — a method for
modelling the joint dependence structure of multivariate continuous data, where each
conditional dependence can vary as a learned function of conditioning variables.

It implements:

Lopez-Paz, Hernández-Lobato & Ghahramani, Gaussian Process Vine Copulas for
Multivariate Dependence
, ICML 2013.

The change is purely additive — it introduces new types and two new methods on
existing public classes (Factor.BivariateCopula, Variable.BivariateCopula); no
existing behaviour is modified and there are no breaking changes.

Motivation

A copula models dependence separately from marginals; a vine factorizes a
high-dimensional copula into a hierarchy of bivariate copulas. GPVINE drops the usual
"simplifying assumption" by letting each conditional copula's Kendall's τ vary with
its conditioning variables, learned with a sparse GP and Expectation Propagation. This
captures context-dependent dependence (e.g. "the correlation between X and Y changes
with Z") that a covariance matrix or an ordinary copula cannot.

The method is a natural fit for Infer.NET: the GP prior, ARD kernel, FITC sparse
approximation, EP engine, model-evidence blocks and quadrature already exist. The only
genuinely new inference component is a custom copula likelihood EP message operator;
everything else composes existing primitives or is closed-form application code.

What's included

  • Copula families (Distributions/Copulas/): IBivariateCopula with
    GaussianCopula, ClaytonCopula, GumbelCopula; CopulaFamily + CopulaFactory.
  • EP factor (Factors/): Factor.BivariateCopula and BivariateCopulaOp, a
    non-conjugate copula likelihood on a Gaussian latent (modelled on LogisticOp/ExpOp,
    with adaptive-quadrature moment matching and an evidence contribution).
  • Models API: Variable.BivariateCopula(...) so the factor is usable in any graph.
  • GP conditional fitter (Models): GaussianProcessCopulaFitter — fits one
    conditional copula via sparse GP + EP.
  • Regular-vine layer (Distributions/Copulas/Vine/): RegularVine (PIT, Kendall-τ,
    maximum-spanning-tree structure selection, h-function tree recursion, log-likelihood,
    generative sampling and conditional simulation), plus EmpiricalMarginal, Pit,
    KendallTau, MaxSpanningTree.
  • Example: Tutorials/GaussianProcessVine.cs ([Example], mirrors the GP classifier).

Capabilities

  • Fit SVINE (simplifying assumption) and full GPVINE; score held-out log-likelihood.
  • Learn and predict dependence (τ) as a function of conditioning variables.
  • Gaussian / Clayton / Gumbel families (symmetric and tail-asymmetric dependence).
  • Generative sampling (data-scale joint draws) and conditional simulation /
    imputation / posterior-predictive given partial observations.

Testing

Adds ~90 XUnit tests (CopulaTests, BivariateCopulaOpTests, VineTests,
GPVineTests) covering: copula math (density normalization, h-functions vs numerical
CDF derivatives, τ↔θ round-trips, inverse-h round-trips); the EP operator and evidence
vs brute-force quadrature; vine structure invariants and log-likelihood; end-to-end GP
recovery of a known g(z); GPVINE beating SVINE on conditionally-dependent data; and
sampling round-trips. Long integration tests are marked Category=Performance.

All tests pass under dotnet test -c DebugCore with the standard exclusion filter.

Backward compatibility

Additive only. New files plus two new methods on Factor/Variable; the new EP
operator is discovered via [FactorMethod] assembly scanning, so no registration or
configuration changes are required.

Known limitations (scoped for follow-up)

  • Sampling / conditional simulation require a canonical (C-vine) structure; general
    R-vine sampling is not yet implemented.
  • Conditioning is exact for the leading roots (fit with rootOrder); arbitrary-subset
    conditioning without refitting is not yet supported.
  • Frank family, rotated Archimedean copulas (negative tail dependence), and automated
    evidence-based hyperparameter optimisation are not implemented.

Commits

Organised as phases for reviewability: copula math → EP factor/operator → single
conditional copula end-to-end → vine first tree → deeper-tree recursion (SVINE) →
GP conditional fitter + driver → Clayton/Gumbel families → sampling → conditional
simulation.

usptact and others added 9 commits June 26, 2026 10:21
Begin porting the GPVINE method (Lopez-Paz, Hernandez-Lobato &
Ghahramani, "Gaussian Process Vine Copulas for Multivariate
Dependence", ICML 2013) into Infer.NET. This first phase delivers the
pure-numeric copula building blocks (Layer B), with no dependency on the
inference compiler.

New library src/Runtime/Distributions/Copulas/ (auto-included via the
Runtime project's .cs glob), mirroring the Distributions/Kernels style:

- IBivariateCopula: interface for a conditional bivariate copula
  c(u,v|tau). Every family is parametrised through Kendall's tau (Sec. 3,
  Table 1) so a single latent function g can drive any family. Exposes
  TauToTheta/ThetaToTau, LogDensity, and the HFunction (conditional CDF)
  used to generate deeper-tree pseudo-observations (eq. 5).
- GaussianCopula: the workhorse family (Appendix A). theta = sin(pi/2 tau);
  log density (eq. 12) and h-functions (eqs. 13-14) via MMath.NormalCdfInv
  / NormalCdf, with theta clamped to keep 1 - theta^2 strictly positive as
  tau -> +-1.
- CopulaFamily enum + CopulaFactory. Only Gaussian is implemented; other
  families throw NotImplementedException pending later phases.

Tests (test/Tests/CopulaTests.cs, 7 passing) cover the Phase 0 exit
criteria: tau<->theta round-trips, tau=0 => unit density, density
symmetry, density integrates to 1 over the unit square, h-functions in
(0,1), and h-functions matching central differences of the bivariate
normal CDF (pinning the eq. 13/14 "given" convention). Also adds GPVINE.md,
the detailed implementation plan and phased roadmap.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The core inference deliverable of GPVINE: a custom Expectation
Propagation operator that injects the copula likelihood c(u,v | tau)
as a non-conjugate term on a single Gaussian latent score = f(z), with
the paper's link tau = g(score) = 2*Phi(score) - 1 folded inside the
factor (Sec. 3). Structurally this mirrors ExpOp / LogisticOp.

Factor.BivariateCopula(double score, int family) -> Vector(u,v)
(Factor.cs): the generative form, used only for sampling/testing
(samples a correlated normal pair and maps through the standard normal
CDF). EP never calls it; it invokes the operator below.

BivariateCopulaOp (BivariateCopulaOp.cs), [FactorMethod(Factor,
"BivariateCopula")], [Quality(Experimental)]:
- ScoreAverageConditional: moment-matches the tilted distribution
  N(s;m,v) c(u,v|g(s)) and returns projected/cavity (ForceProper),
  guarding against improper/degenerate results.
- LogAverageFactor: the evidence ln Z = log int N(s;m,v) c(u,v|g(s)) ds,
  used for hyperparameter tuning and family selection (Sec. 4).
- LogEvidenceRatio (= LogAverageFactor, since the pair is observed) and
  a uniform ScoreAverageConditionalInit.

Moments are computed by adaptive Clenshaw-Curtis quadrature in
standardized coordinates centered on the proposal (score*to_score),
with a log-offset folded analytically back into ln Z. The Gaussian
copula likelihood is heavy-tailed (it blows up as the latent pushes
tau -> +-1 for centered observations), which fixed Gauss-Hermite
undersamples; the adaptive rule's 1/tan endpoint transform resolves it,
as anticipated in the plan's numerical notes.

Tests (test/Tests/BivariateCopulaOpTests.cs, 51 passing) meet the Phase 1
exit criterion: the EP message (natural parameters) and ln Z match a
brute-force fine-grid quadrature of the same integrals to < 1e-3 across a
grid of cavities and observed pairs. Also covers c(u,v)=c(v,u) symmetry,
concordant/discordant data shifting the score the right way, and
uniform/point-mass cavity edge cases.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Wire the copula EP factor into a real Infer.NET model and confirm the
full pipeline (compiler discovery, scheduling, EP) recovers a known
latent function g(z) - the Fig. 4 analogue of Lopez-Paz et al. (2013).

- Variable.BivariateCopula(Variable<double> score, int family) ->
  Variable<Vector> (Variable.cs), the Models-layer wrapper next to
  FunctionEvaluate, delegating to Variable<Vector>.Factor.

- GPVineTests.SingleConditionalCopula_RecoversG: builds one vine edge
  (SparseGP prior, score = FunctionEvaluate(f, z), observed copula pair
  uv = BivariateCopula(score, Gaussian)) on synthetic data drawn from
  g(z) = 0.6 sin(z), runs EP, and checks the sparse-GP posterior mean of
  tau tracks the truth on a held-out grid (corr ~0.95, RMSE ~0.14;
  thresholds corr > 0.8, RMSE < 0.25).

This exercises the new factor through the compiler for the first time:
[FactorMethod] discovery of BivariateCopulaOp and scheduling of the
observed-output factor both work with no additional plumbing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The pure-C# vine application layer (Layer B) that builds and scores the
first tree T_1 of a regular vine (Lopez-Paz et al., 2013, Sec. 2). No
inference engine is involved: T_1's copulas are unconditional and fitted
by the Kendall's-tau MLE, the cheap deterministic path the paper uses for
the first tree (Sec. 4). Deeper trees (h-function recursion, eq. 5, and
the per-edge sparse-GP fit) remain future work, matching the Python
prototype whose build_next_tree is still a stub.

New src/Runtime/Distributions/Copulas/Vine/:
- Pit.cs           empirical PIT u = rank/(n+1) per column, in (0,1),
                   with average-rank tie handling and a unit clamp.
- KendallTau.cs    empirical Kendall's tau-b between two series (the MST
                   edge weight and the unconditional T_1 fit).
- MaxSpanningTree.cs  Prim's algorithm returning the d-1 maximum-weight
                   spanning-tree edges on a symmetric |tau| matrix.
- VineEdge.cs      VineEdge (conditioned/conditioning sets, (u,v) series,
                   Z, weight, fitted tau) + VineTree.
- RegularVine.cs   Fit(x): PIT then BuildFirstTree via MST on |tau| with
                   per-edge tau MLE; LogLikelihood(x): summed T_1 copula
                   log-density (eq. 4, first-tree terms).

Tests (test/Tests/VineTests.cs, 7 passing) cover the Phase 3 exit
criteria: PIT ranks/ties, Kendall's tau against hand values, MST picks
the heaviest edges, and on an AR(1) Gaussian chain T_1 selects the
correct chain structure and recovers tau = (2/pi) arcsin(rho) to < 0.04
while its log-likelihood clearly beats the independence baseline.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…VINE)

Implements the regular-vine recursion that the Python prototype left as a
stub (build_next_tree), so the vine now spans all d-1 trees, not just T_1.

- IConditionalCopulaFitter.cs: IConditionalCopulaPosterior (tau = g(z))
  and IConditionalCopulaFitter (fits one conditional edge), plus
  SparseGPCopulaPosterior wrapping a SparseGP posterior with
  tau = 2*Phi(E[f(z)]) - 1. The fit itself needs the inference engine, so
  it is injected; the recursion stays in Runtime alongside SparseGP.

- VineEdge: extended with conditioned variables Left/Right, conditioning
  set D(e), the joined node endpoints (for the proximity test), the
  per-conditioned-variable h-function series, an optional conditional
  Posterior, and N(e) = C(e) cup D(e).

- RegularVine.Fit(x, nTrees, fitter): builds T_1 (unconditional Kendall-
  tau MLE) then each deeper tree via BuildNextTree - candidate edges join
  two previous-tree edges sharing a node (proximity condition, Sec. 2.1),
  the new pseudo-observations are the previous edges' h-functions (eq. 5),
  and the tree is the maximum spanning tree on the |tau| of those series.
  With no fitter, deeper edges are fitted unconditionally (the SVINE
  simplifying-assumption baseline). LogLikelihood replays the fitted
  structure on the PIT of new data, summing each edge's conditional copula
  log-density (eq. 4) - a valid held-out log-likelihood.

- MaxSpanningTree.Prim now returns a spanning forest when proximity marks
  pairs as disallowed (NegativeInfinity), so it serves the deeper trees.

Tests (VineTests, +2): full-vine structure invariants on equicorrelated
Gaussian data (d-level edges per tree, |D(e)| = level-1, |N(e)| = level+1,
proximity, d(d-1)/2 edges total) and held-out log-likelihood strictly
increasing T_1 < T_2 < T_3 (deeper trees capture real conditional
dependence). 9 vine tests passing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…tion test

Completes the full GPVINE pipeline: deeper-tree conditional copulas are now
fitted with a sparse GP + EP, reproducing the paper's central result that
modelling conditional dependence beats the simplifying assumption.

- GaussianProcessCopulaFitter (src/Compiler, IConditionalCopulaFitter):
  builds the per-edge model (SparseGP prior -> score = f(z) ->
  Variable.BivariateCopula likelihood), runs EP and returns a
  SparseGPCopulaPosterior. GP mean initialised to Phi^{-1}((tau_MLE+1)/2)
  (Sec. 4); ARD kernel plus a white-noise nugget that keeps the inducing-
  point covariance positive definite for pseudo-observation inputs in (0,1);
  inducing inputs a subset of the conditioning data (Sec. 3.1). Lives in the
  modelling layer because the fit needs InferenceEngine; injected into the
  Runtime vine recursion via the IConditionalCopulaFitter interface.

- GaussianProcessVine.cs (src/Tutorials, [Example]): mirrors the GP
  classifier tutorial; prints SVINE vs GPVINE held-out log-likelihood as the
  number of trees grows.

- GPVineTests.GPVine_BeatsSVine_WhenConditionalDependenceVaries: on data
  whose (X,Y)|Z copula has a Z-varying tau, GPVINE's held-out log-likelihood
  clearly exceeds SVINE's (observed ~319 vs ~219 nats). Tagged Performance so
  it is excluded from the default fast filter.

This satisfies the Phase 4 exit criterion: the full GPVINE runs end to end
and conditional modelling raises the test log-likelihood (Fig. 3 trend).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adds two Archimedean families alongside the Gaussian copula. The EP
operator, the Variable.BivariateCopula factor, the vine recursion and the
GP fitter are all family-agnostic, so only the family classes plus the
factory dispatch were needed - no operator/vine/fitter changes.

- ClaytonCopula (lower-tail dependence): theta = 2 tau/(1-tau); closed-form
  log density and exchangeable h-functions; an independence guard for
  theta -> 0.
- GumbelCopula (upper-tail dependence): theta = 1/(1-tau); log density and
  h-functions computed in log space via LogSumExp so that x^theta does not
  overflow/underflow for tau near 1 (where a naive evaluation spuriously
  returned +Inf in the EP quadrature).
- Both clamp tau to the family's (0, 1) range, since the GPVINE link
  tau = 2*Phi(f) - 1 spans (-1, 1) and these families model only positive
  dependence (negative-dependence rotations are left for later).
- CopulaFactory now builds Clayton and Gumbel; Frank still throws.

Tests (+12):
- CopulaTests: tau<->theta round-trip, independence limit, density
  integrates to 1, and h-functions matching numerical dC/du, dC/dv from
  each family's closed-form CDF.
- BivariateCopulaOpTests: the EP message and lnZ for Clayton and Gumbel
  match brute-force quadrature (run through the actual factor operator).
- VineTests: Clayton-sampled data fitted with RegularVine(Clayton) recovers
  its tau and beats a Gaussian-copula fit (true family wins).
- GPVineTests: a conditional Clayton copula fitted through the real EP
  factor + compiler recovers tau(z) (corr ~0.91). Performance-tagged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Turns the fitted vine into a generative model: draw joint samples, do
conditional/posterior-predictive simulation. Sampling runs the vine's
h-function machinery backwards via the inverse-Rosenblatt transform.

Math additions:
- IBivariateCopula.InverseHFunction(w, x, tau, given): inverts the
  conditional CDF. Closed form for Gaussian and Clayton; monotone bisection
  for Gumbel. Round-trip tested against HFunction for all families.
- EmpiricalMarginal: the inverse of the rank-based PIT (quantile function at
  plotting positions k/(n+1)), so samples are returned on the data scale.
- IConditionalCopulaPosterior.SampleTau(z): a tau drawn from the GP
  posterior at z (SparseGPCopulaPosterior samples f ~ posterior), so
  posterior-predictive draws reflect the latent function's uncertainty
  rather than just its mean.

Vine:
- VineStructure enum (Regular / Canonical) added to Fit; the existing MST
  builder and a new maximum-weight star builder share one selection hook.
  Fitting/scoring stay general; sampling uses a canonical (C-vine) structure.
- RegularVine.Sample(n): the canonical-vine inverse-Rosenblatt simulation
  (Aas et al. 2009), mapping back to data units via the stored marginals,
  drawing conditional-edge tau from the GP posterior. Throws a clear message
  if the vine was fitted Regular.

Tests (+10): inverse-h round-trips; the C-vine sampler reproduces the
training data's pairwise Kendall tau and marginals (Gaussian and Clayton);
sample -> refit recovers tau; Sample throws for a Regular structure; and a
Performance-tagged GPVINE posterior-predictive draw reproduces the hub
dependence. 88 GPVINE tests passing.

General R-vine sampling (arbitrary MST structures) is left as a follow-up;
fit with VineStructure.Canonical to sample today.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adds conditional simulation to the canonical-vine sampler: fix a chosen set
of variables to observed values and draw the rest from their conditional
distribution (imputation / posterior-predictive given partial observations).

- EmpiricalMarginal.Cdf: the forward empirical PIT (inverse of Quantile),
  mapping a known data value to its pseudo-observation.
- RegularVine.Fit(..., rootOrder): forces the leading C-vine roots, so a
  chosen conditioning set becomes a prefix of the vine order. Threaded
  through the canonical first-/next-tree star selection (forced centre).
- RegularVine.SampleConditional(known, n): seeds the inverse-Rosenblatt
  sweep with the forward transform of the known values (held fixed) and
  samples the remainder; returns data-scale rows with the conditioned
  variables equal to their inputs. Conditional-edge tau is drawn from the GP
  posterior, so imputations are posterior-predictive. Exact when the
  conditioned set is the leading roots; throws with guidance otherwise.

Tests (+3): conditioning fixes the value exactly and shifts the sampled
others in the right direction (Gaussian); non-root / regular-structure
conditioning throws; and a Performance-tagged GPVINE imputation reproduces
the Z-dependent sign of the (X,Y)|Z copula (tau ~ +0.39 at Z=1, -0.71 at
Z=-1). 91 GPVINE tests passing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@usptact

usptact commented Jun 27, 2026

Copy link
Copy Markdown
Author

@dotnet-policy-service agree

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant