Add Gaussian Process Vine Copulas (GPVINE) for multivariate dependence modelling by usptact · Pull Request #496 · dotnet/infer

usptact · 2026-06-27T04:55:35Z

Summary

This PR adds Gaussian Process Vine Copulas (GPVINE) to Infer.NET — a method for
modelling the joint dependence structure of multivariate continuous data, where each
conditional dependence can vary as a learned function of conditioning variables.

It implements:

Lopez-Paz, Hernández-Lobato & Ghahramani, Gaussian Process Vine Copulas for
Multivariate Dependence, ICML 2013.

The change is purely additive — it introduces new types and two new methods on
existing public classes (Factor.BivariateCopula, Variable.BivariateCopula); no
existing behaviour is modified and there are no breaking changes.

Motivation

A copula models dependence separately from marginals; a vine factorizes a
high-dimensional copula into a hierarchy of bivariate copulas. GPVINE drops the usual
"simplifying assumption" by letting each conditional copula's Kendall's τ vary with
its conditioning variables, learned with a sparse GP and Expectation Propagation. This
captures context-dependent dependence (e.g. "the correlation between X and Y changes
with Z") that a covariance matrix or an ordinary copula cannot.

The method is a natural fit for Infer.NET: the GP prior, ARD kernel, FITC sparse
approximation, EP engine, model-evidence blocks and quadrature already exist. The only
genuinely new inference component is a custom copula likelihood EP message operator;
everything else composes existing primitives or is closed-form application code.

What's included

Copula families (Distributions/Copulas/): IBivariateCopula with
GaussianCopula, ClaytonCopula, GumbelCopula; CopulaFamily + CopulaFactory.
EP factor (Factors/): Factor.BivariateCopula and BivariateCopulaOp, a
non-conjugate copula likelihood on a Gaussian latent (modelled on LogisticOp/ExpOp,
with adaptive-quadrature moment matching and an evidence contribution).
Models API: Variable.BivariateCopula(...) so the factor is usable in any graph.
GP conditional fitter (Models): GaussianProcessCopulaFitter — fits one
conditional copula via sparse GP + EP.
Regular-vine layer (Distributions/Copulas/Vine/): RegularVine (PIT, Kendall-τ,
maximum-spanning-tree structure selection, h-function tree recursion, log-likelihood,
generative sampling and conditional simulation), plus EmpiricalMarginal, Pit,
KendallTau, MaxSpanningTree.
Example: Tutorials/GaussianProcessVine.cs ([Example], mirrors the GP classifier).

Capabilities

Fit SVINE (simplifying assumption) and full GPVINE; score held-out log-likelihood.
Learn and predict dependence (τ) as a function of conditioning variables.
Gaussian / Clayton / Gumbel families (symmetric and tail-asymmetric dependence).
Generative sampling (data-scale joint draws) and conditional simulation /
imputation / posterior-predictive given partial observations.

Testing

Adds ~90 XUnit tests (CopulaTests, BivariateCopulaOpTests, VineTests,
GPVineTests) covering: copula math (density normalization, h-functions vs numerical
CDF derivatives, τ↔θ round-trips, inverse-h round-trips); the EP operator and evidence
vs brute-force quadrature; vine structure invariants and log-likelihood; end-to-end GP
recovery of a known g(z); GPVINE beating SVINE on conditionally-dependent data; and
sampling round-trips. Long integration tests are marked Category=Performance.

All tests pass under dotnet test -c DebugCore with the standard exclusion filter.

Backward compatibility

Additive only. New files plus two new methods on Factor/Variable; the new EP
operator is discovered via [FactorMethod] assembly scanning, so no registration or
configuration changes are required.

Known limitations (scoped for follow-up)

Sampling / conditional simulation require a canonical (C-vine) structure; general
R-vine sampling is not yet implemented.
Conditioning is exact for the leading roots (fit with rootOrder); arbitrary-subset
conditioning without refitting is not yet supported.
Frank family, rotated Archimedean copulas (negative tail dependence), and automated
evidence-based hyperparameter optimisation are not implemented.

Commits

Organised as phases for reviewability: copula math → EP factor/operator → single
conditional copula end-to-end → vine first tree → deeper-tree recursion (SVINE) →
GP conditional fitter + driver → Clayton/Gumbel families → sampling → conditional
simulation.

Begin porting the GPVINE method (Lopez-Paz, Hernandez-Lobato & Ghahramani, "Gaussian Process Vine Copulas for Multivariate Dependence", ICML 2013) into Infer.NET. This first phase delivers the pure-numeric copula building blocks (Layer B), with no dependency on the inference compiler. New library src/Runtime/Distributions/Copulas/ (auto-included via the Runtime project's .cs glob), mirroring the Distributions/Kernels style: - IBivariateCopula: interface for a conditional bivariate copula c(u,v|tau). Every family is parametrised through Kendall's tau (Sec. 3, Table 1) so a single latent function g can drive any family. Exposes TauToTheta/ThetaToTau, LogDensity, and the HFunction (conditional CDF) used to generate deeper-tree pseudo-observations (eq. 5). - GaussianCopula: the workhorse family (Appendix A). theta = sin(pi/2 tau); log density (eq. 12) and h-functions (eqs. 13-14) via MMath.NormalCdfInv / NormalCdf, with theta clamped to keep 1 - theta^2 strictly positive as tau -> +-1. - CopulaFamily enum + CopulaFactory. Only Gaussian is implemented; other families throw NotImplementedException pending later phases. Tests (test/Tests/CopulaTests.cs, 7 passing) cover the Phase 0 exit criteria: tau<->theta round-trips, tau=0 => unit density, density symmetry, density integrates to 1 over the unit square, h-functions in (0,1), and h-functions matching central differences of the bivariate normal CDF (pinning the eq. 13/14 "given" convention). Also adds GPVINE.md, the detailed implementation plan and phased roadmap. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The core inference deliverable of GPVINE: a custom Expectation Propagation operator that injects the copula likelihood c(u,v | tau) as a non-conjugate term on a single Gaussian latent score = f(z), with the paper's link tau = g(score) = 2*Phi(score) - 1 folded inside the factor (Sec. 3). Structurally this mirrors ExpOp / LogisticOp. Factor.BivariateCopula(double score, int family) -> Vector(u,v) (Factor.cs): the generative form, used only for sampling/testing (samples a correlated normal pair and maps through the standard normal CDF). EP never calls it; it invokes the operator below. BivariateCopulaOp (BivariateCopulaOp.cs), [FactorMethod(Factor, "BivariateCopula")], [Quality(Experimental)]: - ScoreAverageConditional: moment-matches the tilted distribution N(s;m,v) c(u,v|g(s)) and returns projected/cavity (ForceProper), guarding against improper/degenerate results. - LogAverageFactor: the evidence ln Z = log int N(s;m,v) c(u,v|g(s)) ds, used for hyperparameter tuning and family selection (Sec. 4). - LogEvidenceRatio (= LogAverageFactor, since the pair is observed) and a uniform ScoreAverageConditionalInit. Moments are computed by adaptive Clenshaw-Curtis quadrature in standardized coordinates centered on the proposal (score*to_score), with a log-offset folded analytically back into ln Z. The Gaussian copula likelihood is heavy-tailed (it blows up as the latent pushes tau -> +-1 for centered observations), which fixed Gauss-Hermite undersamples; the adaptive rule's 1/tan endpoint transform resolves it, as anticipated in the plan's numerical notes. Tests (test/Tests/BivariateCopulaOpTests.cs, 51 passing) meet the Phase 1 exit criterion: the EP message (natural parameters) and ln Z match a brute-force fine-grid quadrature of the same integrals to < 1e-3 across a grid of cavities and observed pairs. Also covers c(u,v)=c(v,u) symmetry, concordant/discordant data shifting the score the right way, and uniform/point-mass cavity edge cases. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Wire the copula EP factor into a real Infer.NET model and confirm the full pipeline (compiler discovery, scheduling, EP) recovers a known latent function g(z) - the Fig. 4 analogue of Lopez-Paz et al. (2013). - Variable.BivariateCopula(Variable<double> score, int family) -> Variable<Vector> (Variable.cs), the Models-layer wrapper next to FunctionEvaluate, delegating to Variable<Vector>.Factor. - GPVineTests.SingleConditionalCopula_RecoversG: builds one vine edge (SparseGP prior, score = FunctionEvaluate(f, z), observed copula pair uv = BivariateCopula(score, Gaussian)) on synthetic data drawn from g(z) = 0.6 sin(z), runs EP, and checks the sparse-GP posterior mean of tau tracks the truth on a held-out grid (corr ~0.95, RMSE ~0.14; thresholds corr > 0.8, RMSE < 0.25). This exercises the new factor through the compiler for the first time: [FactorMethod] discovery of BivariateCopulaOp and scheduling of the observed-output factor both work with no additional plumbing. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The pure-C# vine application layer (Layer B) that builds and scores the first tree T_1 of a regular vine (Lopez-Paz et al., 2013, Sec. 2). No inference engine is involved: T_1's copulas are unconditional and fitted by the Kendall's-tau MLE, the cheap deterministic path the paper uses for the first tree (Sec. 4). Deeper trees (h-function recursion, eq. 5, and the per-edge sparse-GP fit) remain future work, matching the Python prototype whose build_next_tree is still a stub. New src/Runtime/Distributions/Copulas/Vine/: - Pit.cs empirical PIT u = rank/(n+1) per column, in (0,1), with average-rank tie handling and a unit clamp. - KendallTau.cs empirical Kendall's tau-b between two series (the MST edge weight and the unconditional T_1 fit). - MaxSpanningTree.cs Prim's algorithm returning the d-1 maximum-weight spanning-tree edges on a symmetric |tau| matrix. - VineEdge.cs VineEdge (conditioned/conditioning sets, (u,v) series, Z, weight, fitted tau) + VineTree. - RegularVine.cs Fit(x): PIT then BuildFirstTree via MST on |tau| with per-edge tau MLE; LogLikelihood(x): summed T_1 copula log-density (eq. 4, first-tree terms). Tests (test/Tests/VineTests.cs, 7 passing) cover the Phase 3 exit criteria: PIT ranks/ties, Kendall's tau against hand values, MST picks the heaviest edges, and on an AR(1) Gaussian chain T_1 selects the correct chain structure and recovers tau = (2/pi) arcsin(rho) to < 0.04 while its log-likelihood clearly beats the independence baseline. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…VINE) Implements the regular-vine recursion that the Python prototype left as a stub (build_next_tree), so the vine now spans all d-1 trees, not just T_1. - IConditionalCopulaFitter.cs: IConditionalCopulaPosterior (tau = g(z)) and IConditionalCopulaFitter (fits one conditional edge), plus SparseGPCopulaPosterior wrapping a SparseGP posterior with tau = 2*Phi(E[f(z)]) - 1. The fit itself needs the inference engine, so it is injected; the recursion stays in Runtime alongside SparseGP. - VineEdge: extended with conditioned variables Left/Right, conditioning set D(e), the joined node endpoints (for the proximity test), the per-conditioned-variable h-function series, an optional conditional Posterior, and N(e) = C(e) cup D(e). - RegularVine.Fit(x, nTrees, fitter): builds T_1 (unconditional Kendall- tau MLE) then each deeper tree via BuildNextTree - candidate edges join two previous-tree edges sharing a node (proximity condition, Sec. 2.1), the new pseudo-observations are the previous edges' h-functions (eq. 5), and the tree is the maximum spanning tree on the |tau| of those series. With no fitter, deeper edges are fitted unconditionally (the SVINE simplifying-assumption baseline). LogLikelihood replays the fitted structure on the PIT of new data, summing each edge's conditional copula log-density (eq. 4) - a valid held-out log-likelihood. - MaxSpanningTree.Prim now returns a spanning forest when proximity marks pairs as disallowed (NegativeInfinity), so it serves the deeper trees. Tests (VineTests, +2): full-vine structure invariants on equicorrelated Gaussian data (d-level edges per tree, |D(e)| = level-1, |N(e)| = level+1, proximity, d(d-1)/2 edges total) and held-out log-likelihood strictly increasing T_1 < T_2 < T_3 (deeper trees capture real conditional dependence). 9 vine tests passing. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…tion test Completes the full GPVINE pipeline: deeper-tree conditional copulas are now fitted with a sparse GP + EP, reproducing the paper's central result that modelling conditional dependence beats the simplifying assumption. - GaussianProcessCopulaFitter (src/Compiler, IConditionalCopulaFitter): builds the per-edge model (SparseGP prior -> score = f(z) -> Variable.BivariateCopula likelihood), runs EP and returns a SparseGPCopulaPosterior. GP mean initialised to Phi^{-1}((tau_MLE+1)/2) (Sec. 4); ARD kernel plus a white-noise nugget that keeps the inducing- point covariance positive definite for pseudo-observation inputs in (0,1); inducing inputs a subset of the conditioning data (Sec. 3.1). Lives in the modelling layer because the fit needs InferenceEngine; injected into the Runtime vine recursion via the IConditionalCopulaFitter interface. - GaussianProcessVine.cs (src/Tutorials, [Example]): mirrors the GP classifier tutorial; prints SVINE vs GPVINE held-out log-likelihood as the number of trees grows. - GPVineTests.GPVine_BeatsSVine_WhenConditionalDependenceVaries: on data whose (X,Y)|Z copula has a Z-varying tau, GPVINE's held-out log-likelihood clearly exceeds SVINE's (observed ~319 vs ~219 nats). Tagged Performance so it is excluded from the default fast filter. This satisfies the Phase 4 exit criterion: the full GPVINE runs end to end and conditional modelling raises the test log-likelihood (Fig. 3 trend). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Adds two Archimedean families alongside the Gaussian copula. The EP operator, the Variable.BivariateCopula factor, the vine recursion and the GP fitter are all family-agnostic, so only the family classes plus the factory dispatch were needed - no operator/vine/fitter changes. - ClaytonCopula (lower-tail dependence): theta = 2 tau/(1-tau); closed-form log density and exchangeable h-functions; an independence guard for theta -> 0. - GumbelCopula (upper-tail dependence): theta = 1/(1-tau); log density and h-functions computed in log space via LogSumExp so that x^theta does not overflow/underflow for tau near 1 (where a naive evaluation spuriously returned +Inf in the EP quadrature). - Both clamp tau to the family's (0, 1) range, since the GPVINE link tau = 2*Phi(f) - 1 spans (-1, 1) and these families model only positive dependence (negative-dependence rotations are left for later). - CopulaFactory now builds Clayton and Gumbel; Frank still throws. Tests (+12): - CopulaTests: tau<->theta round-trip, independence limit, density integrates to 1, and h-functions matching numerical dC/du, dC/dv from each family's closed-form CDF. - BivariateCopulaOpTests: the EP message and lnZ for Clayton and Gumbel match brute-force quadrature (run through the actual factor operator). - VineTests: Clayton-sampled data fitted with RegularVine(Clayton) recovers its tau and beats a Gaussian-copula fit (true family wins). - GPVineTests: a conditional Clayton copula fitted through the real EP factor + compiler recovers tau(z) (corr ~0.91). Performance-tagged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Turns the fitted vine into a generative model: draw joint samples, do conditional/posterior-predictive simulation. Sampling runs the vine's h-function machinery backwards via the inverse-Rosenblatt transform. Math additions: - IBivariateCopula.InverseHFunction(w, x, tau, given): inverts the conditional CDF. Closed form for Gaussian and Clayton; monotone bisection for Gumbel. Round-trip tested against HFunction for all families. - EmpiricalMarginal: the inverse of the rank-based PIT (quantile function at plotting positions k/(n+1)), so samples are returned on the data scale. - IConditionalCopulaPosterior.SampleTau(z): a tau drawn from the GP posterior at z (SparseGPCopulaPosterior samples f ~ posterior), so posterior-predictive draws reflect the latent function's uncertainty rather than just its mean. Vine: - VineStructure enum (Regular / Canonical) added to Fit; the existing MST builder and a new maximum-weight star builder share one selection hook. Fitting/scoring stay general; sampling uses a canonical (C-vine) structure. - RegularVine.Sample(n): the canonical-vine inverse-Rosenblatt simulation (Aas et al. 2009), mapping back to data units via the stored marginals, drawing conditional-edge tau from the GP posterior. Throws a clear message if the vine was fitted Regular. Tests (+10): inverse-h round-trips; the C-vine sampler reproduces the training data's pairwise Kendall tau and marginals (Gaussian and Clayton); sample -> refit recovers tau; Sample throws for a Regular structure; and a Performance-tagged GPVINE posterior-predictive draw reproduces the hub dependence. 88 GPVINE tests passing. General R-vine sampling (arbitrary MST structures) is left as a follow-up; fit with VineStructure.Canonical to sample today. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Adds conditional simulation to the canonical-vine sampler: fix a chosen set of variables to observed values and draw the rest from their conditional distribution (imputation / posterior-predictive given partial observations). - EmpiricalMarginal.Cdf: the forward empirical PIT (inverse of Quantile), mapping a known data value to its pseudo-observation. - RegularVine.Fit(..., rootOrder): forces the leading C-vine roots, so a chosen conditioning set becomes a prefix of the vine order. Threaded through the canonical first-/next-tree star selection (forced centre). - RegularVine.SampleConditional(known, n): seeds the inverse-Rosenblatt sweep with the forward transform of the known values (held fixed) and samples the remainder; returns data-scale rows with the conditioned variables equal to their inputs. Conditional-edge tau is drawn from the GP posterior, so imputations are posterior-predictive. Exact when the conditioned set is the leading roots; throws with guidance otherwise. Tests (+3): conditioning fixes the value exactly and shifts the sampled others in the right direction (Gaussian); non-root / regular-structure conditioning throws; and a Performance-tagged GPVINE imputation reproduces the Z-dependent sign of the (X,Y)|Z copula (tau ~ +0.39 at Z=1, -0.71 at Z=-1). 91 GPVINE tests passing. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

usptact · 2026-06-27T04:57:42Z

@dotnet-policy-service agree

usptact and others added 9 commits June 26, 2026 10:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add Gaussian Process Vine Copulas (GPVINE) for multivariate dependence modelling#496

Add Gaussian Process Vine Copulas (GPVINE) for multivariate dependence modelling#496
usptact wants to merge 9 commits into
dotnet:mainfrom
usptact:feature/gpvine

usptact commented Jun 27, 2026 •

edited

Loading

Uh oh!

usptact commented Jun 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

usptact commented Jun 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Motivation

What's included

Capabilities

Testing

Backward compatibility

Known limitations (scoped for follow-up)

Commits

Uh oh!

usptact commented Jun 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

usptact commented Jun 27, 2026 •

edited

Loading