Add Gaussian Process Vine Copulas (GPVINE) for multivariate dependence modelling#496
Open
usptact wants to merge 9 commits into
Open
Add Gaussian Process Vine Copulas (GPVINE) for multivariate dependence modelling#496usptact wants to merge 9 commits into
usptact wants to merge 9 commits into
Conversation
Begin porting the GPVINE method (Lopez-Paz, Hernandez-Lobato & Ghahramani, "Gaussian Process Vine Copulas for Multivariate Dependence", ICML 2013) into Infer.NET. This first phase delivers the pure-numeric copula building blocks (Layer B), with no dependency on the inference compiler. New library src/Runtime/Distributions/Copulas/ (auto-included via the Runtime project's .cs glob), mirroring the Distributions/Kernels style: - IBivariateCopula: interface for a conditional bivariate copula c(u,v|tau). Every family is parametrised through Kendall's tau (Sec. 3, Table 1) so a single latent function g can drive any family. Exposes TauToTheta/ThetaToTau, LogDensity, and the HFunction (conditional CDF) used to generate deeper-tree pseudo-observations (eq. 5). - GaussianCopula: the workhorse family (Appendix A). theta = sin(pi/2 tau); log density (eq. 12) and h-functions (eqs. 13-14) via MMath.NormalCdfInv / NormalCdf, with theta clamped to keep 1 - theta^2 strictly positive as tau -> +-1. - CopulaFamily enum + CopulaFactory. Only Gaussian is implemented; other families throw NotImplementedException pending later phases. Tests (test/Tests/CopulaTests.cs, 7 passing) cover the Phase 0 exit criteria: tau<->theta round-trips, tau=0 => unit density, density symmetry, density integrates to 1 over the unit square, h-functions in (0,1), and h-functions matching central differences of the bivariate normal CDF (pinning the eq. 13/14 "given" convention). Also adds GPVINE.md, the detailed implementation plan and phased roadmap. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The core inference deliverable of GPVINE: a custom Expectation Propagation operator that injects the copula likelihood c(u,v | tau) as a non-conjugate term on a single Gaussian latent score = f(z), with the paper's link tau = g(score) = 2*Phi(score) - 1 folded inside the factor (Sec. 3). Structurally this mirrors ExpOp / LogisticOp. Factor.BivariateCopula(double score, int family) -> Vector(u,v) (Factor.cs): the generative form, used only for sampling/testing (samples a correlated normal pair and maps through the standard normal CDF). EP never calls it; it invokes the operator below. BivariateCopulaOp (BivariateCopulaOp.cs), [FactorMethod(Factor, "BivariateCopula")], [Quality(Experimental)]: - ScoreAverageConditional: moment-matches the tilted distribution N(s;m,v) c(u,v|g(s)) and returns projected/cavity (ForceProper), guarding against improper/degenerate results. - LogAverageFactor: the evidence ln Z = log int N(s;m,v) c(u,v|g(s)) ds, used for hyperparameter tuning and family selection (Sec. 4). - LogEvidenceRatio (= LogAverageFactor, since the pair is observed) and a uniform ScoreAverageConditionalInit. Moments are computed by adaptive Clenshaw-Curtis quadrature in standardized coordinates centered on the proposal (score*to_score), with a log-offset folded analytically back into ln Z. The Gaussian copula likelihood is heavy-tailed (it blows up as the latent pushes tau -> +-1 for centered observations), which fixed Gauss-Hermite undersamples; the adaptive rule's 1/tan endpoint transform resolves it, as anticipated in the plan's numerical notes. Tests (test/Tests/BivariateCopulaOpTests.cs, 51 passing) meet the Phase 1 exit criterion: the EP message (natural parameters) and ln Z match a brute-force fine-grid quadrature of the same integrals to < 1e-3 across a grid of cavities and observed pairs. Also covers c(u,v)=c(v,u) symmetry, concordant/discordant data shifting the score the right way, and uniform/point-mass cavity edge cases. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Wire the copula EP factor into a real Infer.NET model and confirm the full pipeline (compiler discovery, scheduling, EP) recovers a known latent function g(z) - the Fig. 4 analogue of Lopez-Paz et al. (2013). - Variable.BivariateCopula(Variable<double> score, int family) -> Variable<Vector> (Variable.cs), the Models-layer wrapper next to FunctionEvaluate, delegating to Variable<Vector>.Factor. - GPVineTests.SingleConditionalCopula_RecoversG: builds one vine edge (SparseGP prior, score = FunctionEvaluate(f, z), observed copula pair uv = BivariateCopula(score, Gaussian)) on synthetic data drawn from g(z) = 0.6 sin(z), runs EP, and checks the sparse-GP posterior mean of tau tracks the truth on a held-out grid (corr ~0.95, RMSE ~0.14; thresholds corr > 0.8, RMSE < 0.25). This exercises the new factor through the compiler for the first time: [FactorMethod] discovery of BivariateCopulaOp and scheduling of the observed-output factor both work with no additional plumbing. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The pure-C# vine application layer (Layer B) that builds and scores the
first tree T_1 of a regular vine (Lopez-Paz et al., 2013, Sec. 2). No
inference engine is involved: T_1's copulas are unconditional and fitted
by the Kendall's-tau MLE, the cheap deterministic path the paper uses for
the first tree (Sec. 4). Deeper trees (h-function recursion, eq. 5, and
the per-edge sparse-GP fit) remain future work, matching the Python
prototype whose build_next_tree is still a stub.
New src/Runtime/Distributions/Copulas/Vine/:
- Pit.cs empirical PIT u = rank/(n+1) per column, in (0,1),
with average-rank tie handling and a unit clamp.
- KendallTau.cs empirical Kendall's tau-b between two series (the MST
edge weight and the unconditional T_1 fit).
- MaxSpanningTree.cs Prim's algorithm returning the d-1 maximum-weight
spanning-tree edges on a symmetric |tau| matrix.
- VineEdge.cs VineEdge (conditioned/conditioning sets, (u,v) series,
Z, weight, fitted tau) + VineTree.
- RegularVine.cs Fit(x): PIT then BuildFirstTree via MST on |tau| with
per-edge tau MLE; LogLikelihood(x): summed T_1 copula
log-density (eq. 4, first-tree terms).
Tests (test/Tests/VineTests.cs, 7 passing) cover the Phase 3 exit
criteria: PIT ranks/ties, Kendall's tau against hand values, MST picks
the heaviest edges, and on an AR(1) Gaussian chain T_1 selects the
correct chain structure and recovers tau = (2/pi) arcsin(rho) to < 0.04
while its log-likelihood clearly beats the independence baseline.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…VINE) Implements the regular-vine recursion that the Python prototype left as a stub (build_next_tree), so the vine now spans all d-1 trees, not just T_1. - IConditionalCopulaFitter.cs: IConditionalCopulaPosterior (tau = g(z)) and IConditionalCopulaFitter (fits one conditional edge), plus SparseGPCopulaPosterior wrapping a SparseGP posterior with tau = 2*Phi(E[f(z)]) - 1. The fit itself needs the inference engine, so it is injected; the recursion stays in Runtime alongside SparseGP. - VineEdge: extended with conditioned variables Left/Right, conditioning set D(e), the joined node endpoints (for the proximity test), the per-conditioned-variable h-function series, an optional conditional Posterior, and N(e) = C(e) cup D(e). - RegularVine.Fit(x, nTrees, fitter): builds T_1 (unconditional Kendall- tau MLE) then each deeper tree via BuildNextTree - candidate edges join two previous-tree edges sharing a node (proximity condition, Sec. 2.1), the new pseudo-observations are the previous edges' h-functions (eq. 5), and the tree is the maximum spanning tree on the |tau| of those series. With no fitter, deeper edges are fitted unconditionally (the SVINE simplifying-assumption baseline). LogLikelihood replays the fitted structure on the PIT of new data, summing each edge's conditional copula log-density (eq. 4) - a valid held-out log-likelihood. - MaxSpanningTree.Prim now returns a spanning forest when proximity marks pairs as disallowed (NegativeInfinity), so it serves the deeper trees. Tests (VineTests, +2): full-vine structure invariants on equicorrelated Gaussian data (d-level edges per tree, |D(e)| = level-1, |N(e)| = level+1, proximity, d(d-1)/2 edges total) and held-out log-likelihood strictly increasing T_1 < T_2 < T_3 (deeper trees capture real conditional dependence). 9 vine tests passing. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…tion test
Completes the full GPVINE pipeline: deeper-tree conditional copulas are now
fitted with a sparse GP + EP, reproducing the paper's central result that
modelling conditional dependence beats the simplifying assumption.
- GaussianProcessCopulaFitter (src/Compiler, IConditionalCopulaFitter):
builds the per-edge model (SparseGP prior -> score = f(z) ->
Variable.BivariateCopula likelihood), runs EP and returns a
SparseGPCopulaPosterior. GP mean initialised to Phi^{-1}((tau_MLE+1)/2)
(Sec. 4); ARD kernel plus a white-noise nugget that keeps the inducing-
point covariance positive definite for pseudo-observation inputs in (0,1);
inducing inputs a subset of the conditioning data (Sec. 3.1). Lives in the
modelling layer because the fit needs InferenceEngine; injected into the
Runtime vine recursion via the IConditionalCopulaFitter interface.
- GaussianProcessVine.cs (src/Tutorials, [Example]): mirrors the GP
classifier tutorial; prints SVINE vs GPVINE held-out log-likelihood as the
number of trees grows.
- GPVineTests.GPVine_BeatsSVine_WhenConditionalDependenceVaries: on data
whose (X,Y)|Z copula has a Z-varying tau, GPVINE's held-out log-likelihood
clearly exceeds SVINE's (observed ~319 vs ~219 nats). Tagged Performance so
it is excluded from the default fast filter.
This satisfies the Phase 4 exit criterion: the full GPVINE runs end to end
and conditional modelling raises the test log-likelihood (Fig. 3 trend).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adds two Archimedean families alongside the Gaussian copula. The EP operator, the Variable.BivariateCopula factor, the vine recursion and the GP fitter are all family-agnostic, so only the family classes plus the factory dispatch were needed - no operator/vine/fitter changes. - ClaytonCopula (lower-tail dependence): theta = 2 tau/(1-tau); closed-form log density and exchangeable h-functions; an independence guard for theta -> 0. - GumbelCopula (upper-tail dependence): theta = 1/(1-tau); log density and h-functions computed in log space via LogSumExp so that x^theta does not overflow/underflow for tau near 1 (where a naive evaluation spuriously returned +Inf in the EP quadrature). - Both clamp tau to the family's (0, 1) range, since the GPVINE link tau = 2*Phi(f) - 1 spans (-1, 1) and these families model only positive dependence (negative-dependence rotations are left for later). - CopulaFactory now builds Clayton and Gumbel; Frank still throws. Tests (+12): - CopulaTests: tau<->theta round-trip, independence limit, density integrates to 1, and h-functions matching numerical dC/du, dC/dv from each family's closed-form CDF. - BivariateCopulaOpTests: the EP message and lnZ for Clayton and Gumbel match brute-force quadrature (run through the actual factor operator). - VineTests: Clayton-sampled data fitted with RegularVine(Clayton) recovers its tau and beats a Gaussian-copula fit (true family wins). - GPVineTests: a conditional Clayton copula fitted through the real EP factor + compiler recovers tau(z) (corr ~0.91). Performance-tagged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Turns the fitted vine into a generative model: draw joint samples, do conditional/posterior-predictive simulation. Sampling runs the vine's h-function machinery backwards via the inverse-Rosenblatt transform. Math additions: - IBivariateCopula.InverseHFunction(w, x, tau, given): inverts the conditional CDF. Closed form for Gaussian and Clayton; monotone bisection for Gumbel. Round-trip tested against HFunction for all families. - EmpiricalMarginal: the inverse of the rank-based PIT (quantile function at plotting positions k/(n+1)), so samples are returned on the data scale. - IConditionalCopulaPosterior.SampleTau(z): a tau drawn from the GP posterior at z (SparseGPCopulaPosterior samples f ~ posterior), so posterior-predictive draws reflect the latent function's uncertainty rather than just its mean. Vine: - VineStructure enum (Regular / Canonical) added to Fit; the existing MST builder and a new maximum-weight star builder share one selection hook. Fitting/scoring stay general; sampling uses a canonical (C-vine) structure. - RegularVine.Sample(n): the canonical-vine inverse-Rosenblatt simulation (Aas et al. 2009), mapping back to data units via the stored marginals, drawing conditional-edge tau from the GP posterior. Throws a clear message if the vine was fitted Regular. Tests (+10): inverse-h round-trips; the C-vine sampler reproduces the training data's pairwise Kendall tau and marginals (Gaussian and Clayton); sample -> refit recovers tau; Sample throws for a Regular structure; and a Performance-tagged GPVINE posterior-predictive draw reproduces the hub dependence. 88 GPVINE tests passing. General R-vine sampling (arbitrary MST structures) is left as a follow-up; fit with VineStructure.Canonical to sample today. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adds conditional simulation to the canonical-vine sampler: fix a chosen set of variables to observed values and draw the rest from their conditional distribution (imputation / posterior-predictive given partial observations). - EmpiricalMarginal.Cdf: the forward empirical PIT (inverse of Quantile), mapping a known data value to its pseudo-observation. - RegularVine.Fit(..., rootOrder): forces the leading C-vine roots, so a chosen conditioning set becomes a prefix of the vine order. Threaded through the canonical first-/next-tree star selection (forced centre). - RegularVine.SampleConditional(known, n): seeds the inverse-Rosenblatt sweep with the forward transform of the known values (held fixed) and samples the remainder; returns data-scale rows with the conditioned variables equal to their inputs. Conditional-edge tau is drawn from the GP posterior, so imputations are posterior-predictive. Exact when the conditioned set is the leading roots; throws with guidance otherwise. Tests (+3): conditioning fixes the value exactly and shifts the sampled others in the right direction (Gaussian); non-root / regular-structure conditioning throws; and a Performance-tagged GPVINE imputation reproduces the Z-dependent sign of the (X,Y)|Z copula (tau ~ +0.39 at Z=1, -0.71 at Z=-1). 91 GPVINE tests passing. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Author
|
@dotnet-policy-service agree |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR adds Gaussian Process Vine Copulas (GPVINE) to Infer.NET — a method for
modelling the joint dependence structure of multivariate continuous data, where each
conditional dependence can vary as a learned function of conditioning variables.
It implements:
The change is purely additive — it introduces new types and two new methods on
existing public classes (
Factor.BivariateCopula,Variable.BivariateCopula); noexisting behaviour is modified and there are no breaking changes.
Motivation
A copula models dependence separately from marginals; a vine factorizes a
high-dimensional copula into a hierarchy of bivariate copulas. GPVINE drops the usual
"simplifying assumption" by letting each conditional copula's Kendall's τ vary with
its conditioning variables, learned with a sparse GP and Expectation Propagation. This
captures context-dependent dependence (e.g. "the correlation between X and Y changes
with Z") that a covariance matrix or an ordinary copula cannot.
The method is a natural fit for Infer.NET: the GP prior, ARD kernel, FITC sparse
approximation, EP engine, model-evidence blocks and quadrature already exist. The only
genuinely new inference component is a custom copula likelihood EP message operator;
everything else composes existing primitives or is closed-form application code.
What's included
Distributions/Copulas/):IBivariateCopulawithGaussianCopula,ClaytonCopula,GumbelCopula;CopulaFamily+CopulaFactory.Factors/):Factor.BivariateCopulaandBivariateCopulaOp, anon-conjugate copula likelihood on a Gaussian latent (modelled on
LogisticOp/ExpOp,with adaptive-quadrature moment matching and an evidence contribution).
Variable.BivariateCopula(...)so the factor is usable in any graph.Models):GaussianProcessCopulaFitter— fits oneconditional copula via sparse GP + EP.
Distributions/Copulas/Vine/):RegularVine(PIT, Kendall-τ,maximum-spanning-tree structure selection, h-function tree recursion, log-likelihood,
generative sampling and conditional simulation), plus
EmpiricalMarginal,Pit,KendallTau,MaxSpanningTree.Tutorials/GaussianProcessVine.cs([Example], mirrors the GP classifier).Capabilities
imputation / posterior-predictive given partial observations.
Testing
Adds ~90 XUnit tests (
CopulaTests,BivariateCopulaOpTests,VineTests,GPVineTests) covering: copula math (density normalization, h-functions vs numericalCDF derivatives, τ↔θ round-trips, inverse-h round-trips); the EP operator and evidence
vs brute-force quadrature; vine structure invariants and log-likelihood; end-to-end GP
recovery of a known g(z); GPVINE beating SVINE on conditionally-dependent data; and
sampling round-trips. Long integration tests are marked
Category=Performance.All tests pass under
dotnet test -c DebugCorewith the standard exclusion filter.Backward compatibility
Additive only. New files plus two new methods on
Factor/Variable; the new EPoperator is discovered via
[FactorMethod]assembly scanning, so no registration orconfiguration changes are required.
Known limitations (scoped for follow-up)
R-vine sampling is not yet implemented.
rootOrder); arbitrary-subsetconditioning without refitting is not yet supported.
evidence-based hyperparameter optimisation are not implemented.
Commits
Organised as phases for reviewability: copula math → EP factor/operator → single
conditional copula end-to-end → vine first tree → deeper-tree recursion (SVINE) →
GP conditional fitter + driver → Clayton/Gumbel families → sampling → conditional
simulation.