anomstream

A composable Rust toolkit for streaming anomaly detection. Multiple detector families (multivariate, per-feature, score-level) plus the primitives needed to turn them into production pipelines — streaming stats, normalisation, probability calibration, alert clustering, feedback loops, SOC triage, hot-path ingress.

Among the detectors: an AWS-conformant Random Cut Forest implementation (Guha et al. ICML 2016) — one of several detectors.

Powers the ML detection pipeline of the eBPFsentinel Enterprise NDR agent; designed to be reused anywhere a stream of high-dim observations needs online scoring.

Scope

In scope — streaming, bounded-memory, online-update primitives:

Multivariate anomaly detection (Random Cut Forest and variants)
Time-series discord / motif (Matrix Profile / STOMP — exact batch complement to the online shingled forest)
Evaluation metric (VUS-PR — threshold-free, length-aware AUC-PR for time-series) + TSB-AD-M CSV loader
Per-feature drift detectors (EWMA z-score, two-sided CUSUM, PSI / KL)
Score-level drift + regime-change (meta CUSUM, ADWIN, SPOT / DSPOT)
Streaming stats + sketches (Welford OnlineStats, t-digest, histograms, Count-Min Sketch, HyperLogLog, Space-Saving top-K, Bloom filter)
Normalisation (Normalizer<D> — min-max / z-score / identity)
Explanation + triage (per-dim attribution, SAGE Shapley, Platt calibration, severity bands, alert clustering, SOC feedback, audit trail, forensic baseline)
Hot-path ingress (sampler, rate cap, bounded MPSC channel, pluggable metrics sink)

Out of scope — intentionally absent to keep the library focused:

Protocol parsers, IP-centric trackers, L7 intelligence
ONNX / torch runtimes, supervised model training
Rule synthesis, policy engines
Density estimation, forecasting, GLAD variant, near-neighbour list, feature-completion impute() (the RCF "imputation" idea is repurposed as a SOC-triage forensic_baseline helper).

The Random Cut Forest implementation inside the toolkit is a focused port of the 2016 paper — not an attempt to match every feature of AWS's randomcutforest-by-aws.

Catalogue

Multivariate anomaly detectors — operate on the joint [f64; D] distribution

RandomCutForest<D> — AWS-conformant aggregate root (Guha 2016)
ThresholdedForest<D> — adaptive threshold wrapper (TRCF)
ShingledForest — scalar-stream temporal wrapper over the forest
MatrixProfile — STOMP exact batch time-series discord / motif (complements ShingledForest)
DynamicForest — runtime-dim variant
DriftAwareForest — shadow-swap recovery when a drift detector fires
TenantForestPool — bounded per-tenant forest pool with LRU eviction

Per-feature univariate detectors — one accumulator per dimension, finds which feature drifted

PerFeatureEwma<D> — parallel univariate EWMA z-score detector
PerFeatureCusum<D> — parallel two-sided CUSUM change-point detector
FeatureDriftDetector<D> — PSI / KL distributional drift on raw features

Score-level drift + regime change — operate on a scalar anomaly-score stream

MetaDriftDetector — two-sided CUSUM on the score stream
AdwinDetector — adaptive windowing (Bifet 2007)
PotDetector — SPOT / DSPOT univariate Peaks-Over-Threshold (Siffer 2017)
fisher_combine — combine k independent p-values into one test statistic

Streaming stats + sketches — bounded-memory summaries reused across detectors

OnlineStats — Welford streaming mean + variance
TDigest — Dunning streaming quantile digest
ScoreHistogram — fixed-bin score histogram
CountMinSketch — probabilistic frequency sketch (std-gated)
HyperLogLog — probabilistic distinct-count / cardinality sketch (std-gated)
SpaceSaving<K> — deterministic O(K)-memory top-K heavy hitters (std-gated, complements CountMinSketch)
BloomFilter — probabilistic set-membership for IOC lookup (std-gated, zero false negatives, tunable FPR)
Normalizer<D> — per-feature MinMax / ZScore / None transforms (with fit(&[[f64; D]]) learner)

Explanation + triage

DiVector + FeatureGroups — per-dim and per-group attribution
AttributionStability — inter-tree dispersion + confidence
SageEstimator<D> — SAGE Shapley attribution (Covert 2020)
PlattCalibrator — batch + online-SGD probability calibration
SeverityBands / Severity — ordinal severity classification

SOC + ops

AlertClusterer / LshAlertClusterer — cosine + LSH alert dedup (LSH carries per-instance random seed against collision-craft attacks)
FeedbackStore — SOC-label-driven score adjustment, capped at MAX_CAPACITY = 65 536 labels
AlertRecord / AlertContext — immutable alert envelope (triage crate); #[serde(deny_unknown_fields)] rejects schema-drift splices
AuditChain / AuditChainEntry / verify_audit_chain — HMAC-SHA256-chained tamper-evident audit trail (audit-integrity feature)
ForensicBaseline — post-hoc distance-to-sample summary

Hot-path ingress

hot_path::UpdateSampler (new / new_keyed / new_keyed_with_seeds) — stride or per-flow-hash sampler, optional 128-bit per-instance secret (against MITRE ATLAS AML.T0020), with caller-supplied-seed variant for restricted environments where getrandom is unavailable
hot_path::PrefixRateCap::new(NonZeroU32, NonZeroU64) / disabled(NonZeroU64) — typed-cap 256-bucket atomic counter sketch, cache-line-padded buckets defeat false sharing across cores
hot_path::update_channel / try_update_channel — bounded MPSC channel (capacity 1..=MAX_CHANNEL_CAPACITY, validated) for classifier/updater thread split; non-panicking try_* Result variants
MetricsSink — pluggable telemetry (NoopSink + your own impl); hot-path dispatch is batched every METRICS_BATCH_SIZE = 64 ops (≈64× fewer vtable calls under line-rate load), call flush_metrics() at shutdown to drain residue

Evaluation

vus_pr / vus_pr_with_buffer / range_auc_pr — Volume Under Surface PR (Paparrizos VLDB 2022), threshold-free length-aware quality metric
TsbAdMDataset — CSV loader for the TSB-AD-M multivariate benchmark (Liu & Paparrizos NeurIPS 2024)
examples/tsb_ad_m_eval.rs — end-to-end runner: load one TSB-AD-M CSV, score with DynamicForest, report VUS-PR

See docs/features.md for the full module catalogue with per-feature rationale.

Crate layout

The toolkit ships as a Cargo workspace with four members. Consumers can depend on the anomstream meta-crate for the simple case or pick individual members when minimising dep graph matters.

Crate	Role	Publish
`anomstream`	Facade — feature-gated re-exports of the three members. Primary public-facing crate.	`anomstream`
`anomstream-core`	Detectors + streaming primitives + cross-cut contracts (`MetricsSink`, `SeverityBands`, `ForestSnapshot`). Targets SemVer 1.0 first.	`anomstream-core`
`anomstream-triage`	SOC-opinionated higher-level layer — Platt, SAGE, alert clustering, feedback store, alert record. Depends on core.	`anomstream-triage`
`anomstream-hotpath`	Opinionated eBPF-style ingress primitives — `UpdateSampler`, `PrefixRateCap`, bounded MPSC `channel`. Depends on core.	`anomstream-hotpath`

Three consumption patterns:

# Default — core detectors + primitives, minimal dep graph.
# Every other layer is an explicit opt-in so downstream consumers
# do not pay for dependencies they never use.
[dependencies]
anomstream = "0.0.0-dev"

# Full facade — core + triage + hotpath + parallel + serde + postcard.
# Convenience for deployments that want everything wired in.
[dependencies]
anomstream = { version = "0.0.0-dev", features = ["full"] }

# Fine-grained — pick the layers you need.
[dependencies]
anomstream = { version = "0.0.0-dev", features = ["core", "triage", "serde"] }

# Member-direct — when you want per-member SemVer tracking.
[dependencies]
anomstream-core   = { version = "0.0.0-dev", features = ["parallel", "serde"] }
anomstream-triage = { version = "0.0.0-dev" }

Quickstart

Two examples — one per detector family — to show the toolkit is more than its forest.

Multivariate: Random Cut Forest

use anomstream::ForestBuilder;

let mut forest = ForestBuilder::<4>::new()
    .num_trees(100)
    .sample_size(256)
    .seed(42)
    .build()?;

for point in stream_of_points {
    forest.update(point)?;
    let score = forest.score(&point)?;
    if f64::from(score) > 1.5 {
        eprintln!("anomaly: {score}");
    }
}
# Ok::<(), anomstream::RcfError>(())

Per-feature: two-sided CUSUM change-point

use anomstream::{PerFeatureCusum, PerFeatureCusumConfig};

let mut det = PerFeatureCusum::<4>::new(PerFeatureCusumConfig {
    slack: 0.5,
    threshold: 5.0,
});

for point in stream_of_points {
    let result = det.observe(&point);
    for alert in &result.alerts {
        eprintln!(
            "drift on feature {} ({:?}) magnitude {:.2}",
            alert.feature_index, alert.direction, alert.magnitude
        );
    }
}

The two detectors compose: feed the forest's scalar score into MetaDriftDetector for score-level regime change; run PerFeatureCusum alongside for per-feature attribution; wrap everything in ThresholdedForest for adaptive alerting.

Algorithms

Each detector cites the paper it implements. Representative references by family:

Random Cut Forest — Guha, Mishra, Roy, Schrijvers, Robust Random Cut Forest Based Anomaly Detection on Streams, ICML 2016. Reservoir sampling without replacement: Park, Ostrouchov, Samatova, Geist — SIAM SDM 2004.
EWMA — Hunter, The Exponentially Weighted Moving Average, JQT 18(4), 1986.
CUSUM — Page, Continuous Inspection Schemes, Biometrika 41, 1954. Two-sided variant: Hawkins & Olwell, 1998.
ADWIN — Bifet, Learning from Time-Changing Data with Adaptive Windowing, SIAM SDM 2007.
SPOT / DSPOT — Siffer et al., Anomaly Detection in Streams with Extreme Value Theory, KDD 2017.
t-digest — Dunning, Computing Extremely Accurate Quantiles using t-Digests, 2019.
Count-Min Sketch — Cormode & Muthukrishnan, JoA 55(1), 2005.
HyperLogLog — Flajolet, Fusy, Gandouet, Meunier — AofA 2007. HyperLogLog in Practice: Heule, Nunkesser, Hall, EDBT 2013.
Space-Saving — Metwally, Agrawal, El Abbadi, Efficient Computation of Frequent and Top-k Elements in Data Streams, ICDT 2005.
Bloom filter — Bloom, Space/Time Trade-offs in Hash Coding with Allowable Errors, CACM 13(7), 1970. Double-hashing: Kirsch & Mitzenmacher, Less Hashing, Same Performance, ESA 2006.
Matrix Profile / STOMP — Zhu, Zimmerman, Senobari, Yeh, Funning, Mueen, Brisk, Keogh, Matrix Profile II: Exploiting a Novel Algorithm and GPUs…, ICDM 2016. Original MP: Yeh et al., Matrix Profile I, ICDM 2016.
VUS-PR — Paparrizos, Boniol, Palpanas, Tsay, Elmore, Franklin, Volume Under the Surface: A New Accuracy Evaluation Measure for Time-Series Anomaly Detection, VLDB 2022.
TSB-AD-M — Liu, Paparrizos, The Elephant in the Room: Towards A Reliable Time-Series Anomaly Detection Benchmark, NeurIPS 2024.
SAGE — Covert, Lundberg, Lee, Understanding Global Feature Contributions Through Additive Importance Measures, NeurIPS 2020.
Welford variance — Welford, Technometrics 4(3), 1962.

The Random Cut Forest implementation conforms to the AWS SageMaker hyperparameter bounds (feature_dim, num_trees, num_samples_per_tree, time_decay) — enforced at build time.

Details: docs/conformance_rcf.md.

Features

Cargo feature	Default	Role
`core`	✅	Re-export of `anomstream-core` (bare forest + primitives)
`std`	✅	Standard library support (unlocks the full module surface)
`triage`	❌	Re-export of `anomstream-triage` (Platt, SAGE, LSH, feedback, audit)
`hotpath`	❌	Re-export of `anomstream-hotpath` (sampler, rate cap, MPSC channel)
`parallel`	❌	Per-tree / batch parallelism via `rayon` (implies `std`)
`serde`	❌	State serialisation
`postcard`	❌	Compact binary persistence (implies `serde`)
`serde_json`	❌	JSON persistence (implies `serde`)
`audit-integrity`	❌	HMAC-SHA256-chained tamper-evident `AuditChain` (pulls `hmac` + `sha2` + `subtle`; implies `triage + std + serde + postcard`)
`full`	❌	Convenience alias for `core + triage + hotpath + std + parallel + serde + postcard + serde_json + audit-integrity`

The facade default is ["core", "std"] — deliberately minimal so consumers pay only for what they import. Enable full for the "everything wired in" deployment or cherry-pick layers explicitly.

The member crates (anomstream-core, anomstream-triage, anomstream-hotpath) ship with default = [] so direct-dep consumers do not pay for std / serde / postcard they never use. The historical "everything on" set is still reachable via the facade's default features or by enabling ["std", "serde", "postcard"] explicitly.

Module availability table

Module / type	Requires feature
`RandomCutForest`, `ThresholdedForest`, `RcfConfig`, `ForestBuilder`	always (core, no_std + alloc)
`OnlineStats`, `Normalizer`, `PerFeatureEwma`, `PerFeatureCusum`, `FeatureDriftDetector`, `MetaDriftDetector`	always (core, no_std + alloc)
`TDigest`, `ScoreHistogram`, `ForensicBaseline`, `SeverityBands`, `AttributionStability`, `BootstrapReport`	always (core, no_std + alloc)
`AdwinDetector`, `PotDetector`, `ensemble::fisher_combine`	`std`
`CountMinSketch`, `HyperLogLog`, `SpaceSaving`, `BloomFilter`	`std`
`ShingledForest`, `DynamicForest`, `DriftAwareForest`, `TenantForestPool`, `MatrixProfile`	`std`
`TsbAdMDataset`, `vus_pr` / `range_auc_pr`	`std`
`AlertClusterer`, `AlertRecord`, `FeedbackStore`, `PlattCalibrator`, `SageEstimator`, `LshAlertClusterer`	`triage` (+ `std`)
`AuditChain`, `AuditChainEntry`, `verify_audit_chain`, `AUDIT_CHAIN_*` consts	`audit-integrity`
`UpdateSampler`, `PrefixRateCap`, `update_channel`, `try_update_channel`, `MetricsSink`, `MAX_CHANNEL_CAPACITY`, `METRICS_BATCH_SIZE`	`hotpath` (+ `std`)

`no_std` + `alloc`

default-features = false drops every std-gated module listed above. The always-available set — the bare forest, ring-buffer sampler, thresholded wrapper, meta / feature drift detectors, t-digest, histogram, forensic baseline, severity bands, and the companion primitives (OnlineStats, Normalizer<D>, PerFeatureEwma<D>, PerFeatureCusum<D>) — runs under #![no_std] with alloc. Transcendentals (ln, sqrt, exp, …) route through num-traits + libm; hashing-dependent code paths fall back to alloc::collections::BTreeMap.

[dependencies]
anomstream = { version = "…", default-features = false, features = ["core"] }
# Optional: serde persistence under no_std
anomstream = { version = "…", default-features = false, features = ["core", "serde"] }

The no_std configuration is gated in CI (cargo check --no-default-features + --features serde).

Performance

See docs/performance.md for the full criterion bench matrix. Benches are split across the three member crates: cargo bench -p anomstream-core --bench modules (detectors + primitives), cargo bench -p anomstream-triage --bench modules (Platt, SAGE, LSH), cargo bench -p anomstream-hotpath --bench modules (sampler, rate cap, channel).

Quality evaluation (TSB-AD-M)

For detection-quality benchmarking, anomstream ships the VUS-PR metric (Paparrizos VLDB 2022) and a TSB-AD-M CSV loader. Dataset isn't bundled — download from https://github.com/thedatumorg/TSB-AD (≈ 1 GiB).

Single-file run:

cargo run --release --example tsb_ad_m_eval -- /path/to/TSB-AD-M/MSL_1_001.csv
# MSL_1_001.csv  n=2000  dim=55  pos=123  VUS-PR=0.4312  elapsed=218ms

Loop over the full folder (bash):

for f in /path/to/TSB-AD-M/*.csv; do
    cargo run --release --example tsb_ad_m_eval -- "$f"
done | tee vus_pr.log

The example uses DynamicForest<128> with a 50 % calibration / 50 % scoring split. Swap in MatrixProfile or your own detector by editing core/examples/tsb_ad_m_eval.rs.

License

Apache-2.0. Contributions under the same licence.

Name		Name	Last commit message	Last commit date
Latest commit History 167 Commits
.github		.github
core		core
docs		docs
hotpath		hotpath
meta		meta
triage		triage
.editorconfig		.editorconfig
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
LICENSE		LICENSE
README.md		README.md
deny.toml		deny.toml
rust-toolchain.toml		rust-toolchain.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

anomstream

Scope

Catalogue

Crate layout

Quickstart

Multivariate: Random Cut Forest

Per-feature: two-sided CUSUM change-point

Algorithms

Features

Module availability table

`no_std` + `alloc`

Performance

Quality evaluation (TSB-AD-M)

License

About

Uh oh!

Releases 1

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

anomstream

Scope

Catalogue

Crate layout

Quickstart

Multivariate: Random Cut Forest

Per-feature: two-sided CUSUM change-point

Algorithms

Features

Module availability table

no_std + alloc

Performance

Quality evaluation (TSB-AD-M)

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Contributors

Uh oh!

Languages

`no_std` + `alloc`