Skip to content

avencera/speakrs

Repository files navigation

speakrs

Fast Rust speaker diarization (who spoke when) with pyannote-level accuracy.

On VoxConverse dev, speakrs CoreML gets 7.1% DER at 529x realtime versus pyannote's 7.2% at 24x. Full results are in benchmarks/.

If you want a small end-to-end app using it, see avencera/smrze.

Overview

speakrs implements the full pyannote community-1 style diarization pipeline in Rust: segmentation, powerset decode, overlap-add aggregation, binarization, embedding, PLDA, and VBx clustering.

There is no Python runtime in the library path. Inference runs on ONNX Runtime or native CoreML, and the rest of the pipeline stays in Rust.

Usage

# macOS (CoreML)
speakrs = { version = "0.4", features = ["coreml"] }

# NVIDIA GPU
speakrs = { version = "0.4", features = ["cuda"] }

# CPU only
speakrs = "0.4"

# System OpenBLAS
speakrs = { version = "0.4", default-features = false, features = ["online", "openblas-system"] }

Quick start

use speakrs::{ExecutionMode, OwnedDiarizationPipeline};

fn main() -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
    let mut pipeline = OwnedDiarizationPipeline::from_pretrained(ExecutionMode::CoreMl)?;

    let audio: Vec<f32> = load_your_mono_16khz_audio_here();
    let result = pipeline.run(&audio)?;

    print!("{}", result.rttm("my-audio"));
    Ok(())
}

Speaker turns

use speakrs::pipeline::{FRAME_DURATION_SECONDS, FRAME_STEP_SECONDS};

let result = pipeline.run(&audio)?;

for segment in result
    .discrete_diarization
    .to_segments(FRAME_STEP_SECONDS, FRAME_DURATION_SECONDS)
{
    println!("{:.3} - {:.3}  {}", segment.start, segment.end, segment.speaker);
}

Background queue

[QueueSender] and [QueueReceiver] run a background worker. Push audio from any thread and read results as they finish:

use speakrs::{ExecutionMode, OwnedDiarizationPipeline, QueuedDiarizationRequest};

let pipeline = OwnedDiarizationPipeline::from_pretrained(ExecutionMode::CoreMl)?;
let (tx, rx) = pipeline.into_queued()?;

std::thread::spawn(move || {
    for (file_id, audio) in receive_files() {
        tx.push(QueuedDiarizationRequest::new(file_id, audio)).unwrap();
    }
});

for result in rx {
    let result = result?;
    print!("{}", result.result?.rttm(&result.file_id));
}

Local models

For offline or airgapped setups, load models from a local directory:

use std::path::Path;
use speakrs::{ExecutionMode, OwnedDiarizationPipeline};

let mut pipeline = OwnedDiarizationPipeline::from_dir(
    Path::new("/path/to/models"),
    ExecutionMode::Cpu,
)?;
let result = pipeline.run(&audio)?;

Choosing a mode

Mode Backend Step Use it for
cpu ONNX Runtime CPU 1s CPU runs and widest compatibility
coreml Native CoreML 1s macOS with CoreML acceleration
coreml-fast Native CoreML 2s macOS with CoreML acceleration and higher throughput
cuda ONNX Runtime CUDA 1s NVIDIA GPU
cuda-fast ONNX Runtime CUDA 2s NVIDIA GPU for higher throughput

The *-fast modes move the segmentation window every 2 seconds instead of every 1 second. That gives the pipeline fewer windows to score, so it can be much faster, but speaker changes may land a little farther from the exact word or pause where they happened.

Use the 1 second modes when you care about exactly when each speaker starts and stops, short clips, interviews with quick back-and-forth, or audio you plan to subtitle or edit. The 2 second modes are usually worth trying for long recordings where speed matters more than exact speaker-change times, such as meetings, lectures, podcasts, or bulk archives.

Benchmarks

VoxConverse dev, collar=0ms:

Platform Implementation DER Time RTFx
Apple M4 Pro speakrs coreml 7.1% 138s 529x
Apple M4 Pro speakrs coreml-fast 7.4% 169s 434x
Apple M4 Pro pyannote community-1 (MPS) 7.2% 2999s 24x
RTX 4090 speakrs cuda 7.0% 1236s 59x
RTX 4090 speakrs cuda-fast 7.4% 604s 121x
RTX 4090 pyannote community-1 (CUDA) 7.2% 2312s 32x

On VoxConverse test, coreml matches pyannote at 11.1% DER and runs at 631x realtime versus pyannote's 23x. cuda matches pyannote at 11.1% DER and runs at 50x realtime versus pyannote's 18x. See benchmarks/ for the full tables across all datasets.

CoreML and ONNX Runtime can differ slightly even in FP32 because the runtime graphs are not identical and floating-point reduction order changes rounding.

Why not pyannote-rs?

pyannote-rs is the main Rust-only comparison point, but it targets a different tradeoff.

speakrs pyannote-rs
Pipeline Full pyannote community-1 style pipeline Simpler window-level pipeline
Aggregation Overlap-add plus binarization No overlap-add or binarization
Clustering PLDA + VBx Cosine threshold
Goal Stay close to pyannote behavior on CPU/CUDA Lightweight Rust diarization

On the VoxConverse dev subset where pyannote-rs emits output, speakrs CoreML scores 11.5% DER versus 80.2% for pyannote-rs. In that same run, pyannote-rs returned no segments on most files.

Models

With the default online feature, models download on first use from avencera/speakrs-models. Set SPEAKRS_MODELS_DIR if you want to force a local bundle instead.

Features and build notes

Common features:

  • online (default): model download via [ModelManager]
  • coreml: native CoreML backend on macOS
  • cuda: NVIDIA CUDA backend via ONNX Runtime
  • load-dynamic: load the CUDA runtime at startup instead of static linking

BLAS backends matter if you disable default features:

  • x86_64 defaults to statically linked Intel MKL
  • non-x86_64 defaults to statically linked OpenBLAS and needs a C toolchain
  • no-default builds must enable exactly one of intel-mkl, openblas-static, or openblas-system
speakrs = { version = "0.4", default-features = false, features = ["online", "intel-mkl"] }
speakrs = { version = "0.4", default-features = false, features = ["online", "openblas-system"] }

The ONNX Runtime dependency (ort 2.0.0-rc.12) is still pre-release.

Public API

Start here:

  • [OwnedDiarizationPipeline]: pipeline entry point
  • [QueueSender] and [QueueReceiver]: background worker interface
  • [DiarizationResult]: frame-level activations, segments, clusters, embeddings, RTTM
  • [PipelineConfig] and [RuntimeConfig]: tuning knobs
  • [ModelManager]: model download when online is enabled
  • [Segment]: a single speaker turn

See CONTRIBUTING.md for local setup, model downloads, fixture generation, and the standard check commands used in this repo.

References

About

Speaker diarization in Rust. 312–912x realtime on Apple Silicon, 50–121x on CUDA. Matches pyannote accuracy.

Topics

Resources

License

Contributing

Stars

Watchers

Forks

Packages

 
 
 

Contributors