Fast Rust speaker diarization (who spoke when) with pyannote-level accuracy.
On VoxConverse dev, speakrs CoreML gets 7.1% DER at 529x realtime versus pyannote's 7.2% at 24x. Full results are in benchmarks/.
If you want a small end-to-end app using it, see avencera/smrze.
speakrs implements the full pyannote community-1 style diarization
pipeline in Rust: segmentation, powerset decode, overlap-add aggregation,
binarization, embedding, PLDA, and VBx clustering.
There is no Python runtime in the library path. Inference runs on ONNX Runtime or native CoreML, and the rest of the pipeline stays in Rust.
# macOS (CoreML)
speakrs = { version = "0.4", features = ["coreml"] }
# NVIDIA GPU
speakrs = { version = "0.4", features = ["cuda"] }
# CPU only
speakrs = "0.4"
# System OpenBLAS
speakrs = { version = "0.4", default-features = false, features = ["online", "openblas-system"] }use speakrs::{ExecutionMode, OwnedDiarizationPipeline};
fn main() -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
let mut pipeline = OwnedDiarizationPipeline::from_pretrained(ExecutionMode::CoreMl)?;
let audio: Vec<f32> = load_your_mono_16khz_audio_here();
let result = pipeline.run(&audio)?;
print!("{}", result.rttm("my-audio"));
Ok(())
}use speakrs::pipeline::{FRAME_DURATION_SECONDS, FRAME_STEP_SECONDS};
let result = pipeline.run(&audio)?;
for segment in result
.discrete_diarization
.to_segments(FRAME_STEP_SECONDS, FRAME_DURATION_SECONDS)
{
println!("{:.3} - {:.3} {}", segment.start, segment.end, segment.speaker);
}[QueueSender] and [QueueReceiver] run a background worker. Push audio
from any thread and read results as they finish:
use speakrs::{ExecutionMode, OwnedDiarizationPipeline, QueuedDiarizationRequest};
let pipeline = OwnedDiarizationPipeline::from_pretrained(ExecutionMode::CoreMl)?;
let (tx, rx) = pipeline.into_queued()?;
std::thread::spawn(move || {
for (file_id, audio) in receive_files() {
tx.push(QueuedDiarizationRequest::new(file_id, audio)).unwrap();
}
});
for result in rx {
let result = result?;
print!("{}", result.result?.rttm(&result.file_id));
}For offline or airgapped setups, load models from a local directory:
use std::path::Path;
use speakrs::{ExecutionMode, OwnedDiarizationPipeline};
let mut pipeline = OwnedDiarizationPipeline::from_dir(
Path::new("/path/to/models"),
ExecutionMode::Cpu,
)?;
let result = pipeline.run(&audio)?;| Mode | Backend | Step | Use it for |
|---|---|---|---|
cpu |
ONNX Runtime CPU | 1s | CPU runs and widest compatibility |
coreml |
Native CoreML | 1s | macOS with CoreML acceleration |
coreml-fast |
Native CoreML | 2s | macOS with CoreML acceleration and higher throughput |
cuda |
ONNX Runtime CUDA | 1s | NVIDIA GPU |
cuda-fast |
ONNX Runtime CUDA | 2s | NVIDIA GPU for higher throughput |
The *-fast modes move the segmentation window every 2 seconds instead of
every 1 second. That gives the pipeline fewer windows to score, so it can be much faster, but speaker changes
may land a little farther from the exact word or pause where they happened.
Use the 1 second modes when you care about exactly when each speaker starts and stops, short clips, interviews with quick back-and-forth, or audio you plan to subtitle or edit. The 2 second modes are usually worth trying for long recordings where speed matters more than exact speaker-change times, such as meetings, lectures, podcasts, or bulk archives.
VoxConverse dev, collar=0ms:
| Platform | Implementation | DER | Time | RTFx |
|---|---|---|---|---|
| Apple M4 Pro | speakrs coreml |
7.1% | 138s | 529x |
| Apple M4 Pro | speakrs coreml-fast |
7.4% | 169s | 434x |
| Apple M4 Pro | pyannote community-1 (MPS) | 7.2% | 2999s | 24x |
| RTX 4090 | speakrs cuda |
7.0% | 1236s | 59x |
| RTX 4090 | speakrs cuda-fast |
7.4% | 604s | 121x |
| RTX 4090 | pyannote community-1 (CUDA) | 7.2% | 2312s | 32x |
On VoxConverse test, coreml matches pyannote at 11.1% DER and runs at
631x realtime versus pyannote's 23x. cuda matches pyannote at 11.1% DER
and runs at 50x realtime versus pyannote's 18x. See
benchmarks/ for
the full tables across all datasets.
CoreML and ONNX Runtime can differ slightly even in FP32 because the runtime graphs are not identical and floating-point reduction order changes rounding.
pyannote-rs is the main Rust-only comparison point, but it targets a different tradeoff.
speakrs |
pyannote-rs |
|
|---|---|---|
| Pipeline | Full pyannote community-1 style pipeline |
Simpler window-level pipeline |
| Aggregation | Overlap-add plus binarization | No overlap-add or binarization |
| Clustering | PLDA + VBx | Cosine threshold |
| Goal | Stay close to pyannote behavior on CPU/CUDA | Lightweight Rust diarization |
On the VoxConverse dev subset where pyannote-rs emits output, speakrs
CoreML scores 11.5% DER versus 80.2% for pyannote-rs. In that same run,
pyannote-rs returned no segments on most files.
With the default online feature, models download on first use from
avencera/speakrs-models.
Set SPEAKRS_MODELS_DIR if you want to force a local bundle instead.
Common features:
online(default): model download via [ModelManager]coreml: native CoreML backend on macOScuda: NVIDIA CUDA backend via ONNX Runtimeload-dynamic: load the CUDA runtime at startup instead of static linking
BLAS backends matter if you disable default features:
x86_64defaults to statically linked Intel MKL- non-
x86_64defaults to statically linked OpenBLAS and needs a C toolchain - no-default builds must enable exactly one of
intel-mkl,openblas-static, oropenblas-system
speakrs = { version = "0.4", default-features = false, features = ["online", "intel-mkl"] }
speakrs = { version = "0.4", default-features = false, features = ["online", "openblas-system"] }The ONNX Runtime dependency (ort 2.0.0-rc.12) is still pre-release.
Start here:
- [
OwnedDiarizationPipeline]: pipeline entry point - [
QueueSender] and [QueueReceiver]: background worker interface - [
DiarizationResult]: frame-level activations, segments, clusters, embeddings, RTTM - [
PipelineConfig] and [RuntimeConfig]: tuning knobs - [
ModelManager]: model download whenonlineis enabled - [
Segment]: a single speaker turn
See CONTRIBUTING.md for local setup, model downloads, fixture generation, and the standard check commands used in this repo.
- pyannote-audio - Python reference implementation
- pyannote community-1 - VBx + PLDA pipeline
- SpeakerKit - Swift reference (same VBx architecture)