whisper.node

Another Node.js binding for whisper.cpp, designed to keep its API as close to whisper.rn as possible.

  • whisper.cpp: Automatic speech recognition with multi-platform support
  • whisper.rn: React Native binding of whisper.cpp

Platform Support

  • macOS
    • arm64: CPU and Metal GPU acceleration
    • x86_64: CPU only
  • Windows (x86_64 and arm64)
    • CPU
    • GPU acceleration via Vulkan
    • GPU acceleration via CUDA (x86_64)
  • Linux (x86_64 and arm64)
    • CPU
    • GPU acceleration via Vulkan
    • GPU acceleration via CUDA
  • Web
    • WASM
    • Optional WebGPU through ggml-webgpu when the WASM package is built with GGML_WEBGPU=ON

Installation

npm install @fugood/whisper.node

Usage

Basic Transcription

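import { initWhisper } from '@fugood/whisper.node'

// Lib variant to load: 'default', 'vulkan', or 'cuda' (see Lib Variants)
const libVariant = 'default'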

const context = await initWhisper({
  model: 'path/to/ggml-base.en.bin',
  useGpu: true,
}, libVariant)

// transcribeFile returns { stop, promise }
const { stop: stop1, promise: promise1 } = context.transcribeFile('audio1.wav', {
  language: 'en',
  temperature: 0.0,
  // ...
})

const result1 = await promise1

// transcribeData also returns { stop, promise }
let audioBuffer // PCM 16-bit, mono, 16kHz
const { stop: stop2, promise: promise2 } = context.transcribeData(audioBuffer, {
  language: 'en',
  temperature: 0.0,
  // ...
})

const result2 = await promise2

// You can also cancel transcription if needed
// await stop1() // Cancels the first transcription
// await stop2() // Cancels the second transcription

// Always release the context when done
await context.release()
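
Because both calls return a { stop, promise } pair, a timeout can be layered on top without touching the library. The following is a minimal sketch: withTimeout is a hypothetical helper, not part of @fugood/whisper.node, and it only assumes that stop() returns a promise as in the commented calls above.

```javascript
// Hypothetical helper (not part of the library): wraps any { stop, promise }
// handle and calls stop() if the work does not finish within `ms` milliseconds.
async function withTimeout(handle, ms) {
  let timer
  const timeout = new Promise((resolve) => {
    timer = setTimeout(() => resolve({ timedOut: true }), ms)
  })
  const done = handle.promise.then((result) => ({ timedOut: false, result }))
  try {
    const outcome = await Promise.race([done, timeout])
    if (outcome.timedOut) await handle.stop() // cancel the in-flight job
    return outcome
  } finally {
    clearTimeout(timer) // do not keep the event loop alive after finishing
  }
}

// Usage with a context created as shown above:
// const outcome = await withTimeout(
//   context.transcribeFile('audio1.wav', { language: 'en' }),
//   30000,
// )
// if (!outcome.timedOut) console.log(outcome.result)
```

What the original promise settles to after stop() is not specified here, so the sketch reports the timeout itself rather than relying on it.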

Voice Activity Detection (VAD)

import { initWhisperVad } from '@fugood/whisper.node'

// Context-based VAD (for multiple detections)
const vadContext = await initWhisperVad({
  model: 'path/to/ggml-vad.bin',
  useGpu: true,
  nThreads: 2
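}, libVariant) // 'default', 'vulkan', or 'cuda' (see Lib Variants)

const result = await vadContext.detectSpeechFile('audio.wav')

// audioBuffer: ArrayBuffer of 16-bit PCM, mono, 16kHz audio
const result2 = await vadContext.detectSpeechData(audioBuffer)
await vadContext.release()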

Note: Audio data must be 16-bit PCM, mono, sampled at 16kHz. The library expects an ArrayBuffer containing the raw audio data.
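
Audio captured elsewhere (for example from the Web Audio API) often arrives as Float32 samples rather than 16-bit PCM. Below is a minimal sketch of the conversion. floatTo16BitPcm is a hypothetical helper, and it assumes the samples are already mono at 16kHz and that little-endian byte order (as used in WAV files) is what the library expects; neither assumption is confirmed by this README.

```javascript
// Hypothetical helper: convert Float32 samples in [-1, 1] into a
// little-endian 16-bit PCM ArrayBuffer. No resampling or downmixing is done.
function floatTo16BitPcm(samples) {
  const buffer = new ArrayBuffer(samples.length * 2)
  const view = new DataView(buffer)
  for (let i = 0; i < samples.length; i++) {
    // Clamp first so out-of-range input cannot wrap around.
    const s = Math.max(-1, Math.min(1, samples[i]))
    // Negative values scale to -32768, positive values to 32767.
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true)
  }
  return buffer
}

// Usage (context created as in Basic Transcription):
// const audioBuffer = floatTo16BitPcm(float32Samples)
// const { promise } = context.transcribeData(audioBuffer, { language: 'en' })
```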

Native Logs

import {
  addNativeLogListener,
  isNativeLogEnabled,
  toggleNativeLog,
} from '@fugood/whisper.node'

const logs = addNativeLogListener((level, text) => {
  console.log(`[whisper ${level}] ${text}`)
})

await toggleNativeLog(true)
console.log(isNativeLogEnabled())

// ...

await toggleNativeLog(false)
logs.remove()

Log levels are emitted as lowercase error, warn, info, or debug strings. The same helpers are available in Node.js and browser WASM builds.
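
Since the listener receives the level as a plain string, filtering can happen in the callback. The sketch below builds such a callback. makeLevelFilter is a hypothetical helper, and the severity ordering (error most severe, debug least) is an assumption based on common convention rather than something this README states.

```javascript
// Assumed severity order, most severe first.
const LEVELS = ['error', 'warn', 'info', 'debug']

// Hypothetical helper: returns a (level, text) listener that forwards only
// messages at `maxLevel` or more severe to `sink`.
function makeLevelFilter(maxLevel, sink) {
  const threshold = LEVELS.indexOf(maxLevel)
  if (threshold === -1) throw new Error(`unknown level: ${maxLevel}`)
  return (level, text) => {
    const rank = LEVELS.indexOf(level)
    // Unknown levels are dropped rather than guessed at.
    if (rank !== -1 && rank <= threshold) sink(level, text)
  }
}

// Usage with the helpers imported above:
// const logs = addNativeLogListener(
//   makeLevelFilter('warn', (level, text) => console.warn(`[whisper ${level}] ${text}`)),
// )
```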

Browser WASM

The browser package keeps the same promise-based initWhisper and initWhisperVad entry points. In browsers, filePath is treated as a URL and the model is fetched into the WASM filesystem.

import { initWhisper, initWhisperVad } from '@fugood/whisper.node'

const whisper = await initWhisper({
  filePath: 'https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.en.bin',
  maxModelBytes: 1536 * 1024 * 1024,
  useGpu: false,
})

const { promise } = whisper.transcribeFile('https://raw.githubusercontent.com/ggml-org/whisper.cpp/master/samples/jfk.wav', {
  language: 'en',
  temperature: 0,
})

console.log(await promise)
await whisper.release()

const vad = await initWhisperVad({
  filePath: 'https://huggingface.co/ggml-org/whisper-vad/resolve/main/ggml-silero-v6.2.0.bin',
  useGpu: false,
})
console.log(await vad.detectSpeechFile('https://raw.githubusercontent.com/ggml-org/whisper.cpp/master/samples/jfk.wav'))
await vad.release()

The browser package ships both single-thread and pthread WASM artifacts. On cross-origin isolated pages (Cross-Origin-Opener-Policy: same-origin and Cross-Origin-Embedder-Policy: require-corp), the loader uses the pthread artifact with SharedArrayBuffer; otherwise it falls back to the single-thread artifact automatically.

Oversized model downloads fail before loading into MEMFS. Firefox is capped at 256 MiB by default; other browsers default to 75% of the configured WASM maximum memory. Pass maxModelBytes only when you know the target browser can allocate the model.

Whisper transcription defaults to up to 8 threads based on browser hardware concurrency when pthreads are available; pass maxThreads to override it. Browser WASM clamps maxThreads to the compiled pool limit of 8, or 1 in the single-thread fallback.

Browser pages run model loading, transcription, benchmarks, and VAD in a dedicated module worker by default so the UI thread can keep rendering. Use the main whisper.node package entrypoint in browser code too:

import { configureWasm, initWhisper } from '@fugood/whisper.node'
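
The artifact, thread, and model-size defaults described above can be summarized as a small decision function. This is only an illustration of the stated rules, not the loader's actual code: describeWasmDefaults and all of its input field names are hypothetical, and the caller supplies the configured WASM maximum memory.

```javascript
const MIB = 1024 * 1024
const POOL_LIMIT = 8 // compiled pthread pool limit stated above

// Hypothetical summary of the documented defaults.
function describeWasmDefaults(env) {
  // pthreads need SharedArrayBuffer, which requires cross-origin isolation.
  const usePthreads = env.crossOriginIsolated === true
  const limit = usePthreads ? POOL_LIMIT : 1
  const requested = env.maxThreads ?? env.hardwareConcurrency ?? 1
  const threads = Math.max(1, Math.min(requested, limit))
  // Firefox: 256 MiB cap; other browsers: 75% of the WASM maximum memory.
  const maxModelBytes = env.isFirefox
    ? 256 * MIB
    : Math.floor(env.wasmMaxMemoryBytes * 0.75)
  return {
    artifact: usePthreads ? 'pthread' : 'single-thread',
    threads,
    maxModelBytes,
  }
}

// In a browser, the inputs could come from globalThis.crossOriginIsolated
// and navigator.hardwareConcurrency.
```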

Use configureWasm({ worker: false }) only when you explicitly need the in-thread runtime, configureWasm({ threads: false }) to force the single-thread artifact, or pass workerPath, jsPath, and wasmPath when serving the package files from custom URLs. The older workerUrl and runtimeScriptUrl option names still work.

Model downloads are cached in browser Cache Storage by default. Pass cacheModel: false to disable persistent caching, modelCacheName to isolate the cache namespace, or modelCacheKey when the fetch URL is a proxy or signed URL but should reuse the same cached model.
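
For signed URLs, one way to obtain a stable value for modelCacheKey is to drop the query string and fragment, which is where expiring signatures usually live. stableModelCacheKey below is a hypothetical helper sketching that idea; whether origin plus path identifies a model uniquely depends on your hosting setup.

```javascript
// Hypothetical helper: derive a cache key that ignores the query string and
// fragment, so re-signed URLs for the same file map to the same key.
function stableModelCacheKey(url) {
  const parsed = new URL(url)
  return `${parsed.origin}${parsed.pathname}`
}

// Pass the returned string as modelCacheKey alongside the signed fetch URL.
```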

Build the browser package with:

npm run build-wasm

Or build with the Emscripten Docker image:

npm run build-wasm-docker

npm run build-wasm enables GGML_WEBGPU=ON by default and emits wasm/whisper-node.js, wasm/whisper-node.wasm, wasm/whisper-node.threads.js, and wasm/whisper-node.threads.wasm. Use bash scripts/build-wasm.sh --no-webgpu for a CPU-only WASM build, or --no-threads / --threads to build only one CPU threading variant. Pass --single-file only when you want the WASM binary embedded into each generated JS file.

Modern Emscripten embeds the pthread worker bootstrap in the main JS file, so a separate whisper-node.worker.js is not expected. The browser package also ships its own module worker.js wrapper for non-blocking model load and inference.

npm run build-wasm-docker uses emscripten/emsdk:4.0.14-arm64 on arm64 hosts such as Apple Silicon Macs, and emscripten/emsdk:4.0.13 on amd64 hosts. Override with EMSCRIPTEN_IMAGE or EMSCRIPTEN_PLATFORM when needed.

A local smoke page is available after building:

node examples/wasm/server.mjs

In the WASM package, useGpu: true enables WebGPU for whisper transcription when the browser supports navigator.gpu. VAD currently falls back to CPU in the browser package because the Silero VAD graph hits unsupported WebGPU ops.
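
If the application wants to know up front whether WebGPU is usable (for example to show which backend is in use), it can probe the browser itself. The sketch below uses the standard navigator.gpu.requestAdapter() call. detectWebGpu is a hypothetical helper and is independent of whatever check the package performs internally.

```javascript
// Hypothetical helper: resolves to true only if the WebGPU API is present
// and an adapter can actually be obtained.
async function detectWebGpu(nav) {
  if (!nav || !nav.gpu || typeof nav.gpu.requestAdapter !== 'function') {
    return false
  }
  try {
    // requestAdapter() resolves to null when no suitable adapter exists.
    const adapter = await nav.gpu.requestAdapter()
    return adapter != null
  } catch {
    return false
  }
}

// Usage in a browser:
// const useGpu = await detectWebGpu(globalThis.navigator)
```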

Lib Variants

  • default: General usage; no GPU support except on macOS (Metal)
  • vulkan: GPU support via Vulkan (Windows/Linux); may be unstable in some scenarios
  • cuda: GPU support via CUDA (Windows/Linux), limited to specific compute capabilities

    Linux: (x86_64: 8.9, arm64: 8.7) Windows: x86_64 - 12.0
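
As a rough sketch of how the variant might be chosen at runtime, the helper below maps a platform string (as returned by Node's process.platform) and a preferred GPU backend onto one of the three variant names listed above. pickLibVariant is hypothetical and does not check whether a GPU or driver is actually present.

```javascript
// Hypothetical helper: choose a lib variant name from the platform and a
// preferred GPU backend ('vulkan' or 'cuda'). Falls back to 'default'.
function pickLibVariant(platform, preferred) {
  // macOS uses the default variant, which already includes Metal on arm64.
  if (platform === 'darwin') return 'default'
  const gpuCapable = platform === 'win32' || platform === 'linux'
  if (gpuCapable && (preferred === 'vulkan' || preferred === 'cuda')) {
    return preferred
  }
  return 'default'
}

// Usage:
// const libVariant = pickLibVariant(process.platform, 'vulkan')
```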

License

MIT


Built and maintained by BRICKS.
