Avalonia 12 desktop app that captures Windows audio output (any app's playback), streams it to OpenAI's realtime translation model (gpt-realtime-translate), and renders the translated audio + dual-language transcript live. Windows-only today; the audio layer is being abstracted for a macOS port.
- .NET 9 + Avalonia 12 (Fluent theme, Inter font fallback)
- NAudio —
WasapiLoopbackCapturefor input,WasapiOut/WaveOutEventfor playback,WdlResamplingSampleProviderfor high-quality 48 kHz → 24 kHz resampling MessageBox.Avalonia— dialog replacement (Avalonia has no built-inMessageBox)System.Net.WebSockets.ClientWebSocket(built-in) for the realtime APISystem.Text.Json(built-in) for protocol serialization
[Any Windows app] ──► WasapiLoopbackCapture ──► downmix ──► WDL resample to 24 kHz ──► PCM16
│
▼
ClientWebSocket → OpenAI Realtime
│
┌───────────────────────────────────────────────────────────────┴───┐
▼ ▼
translated audio (PCM16) dual transcript deltas
│ │
▼ ▼
WasapiOut device Avalonia TextBox
- Windows 10 / 11
- .NET 9 SDK
- An OpenAI API key with access to
gpt-realtime-translate
dotnet restore
dotnet runOn first launch, click the API… button in the settings panel and paste
your sk-… key. The key is stored locally at
%APPDATA%\Babelive\settings.json (plain JSON, never transmitted anywhere
except to the configured API endpoint).
For a self-contained release build:
dotnet publish -c ReleaseProduces a single Babelive.exe at
bin\Release\net9.0-windows\win-x64\publish\ (~80–90 MB, bundles the
.NET 9 runtime + Avalonia/Skia/HarfBuzz native libs, single-file compressed).
Just ship that one file.
- Pick a target language.
- Pick a Capture source — recommended is All system audio (no echo) which uses Win10 build 20348+ Process Loopback to exclude Babelive's own playback. Per-app entries (Teams, Chrome, Spotify, …) and legacy device loopbacks are also available.
- Pick a Playback device for the translated audio. Read the feedback warning below.
- Click Start, then play any video / call / song.
The settings window is hide-on-close — closing it leaves the lyric overlay + tray icon running. Exit from the tray menu fully quits.
Teams and Skype set AUDCLNT_STREAMFLAGS_PREVENT_LOOPBACK_CAPTURE on their call audio for privacy, so Windows' Process Loopback API returns silence for them. Babelive auto-detects this and, if VB-CABLE is installed, redirects the Teams/Skype process tree to CABLE Input via IAudioPolicyConfig per-app routing, then loopback-captures from the cable. No manual Teams/Skype audio config needed.
Without VB-CABLE installed, Teams/Skype audio cannot be captured — this is a Windows DRM-style restriction, not a Babelive bug.
Zoom / Discord / Google Meet / WebEx / Slack use WebRTC and don't set the flag — they work via plain Process Loopback.
If translated audio plays through the same speakers you're capturing, the loopback re-translates it forever. Three fixes:
- Use headphones for playback (different physical device than the captured speakers).
- Install VB-CABLE — free virtual audio cable. Send the source app's output to
CABLE Input; Babelive can then loopback-capture the cable while playing translation through your real speakers / headphones. - Tick "Transcript only" — only spoken text appears, nothing replays.
Babelive/
├── Babelive.csproj
├── Program.cs ← Avalonia entry point
├── App.axaml / App.axaml.cs ← single-instance gate + window/tray bootstrap
├── MainWindow.axaml / .axaml.cs ← settings window + audio orchestration
├── LyricWindow.axaml / .axaml.cs ← transparent topmost desktop-lyrics overlay
├── ApiSettingsWindow.axaml / .axaml.cs ← API endpoint + key dialog
├── TrayIconHost.cs ← system tray (Avalonia TrayIcon + NativeMenu)
├── AppIcon.cs ← runtime-generated amber 译 disc icon
├── AppSettings.cs / LanguageCodes.cs ← persisted prefs + dropdown options
├── Styles/
│ └── Controls.axaml ← Fluent overrides (buttons, combos, etc.)
├── Audio/
│ ├── LoopbackCapture.cs ← WASAPI / Process Loopback → 24 kHz mono PCM16
│ ├── AudioPlayer.cs ← plays translated PCM16 chunks
│ ├── AudioDucker.cs ← lowers other apps' session volumes during translation
│ ├── EndpointMuter.cs ← "Mute other speakers" — driver-stage endpoint mute
│ ├── DefaultDeviceSetter.cs ← IPolicyConfigVista default-render-device override
│ ├── AppAudioRouter.cs ← IAudioPolicyConfig per-app device routing (Teams)
│ ├── ProcessLoopbackCapture.cs ← Win10 20348+ process-include/exclude loopback
│ └── ProcessTree.cs ← Toolhelp32 process-tree walker (Teams PID family)
└── Translation/
└── RealtimeTranslatorClient.cs ← async ClientWebSocket
The realtime translation API is new. The exact event/field names in RealtimeTranslatorClient.cs are best-effort based on https://developers.openai.com/api/docs/guides/realtime-translation plus the standard /v1/realtime event conventions. If your account sees errors:
- Endpoint: defaults to
wss://api.openai.com/v1/realtime/translations?model=gpt-realtime-translate. Tick "Alt endpoint" in the UI to fall back towss://api.openai.com/v1/realtime?model=gpt-realtime-translate. - Session config:
RealtimeTranslatorClient.SendSessionUpdateAsyncsendssession.updatewithinput_audio_format=pcm16,output_audio_format=pcm16, andtranslation.target_language=<code>. Adjust if the official schema differs. - Event names:
Dispatchmatches both theoutput_*.deltaandresponse.output_*.deltashapes. If transcripts/audio don't arrive, log every incoming event and adjust.
Open YouTube in any non-target language, hit Start, and the translation should start streaming into the lyric overlay (and the settings window's transcript panes) within a second or two of the source audio playing.