This is the accompanying code for my blog post on Denoising AutoEncoders.
A PyTorch implementation of speech enhancement models. It starts with a Denoising Autoencoder (DAE) following Lu et al. (2013), "Speech Enhancement Based on Deep Denoising Autoencoder", with architecture choices informed by Nossier et al. (2020), "An Experimental Analysis of Deep Learning Architectures for Supervised Speech Enhancement", and extends to a time-domain Wave-U-Net (Stoller et al., 2018).
The API documentation can be found here.
Real-world speech is corrupted by additive noise — fan hum, traffic, background chatter — that degrades both intelligibility and downstream processing such as ASR or speaker verification.
A denoising autoencoder learns a direct mapping from noisy spectral features to clean spectral features. During training the model sees pairs (noisy_frame, clean_frame) and is penalised for any reconstruction error. At inference time only the noisy side is available: the encoder compresses it into a bottleneck representation, and the decoder reconstructs a clean estimate from that representation.
Noisy speech frame               Clean speech estimate
        │                                  ▲
        ▼                                  │
 ┌─────────────┐    bottleneck z    ┌──────────────┐
 │   Encoder   │ ─────────────────► │   Decoder    │
 └─────────────┘                    └──────────────┘
The model implemented and used in production is simpleAE_logmag — a fully-connected denoising autoencoder operating on log-magnitude spectrogram frames.
Each utterance is resampled to 8 kHz and transformed via a short-time Fourier transform (256-sample Hann window, 128-sample hop). One frame therefore spans 256 / 2 + 1 = 129 frequency bins. The log-magnitude of each STFT frame forms the model input:
PCM waveform (8 kHz)
        │
        ▼  STFT (n_fft=256, hop=128, Hann window)
Complex spectrogram          (129, n_frames)
        │
        ▼  log(|·| + ε)
Log-magnitude spectrogram    (129, n_frames)
        │
        ▼  one frame per sample
Input vector                 shape: (129,)
Working in the log-magnitude domain offers a compact, perceptually motivated representation: the logarithm compresses the wide dynamic range of speech, and the frame-level granularity keeps the input size small enough for a fully-connected network.
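As a rough illustration of the pipeline above, the following NumPy-only sketch computes log-magnitude frames with the stated STFT settings. The function name `log_magnitude_frames`, the value of ε, the symmetric `np.hanning` window, and the lack of edge padding are assumptions made for this example; the repository's own extractor lives in `libdse/data/features.py` and may differ in all of these details.

```python
import numpy as np


def log_magnitude_frames(waveform, n_fft=256, hop=128, eps=1e-8):
    """Return (log_magnitude, phase), each of shape (n_fft // 2 + 1, n_frames)."""
    window = np.hanning(n_fft)  # symmetric Hann window (assumed variant)
    n_frames = 1 + (len(waveform) - n_fft) // hop  # no padding at the edges
    # Slice the signal into overlapping frames of n_fft samples each.
    frames = np.stack(
        [waveform[i * hop : i * hop + n_fft] for i in range(n_frames)]
    )
    # Real FFT of each windowed frame gives n_fft // 2 + 1 = 129 bins.
    spectrum = np.fft.rfft(frames * window, axis=1).T
    return np.log(np.abs(spectrum) + eps), np.angle(spectrum)


# One second of placeholder audio at 8 kHz.
waveform = np.random.default_rng(0).standard_normal(8000)
log_mag, phase = log_magnitude_frames(waveform)
first_frame = log_mag[:, 0]  # a single model input vector with 129 entries
```

The phase is returned alongside the log-magnitude because waveform reconstruction needs it later.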
The encoder and decoder are symmetric stacks of fully-connected layers with ReLU activations, LayerNorm, and no dropout (as per Nossier et al. architecture (d)):
| Stage | Layer sizes |
|---|---|
| Input | 129 |
| Encoder | 2048 → 500 → 180 (bottleneck) |
| Decoder | 180 → 500 → 2048 → 129 |
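LayerNorm is applied to the input and after each linear layer. Note that the 180-dimensional bottleneck is wider than the 129-dimensional input frame, so it compresses only relative to the 2048-unit hidden layer (roughly 11×), not relative to the input.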
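The layer sizes in the table can be sketched in PyTorch as follows. This is a minimal illustration, not the repository's `VanillaAutoEncoder`: the class name `SketchDAE`, the Linear → LayerNorm → ReLU ordering inside each block, and the plain linear output layer are all assumptions.

```python
import torch
from torch import nn


def _block(n_in, n_out):
    # Linear -> LayerNorm -> ReLU; the LayerNorm/ReLU ordering is an assumption.
    return nn.Sequential(nn.Linear(n_in, n_out), nn.LayerNorm(n_out), nn.ReLU())


class SketchDAE(nn.Module):
    """Hypothetical stand-in for the fully-connected DAE described above."""

    def __init__(self, n_bins=129, bottleneck=180):
        super().__init__()
        self.input_norm = nn.LayerNorm(n_bins)  # LayerNorm on the input frame
        self.encoder = nn.Sequential(
            _block(n_bins, 2048), _block(2048, 500), _block(500, bottleneck)
        )
        self.decoder = nn.Sequential(
            _block(bottleneck, 500),
            _block(500, 2048),
            # Plain linear output (assumed) so estimates can be negative,
            # as log-magnitudes often are.
            nn.Linear(2048, n_bins),
        )

    def forward(self, noisy_frames):
        z = self.encoder(self.input_norm(noisy_frames))  # (batch, 180)
        return self.decoder(z)  # (batch, 129)


model = SketchDAE()
estimate = model(torch.randn(4, 129))  # four frames in, four estimates out
```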
Training pairs are synthesised on the fly. For each utterance, a random excerpt from the DEMAND noise corpus is mixed with the clean speech at an SNR drawn uniformly at random. All 18 DEMAND environments are used by default. The same noise pool is shared between train and validation, so the two splits are not noise-disjoint; random excerpt offsets make it unlikely that any mixture is repeated exactly.
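Mixing at a target SNR comes down to scaling the noise excerpt so that the clean-to-noise power ratio matches the requested value. The sketch below shows one way to do this with NumPy. The function name `mix_at_snr` is hypothetical (the repository's helper is `add_noise_snr` in `libdse/data/noise.py`, whose signature is not shown here), and it assumes the noise recording is at least as long as the utterance.

```python
import numpy as np


def mix_at_snr(clean, noise, snr_db, rng):
    """Add a random excerpt of `noise` to `clean` at `snr_db` decibels."""
    # Random offset into the noise recording (assumes len(noise) >= len(clean)).
    start = rng.integers(0, len(noise) - len(clean) + 1)
    excerpt = noise[start : start + len(clean)]
    clean_power = np.mean(clean**2)
    noise_power = np.mean(excerpt**2) + 1e-12  # guard against silent excerpts
    # Choose the gain so that clean_power / (gain**2 * noise_power) == 10**(snr_db / 10).
    gain = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + gain * excerpt


rng = np.random.default_rng(0)
clean = rng.standard_normal(8000)    # placeholder for a clean utterance
noise = rng.standard_normal(32000)   # placeholder for a DEMAND recording
noisy = mix_at_snr(clean, noise, snr_db=5.0, rng=rng)
```

In a training pipeline, `snr_db` would be drawn afresh for every utterance, for example with `rng.uniform(low, high)`; the bounds are left to the caller because the range is not stated above.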
The model is trained with Adam and an MSE reconstruction loss. A ReduceLROnPlateau scheduler (patience=2, factor=0.5) halves the learning rate once the validation loss has gone more than two consecutive epochs without improving. Key hyperparameters:
| Parameter | Value |
|---|---|
| Epochs | 40 |
| Batch size | 256 |
| Sampling rate | 8 000 Hz |
| STFT window / hop | 256 / 128 samples |
| Bottleneck dim | 180 |
| Optimizer | Adam |
| LR schedule | ReduceLROnPlateau (patience=2, factor=0.5) |
TensorBoard logs (training loss, validation loss, SNR improvement, gradient norms) are written to runs/ and can be inspected with tensorboard --logdir runs.
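The optimiser, loss, and scheduler settings above can be wired together roughly as follows. This is a sketch on random tensors, not the script in `libdse/train/dae.py`: the small stand-in model, the initial learning rate of 1e-3, and the helper names `train_step` and `validate` are assumptions.

```python
import torch
from torch import nn

# Stand-in model; any module mapping (batch, 129) -> (batch, 129) would do.
model = nn.Sequential(nn.Linear(129, 180), nn.ReLU(), nn.Linear(180, 129))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # lr is assumed
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=2
)
loss_fn = nn.MSELoss()


def train_step(noisy, clean):
    """One optimisation step on a batch of (noisy, clean) frame pairs."""
    model.train()
    optimizer.zero_grad()
    loss = loss_fn(model(noisy), clean)
    loss.backward()
    optimizer.step()
    return loss.item()


@torch.no_grad()
def validate(noisy, clean):
    """MSE on held-out frames, without gradient tracking."""
    model.eval()
    return loss_fn(model(noisy), clean).item()


# One toy "epoch" with the batch size from the table.
noisy_batch, clean_batch = torch.randn(256, 129), torch.randn(256, 129)
train_loss = train_step(noisy_batch, clean_batch)
val_loss = validate(noisy_batch, clean_batch)
scheduler.step(val_loss)  # the scheduler watches the validation loss
```

Passing the validation loss (rather than the training loss) to `scheduler.step` is what ties the learning-rate decay to generalisation rather than to fit on the training set.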
At inference time each frame is denoised independently. To recover a waveform the enhanced log-magnitude spectrum is exponentiated back to a magnitude spectrum, the original noisy phase is re-applied, and librosa.istft inverts the result. This phase-borrowing approach avoids the iterative Griffin-Lim procedure while still producing intelligible output.
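Phase borrowing can be illustrated with a NumPy-only sketch. The repository uses `librosa.istft` for the inversion; here a hand-written windowed overlap-add stands in for it so that the example has no audio dependencies. The function name `reconstruct_waveform`, the value of ε, and the symmetric Hann window are assumptions.

```python
import numpy as np


def reconstruct_waveform(log_mag, noisy_phase, n_fft=256, hop=128, eps=1e-8):
    """Invert (129, n_frames) log-magnitudes using the noisy signal's phase."""
    # Undo log(|X| + eps); clip at zero in case the model undershoots.
    magnitude = np.maximum(np.exp(log_mag) - eps, 0.0)
    # Re-apply the phase of the noisy input ("phase borrowing").
    spectrum = magnitude * np.exp(1j * noisy_phase)
    frames = np.fft.irfft(spectrum.T, n=n_fft, axis=1)  # (n_frames, n_fft)

    window = np.hanning(n_fft)
    n_frames = frames.shape[0]
    out = np.zeros(n_fft + hop * (n_frames - 1))
    norm = np.zeros_like(out)
    # Windowed overlap-add, normalised by the summed squared window.
    for i in range(n_frames):
        out[i * hop : i * hop + n_fft] += frames[i] * window
        norm[i * hop : i * hop + n_fft] += window**2
    return out / np.maximum(norm, 1e-8)


# Placeholder inputs: 10 frames of model output and the matching noisy phase.
rng = np.random.default_rng(0)
enhanced_log_mag = rng.standard_normal((129, 10))
noisy_phase = rng.uniform(-np.pi, np.pi, size=(129, 10))
waveform = reconstruct_waveform(enhanced_log_mag, noisy_phase)
```

Because the phase still belongs to the noisy signal, the result is an approximation: the magnitude is cleaned up while phase distortions from the noise remain.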
speech_enhancement/
├── src/
│   └── libdse/
│       ├── nets.py                # VanillaAutoEncoder, WaveUNet
│       ├── evaluation.py          # Evaluation metrics (PESQ, STOI)
│       ├── data/
│       │   ├── features.py        # Feature extractors (log-mag, mel, raw)
│       │   ├── librispeech.py     # LibriSpeechDataset (IterableDataset)
│       │   ├── noise.py           # DEMANDNoiseDataset, add_noise_snr
│       │   └── err.py             # Custom exceptions
│       ├── train/
│       │   └── dae.py             # DAE training script + hyperparameters
│       └── showcases/
│           └── dae.py             # Gradio demo app
├── Dockerfile                     # Containerised Gradio demo
├── models/
│   └── simple_autoencoder_logmag_spec_noisy_clean   # Trained DAE checkpoint
├── data/
│   ├── train-clean-100/           # LibriSpeech training corpus
│   ├── test-clean/                # LibriSpeech test corpus
│   └── noise/DEMAND/              # DEMAND noise recordings
├── tests/
│   ├── resources/                 # Small FLAC fixtures
│   └── test_dataset.py
└── pyproject.toml
Requires Python ≥ 3.12.
git clone <repo-url>
cd speech_enhancement
# Install (uv recommended)
uv pip install -e .
# Or with pip
pip install -e .

Download LibriSpeech train-clean-100 (~6.3 GB) and LibriSpeech test-clean, then extract them under data/. Download the DEMAND corpus and place it under data/noise/DEMAND/. The expected layout:
data/
├── train-clean-100/LibriSpeech/train-clean-100/<speaker>/<chapter>/*.flac
├── test-clean/LibriSpeech/test-clean/<speaker>/<chapter>/*.flac
└── noise/DEMAND/<ENVIRONMENT>/*.wav
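Before starting a long training run it can be worth checking that the files landed where the layout above expects them. The helper below is a small sketch using only the standard library; `count_audio_files` is a hypothetical name and is not part of libdse.

```python
from pathlib import Path


def count_audio_files(data_root="data"):
    """Count audio files found under each expected sub-tree of data_root."""
    root = Path(data_root)
    # Glob patterns mirror the directory layout listed above.
    patterns = {
        "train-clean-100": "train-clean-100/LibriSpeech/train-clean-100/*/*/*.flac",
        "test-clean": "test-clean/LibriSpeech/test-clean/*/*/*.flac",
        "DEMAND": "noise/DEMAND/*/*.wav",
    }
    return {
        name: sum(1 for _ in root.glob(pattern))
        for name, pattern in patterns.items()
    }


if __name__ == "__main__":
    for name, count in count_audio_files().items():
        status = "ok" if count > 0 else "MISSING"
        print(f"{name}: {count} files ({status})")
```

A count of zero for any entry usually means an archive was extracted one directory level too high or too low.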
python -m libdse.train.dae

The checkpoint with the best validation loss is saved to models/simple_autoencoder_logmag_spec_noisy_clean.
A pre-trained checkpoint is included in models/. Launch the interactive demo with:
python -m libdse.showcases.dae

Alternatively, run the containerised version with Docker:
docker build -t libdse-demo .
docker run -p 7860:7860 libdse-demo

Then open http://localhost:7860 in your browser.
The app exposes two tabs:
- Denoise — upload any audio file; the model denoises it and displays spectrograms of the input and output side-by-side.
- Noise mix — upload clean speech, choose a DEMAND environment and a target SNR, and listen to the resulting noisy mixture.
uv run pytest

- Synthesise noisy training pairs (LibriSpeech + DEMAND)
- Fully-connected DAE on log-magnitude spectrogram frames
- Training loop with MSE loss, LR scheduling, TensorBoard logging
- Waveform reconstruction via phase borrowing + istft
- Gradio demo app
- Containerise the Gradio app for server deployment
- Wave-U-Net architecture (libdse.nets.WaveUNet)
- Wave-U-Net training script
- Wave-U-Net validation & evaluation