UnSE: Unsupervised Speech Enhancement using Optimal Transport

Online Supplement

Authors

Wenbin Jiang, Fei Wen, Yifan Zhang, Kai Yu

Abstract

Most deep learning-based speech enhancement methods rely on supervised learning, which requires large amounts of paired noisy and clean training data. However, synthesized training data covers realistic environments only partially, and in some scenarios it is difficult or nearly impossible to collect pairs of noisy and ground-truth clean speech. To address this problem, we propose an unsupervised speech enhancement method that does not require any paired noisy-to-clean training data. Specifically, based on the optimal transport criterion, the speech enhancement model is trained in an unsupervised manner using only a fidelity loss computed on the noisy speech and a distribution divergence loss, which minimizes the divergence between the model output and (unpaired) clean speech. Experimental results show that the proposed unsupervised method achieves performance competitive with supervised methods on the VCTK+DEMAND benchmark and better performance on the CHiME4 benchmark.
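To make the two-term objective in the abstract concrete, here is a minimal toy sketch in plain Python. It is not the code from this repository. The function names, the weight `lam`, the use of mean squared error as the fidelity term, and the use of a closed-form 1-D Wasserstein-1 distance as the divergence term are all assumptions made for illustration. A discriminator network (see Setups) would normally estimate the divergence on real speech features.

```python
# Toy sketch of a fidelity term plus a distribution-divergence term.
# All names (fidelity_loss, wasserstein_1d, unsupervised_objective, lam)
# are hypothetical and are not taken from this repository.


def fidelity_loss(enhanced, noisy):
    """Mean squared error between the model output and its noisy input."""
    if len(enhanced) != len(noisy):
        raise ValueError("enhanced and noisy must have the same length")
    return sum((e - n) ** 2 for e, n in zip(enhanced, noisy)) / len(noisy)


def wasserstein_1d(samples_a, samples_b):
    """Wasserstein-1 distance between two equal-size 1-D empirical samples.

    In one dimension the optimal transport plan pairs the sorted samples,
    so the distance is the mean absolute difference of the sorted values.
    """
    if len(samples_a) != len(samples_b):
        raise ValueError("this sketch assumes equal sample counts")
    a, b = sorted(samples_a), sorted(samples_b)
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def unsupervised_objective(enhanced, noisy, unpaired_clean, lam=1.0):
    """Fidelity to the noisy input plus weighted divergence to clean data."""
    return fidelity_loss(enhanced, noisy) + lam * wasserstein_1d(
        enhanced, unpaired_clean
    )


# The clean samples are unpaired: only their distribution matters,
# not their order or their correspondence to the noisy samples.
noisy = [0.0, 1.0, 2.0, 3.0]
enhanced = [0.0, 1.0, 2.0, 2.0]
unpaired_clean = [2.0, 0.0, 2.0, 1.0]
print(unsupervised_objective(enhanced, noisy, unpaired_clean))  # 0.25 + 1.0 * 0.0
```

In this toy case the enhanced samples already match the clean distribution, so the divergence term is zero and only the fidelity term contributes. No pairing between `noisy` and `unpaired_clean` is ever used, which is the point of the unsupervised setup.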

Datasets

The VCTK+DEMAND dataset is used for the demo.

Processed audio samples from the test set are available in the repository (VCTK).

Setups

The neural network architectures of the denoising model (i.e., the generator) and the discriminator are detailed in generator.py and discriminator.py, respectively.

The configurations of both models are detailed in model_arch.py.

Compared methods

OMLSA: Noise Spectrum Estimation in Adverse Environments: Improved Minima Controlled Recursive Averaging

SEGAN: Speech Enhancement Generative Adversarial Network

SASEGAN: Self-Attention Generative Adversarial Network for Speech Enhancement

DOTN: Discriminator-Constrained Optimal Transport Network

Audio Samples