Speech Enhancement with Integration of Neural Homomorphic Synthesis and Spectral Masking Online Supplement

Authors

Wenbin Jiang, Kai Yu

Abstract

Speech enhancement refers to suppressing background noise to improve the perceptual quality and intelligibility of noisy speech. Recently, speech enhancement algorithms based on deep neural networks (DNNs) have replaced traditional algorithms based on statistical signal processing and have become mainstream in the field. However, most DNN-based speech enhancement methods operate in the frequency domain and do not exploit the speech production model, which makes them prone to under-suppressing the noise or over-suppressing the speech. To address this shortcoming, we propose a novel speech enhancement method that integrates neural homomorphic synthesis and complex spectral masking. Specifically, we use a shared-encoder, multi-decoder neural network architecture. For neural homomorphic synthesis, the speech signal is separated into excitation and vocal-tract components by liftering the cepstrum; two DNN decoders estimate the target components independently, and the denoised speech is synthesized from the estimated minimum-phase signal and the noisy phase. For spectral masking, another DNN decoder estimates the complex mask of the target spectrum, and the denoised speech spectrum is obtained by masking the noisy spectrum. Each branch estimates the speech signal independently, and the final enhanced speech is obtained by merging the two estimates. Experimental results on two popular datasets show that the proposed method achieves state-of-the-art performance on most evaluation metrics with only 920K model parameters.
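The signal-processing ideas behind the two branches can be illustrated with a short NumPy sketch (this is not the authors' code; frame length, lifter cutoff, and the identity mask are illustrative assumptions). It shows (1) splitting a frame's log-magnitude spectrum into vocal-tract (low-quefrency) and excitation (high-quefrency) components by liftering the real cepstrum, (2) rebuilding a minimum-phase complex spectrum from a log-magnitude via the standard cepstral folding trick, and (3) complex spectral masking as an element-wise complex multiplication.

```python
import numpy as np

def lifter_split(frame, n_fft=512, cutoff=30):
    """Split one windowed frame into vocal-tract / excitation log-spectra."""
    spec = np.fft.rfft(frame, n_fft)
    log_mag = np.log(np.abs(spec) + 1e-8)
    cep = np.fft.irfft(log_mag, n_fft)        # real cepstrum (even-symmetric)
    lifter = np.zeros(n_fft)
    lifter[:cutoff] = 1.0                     # low quefrency -> vocal tract
    lifter[-cutoff + 1:] = 1.0                # symmetric counterpart
    vocal_tract = np.fft.rfft(cep * lifter, n_fft).real
    excitation = log_mag - vocal_tract        # high quefrency -> excitation
    return vocal_tract, excitation

def minimum_phase(log_mag, n_fft=512):
    """Reconstruct a minimum-phase complex spectrum from a log-magnitude."""
    cep = np.fft.irfft(log_mag, n_fft)
    fold = np.zeros(n_fft)                    # fold the cepstrum to be causal
    fold[0] = cep[0]
    fold[1:n_fft // 2] = 2.0 * cep[1:n_fft // 2]
    fold[n_fft // 2] = cep[n_fft // 2]
    return np.exp(np.fft.rfft(fold, n_fft))

def apply_complex_mask(noisy_spec, mask):
    """Complex spectral masking: element-wise complex multiplication."""
    return noisy_spec * mask

# Toy usage: a synthetic windowed frame and an identity mask.
np.random.seed(0)
frame = np.hanning(512) * np.random.randn(512)
vt, ex = lifter_split(frame)
min_phase_spec = minimum_phase(vt + ex)       # recombine the two components
enhanced = apply_complex_mask(min_phase_spec, np.ones_like(min_phase_spec))
```

In the proposed method the liftered components and the complex mask are of course estimated by the DNN decoders rather than computed from the noisy frame itself; the sketch only shows the deterministic synthesis path around them. A useful sanity check is that the minimum-phase reconstruction preserves the magnitude: `np.abs(min_phase_spec)` equals `np.exp(vt + ex)` up to floating-point error.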

Datasets

  • The VoiceBank+DEMAND and DNS-Challenge datasets are used for the demo.
  • Audio samples of the two processed test sets are available in the repository (voicebank, dns2020).

Compared methods

  • DEMUCS: Real Time Speech Enhancement in the Waveform Domain
  • MetricGAN+: An Improved Version of MetricGAN for Speech Enhancement
  • DCCRN: Deep Complex Convolution Recurrent Network for Phase-Aware Speech Enhancement
  • DB-AIAT: Dual-branch Attention-In-Attention Transformer for Single-channel Speech Enhancement
  • NSNet2: Noise Suppression Net 2 baseline
  • FullSubNet+: Channel Attention FullSubNet with Complex Spectrograms for Speech Enhancement
  • FRCRN: Boosting Feature Representation using Frequency Recurrence for Monaural Speech Enhancement

Audio Samples


    VoiceBank+DEMAND

    Model \ sample id (noise)   p232_093 (living)   p257_024 (bus)   p257_048 (cafe)   p232_393 (office)   p232_409 (psquare)
    Clean
    Noisy
    DEMUCS
    MetricGAN+
    DCCRN
    DB-AIAT
    FRCRN
    Ours

    Spectrograms of the samples in the first column

    DNS-Challenge

    Model \ sample id   no_reverb_fileid_80   no_reverb_fileid_220   no_reverb_fileid_238   with_reverb_fileid_66   with_reverb_fileid_102
    Clean
    Noisy
    NSNet2
    DEMUCS
    FullSubNet+
    Ours

    Spectrograms of the samples in the first column