Speech Enhancement with Integration of Neural Homomorphic Synthesis and Spectral Masking Online Supplement

Authors

Wenbin Jiang, Kai Yu

Abstract

Speech enhancement refers to suppressing background noise to improve the perceptual quality and intelligibility of noisy speech. Recently, speech enhancement algorithms based on deep neural networks (DNNs) have replaced traditional algorithms based on statistical signal processing and have become mainstream in the field. However, most DNN-based speech enhancement methods operate in the frequency domain and do not exploit the speech production model, which makes them prone to under-suppressing the noise or over-suppressing the speech. To address this shortcoming, we propose a novel speech enhancement method that integrates neural homomorphic synthesis and complex spectral masking. Specifically, we use a shared-encoder, multi-decoder neural network architecture. For neural homomorphic synthesis, the speech signal is separated into excitation and vocal-tract components by liftering the cepstrum; two DNN decoders estimate the target components independently, and the denoised speech is synthesized from the estimated minimum-phase signal and the noisy phase. For spectral masking, another DNN decoder estimates the complex mask of the target spectrum, and the denoised speech spectrum is obtained by masking the noisy spectrum. Each branch estimates the speech signal independently, and the final enhanced speech is obtained by merging the two estimates. Experimental results on two popular datasets show that the proposed method achieves state-of-the-art performance on most evaluation metrics with only 920K model parameters.
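The signal-processing ideas behind the two branches can be illustrated with a short NumPy sketch (this is not the authors' code; frame length, lifter cutoff, and the identity mask are illustrative assumptions). It shows (1) splitting a frame's log-magnitude spectrum into vocal-tract (low-quefrency) and excitation (high-quefrency) components by liftering the real cepstrum, (2) rebuilding a minimum-phase complex spectrum from a log-magnitude via the standard cepstral folding trick, and (3) complex spectral masking as an element-wise complex multiplication.

```python
import numpy as np

def lifter_split(frame, n_fft=512, cutoff=30):
    """Split one windowed frame into vocal-tract / excitation log-spectra."""
    spec = np.fft.rfft(frame, n_fft)
    log_mag = np.log(np.abs(spec) + 1e-8)
    cep = np.fft.irfft(log_mag, n_fft)        # real cepstrum (even-symmetric)
    lifter = np.zeros(n_fft)
    lifter[:cutoff] = 1.0                     # low quefrency -> vocal tract
    lifter[-cutoff + 1:] = 1.0                # symmetric counterpart
    vocal_tract = np.fft.rfft(cep * lifter, n_fft).real
    excitation = log_mag - vocal_tract        # high quefrency -> excitation
    return vocal_tract, excitation

def minimum_phase(log_mag, n_fft=512):
    """Reconstruct a minimum-phase complex spectrum from a log-magnitude."""
    cep = np.fft.irfft(log_mag, n_fft)
    fold = np.zeros(n_fft)                    # fold the cepstrum to be causal
    fold[0] = cep[0]
    fold[1:n_fft // 2] = 2.0 * cep[1:n_fft // 2]
    fold[n_fft // 2] = cep[n_fft // 2]
    return np.exp(np.fft.rfft(fold, n_fft))

def apply_complex_mask(noisy_spec, mask):
    """Complex spectral masking: element-wise complex multiplication."""
    return noisy_spec * mask

# Toy usage: a synthetic windowed frame and an identity mask.
np.random.seed(0)
frame = np.hanning(512) * np.random.randn(512)
vt, ex = lifter_split(frame)
min_phase_spec = minimum_phase(vt + ex)       # recombine the two components
enhanced = apply_complex_mask(min_phase_spec, np.ones_like(min_phase_spec))
```

In the proposed method the liftered components and the complex mask are of course estimated by the DNN decoders rather than computed from the noisy frame itself; the sketch only shows the deterministic synthesis path around them. A useful sanity check is that the minimum-phase reconstruction preserves the magnitude: `np.abs(min_phase_spec)` equals `np.exp(vt + ex)` up to floating-point error.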

Datasets

  • The VoiceBank+DEMAND and DNS-Challenge datasets are used for the demo.
  • Audio samples of the two processed test sets are available in the repository (voicebank, dns2020).

Compared methods

  • DEMUCS: Real Time Speech Enhancement in the Waveform Domain
  • MetricGAN+: An Improved Version of MetricGAN for Speech Enhancement
  • DCCRN: Deep Complex Convolution Recurrent Network for Phase-Aware Speech Enhancement
  • DB-AIAT: Dual-branch Attention-In-Attention Transformer for Single-channel Speech Enhancement
  • NSNet2: Noise Suppression Net 2 baseline
  • FullSubNet+: Channel Attention FullSubNet with Complex Spectrograms for Speech Enhancement
  • FRCRN: Boosting Feature Representation using Frequency Recurrence for Monaural Speech Enhancement

Audio Samples


    VoiceBank+DEMAND

    Model \ sample id (noise)   p232_093 (living)   p257_024 (bus)   p257_048 (cafe)   p232_393 (office)   p232_409 (psquare)
    Clean
    Noisy
    DEMUCS
    MetricGAN+
    DCCRN
    DB-AIAT
    FRCRN
    Ours

    Spectrograms of the samples in the first column

    DNS-Challenge

    Model \ sample id   no_reverb_fileid_80   no_reverb_fileid_220   no_reverb_fileid_238   with_reverb_fileid_66   with_reverb_fileid_102
    Clean
    Noisy
    NSNet2
    DEMUCS
    FullSubNet+
    Ours

    Spectrograms of the samples in the first column