Speech Enhancement in TTS systems

Google Winter of Code 2022

This was a winter of code research project aimed at enhancing speech generated by text-to-speech (TTS) models.

The speech generated by many TTS models contained ambient noise and noise-like artifacts. We worked on post-processing methods to reduce or remove those artifacts. We also wanted to quantify the noise and how much clearer the audio became after enhancement, so we were also interested in developing metrics for speech clarity.

Datasets

We used the following datasets for testing our methods-

Speech Enhancement Methods

Speech enhancement methods could be broadly classified into two categories -

Signal Processing

This category used traditional analytical filters which removed noise, either by assuming additive noise, or by assuming an orthogonal direct sum decomposition of the noisy signal into the clean signal and the pure noise signal. Statistical estimation techniques like MMSE, MAP, and MLE fell into this category as well.

We implemented and tested the following methods-

  • Kalman Filter
  • Wiener Filter
  • Oversubtraction / Spectral Flooring
  • Bayesian MMSE Filter
  • Bayesian MMSE Log Filter

Deep Learning

These methods were relatively new and learned to denoise from training data rather than relying on analytical filters. Some popular examples were the Facebook Denoiser, SEGAN, and RNNoise.

Metrics

We implemented and tested the following metrics-

  • Perceptual Evaluation of Speech Quality (PESQ, narrow and wide band)
  • Short-Time Objective Intelligibility (STOI)
  • F0 Frame Error (FFE)
  • Gross Pitch Error (GPE)
  • Mel Cepstral Distortion (MCD, both versions)
  • Voicing Decision Error (VDE)
  • Mean Speech Distortion (MSD)
  • Word Error Rate (WER)
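As an illustration of how the pitch-based metrics in this list relate to one another, the sketch below computes GPE, the voicing decision error, and FFE from two frame-aligned F0 tracks, using the commonly cited definitions. The function name `pitch_errors` is hypothetical, the 20% tolerance is the usual convention rather than something taken from this repository, and unvoiced frames are assumed to be marked with an F0 of 0.

```python
import numpy as np

def pitch_errors(f0_ref, f0_est, tol=0.2):
    """Return (GPE, voicing decision error, FFE) for two aligned F0 tracks.

    Assumption: both tracks have one value per frame, in Hz, with 0
    marking an unvoiced frame.
    """
    f0_ref = np.asarray(f0_ref, dtype=float)
    f0_est = np.asarray(f0_est, dtype=float)
    voiced_ref = f0_ref > 0
    voiced_est = f0_est > 0

    # Frames where exactly one of the two tracks is voiced.
    voicing_err = voiced_ref != voiced_est
    # Frames voiced in both tracks whose pitch deviates by more than tol.
    both = voiced_ref & voiced_est
    gross = both & (np.abs(f0_est - f0_ref) > tol * f0_ref)

    n = len(f0_ref)
    gpe = gross.sum() / max(both.sum(), 1)        # over frames voiced in both
    vde = voicing_err.sum() / n                   # over all frames
    ffe = (gross.sum() + voicing_err.sum()) / n   # over all frames
    return gpe, vde, ffe
```

FFE counts a frame as wrong if it has either a voicing error or a gross pitch error, which is why it is often reported as a single summary of pitch tracking quality.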

The signal processing filters worked really well on the NOIZEUS dataset; however, they did not perform as well on TTS output. The results of applying the filters to the NOIZEUS and the TTS datasets had been discussed in detail here. Hence, we needed to resort to deep learning methods!

This repo could be used for real-life speech denoising purposes. Most importantly, it provided implementations of crucial metrics which could be used for measuring the amount of distortion or the clarity of the speech.
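Of the metrics listed earlier, WER is the simplest to sketch in isolation. The example below computes it as the word-level edit distance between a reference transcript and a hypothesis transcript, divided by the reference length. The function name `word_error_rate` is hypothetical and is not the repository's API; in practice the hypothesis would come from running a speech recognizer on the enhanced audio, which is assumed here rather than shown.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions are counted, the value can exceed 1 when the hypothesis is much longer than the reference.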

Installation

To install, simply clone the repository and install the requirements:

git clone https://github.com/skit-ai/woc-tts-enhancement
cd woc-tts-enhancement
pip install -r requirements.txt

Here are various links related to the project if you’re interested