Speech Enhancement in TTS systems
Google Winter of Code 2022
This was a Winter of Code research project on enhancing speech generated by text-to-speech (TTS) models.
The speech generated by many TTS models contained ambient noise and noise-like artifacts. We worked on post-processing methods to reduce or remove those artifacts. We also wanted to quantify the noise and measure how much clearer the audio became after applying our methods, so we were also interested in developing metrics for speech clarity.
Datasets
We used the following datasets for testing our methods:
- NOIZEUS dataset
- Skit TTS dataset
Speech Enhancement Methods
Speech enhancement methods could be broadly classified into two categories:
Signal Processing
This category used traditional analytical filters that removed noise either by assuming additive noise or by assuming an orthogonal direct-sum decomposition of the noisy signal into clean-speech and pure-noise components. Statistical estimators such as MMSE, MAP, and maximum likelihood fell into this category as well.
We implemented and tested the following methods:
- Kalman Filter
- Wiener Filter
- Oversubtraction / Spectral Flooring
- Bayesian MMSE Filter
- Bayesian MMSE Log Filter
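As an illustration of the oversubtraction / spectral flooring idea from the list above, here is a minimal sketch of the per-frame subtraction rule under the additive-noise assumption. It is not the repository's implementation: the function names `oversubtract` and `magnitude_gain` and the defaults for `alpha` and `beta` are hypothetical, and the inputs are assumed to be per-bin power spectra of a single frame.

```python
import math


def oversubtract(noisy_power, noise_power, alpha=4.0, beta=0.01):
    """Power spectral subtraction with oversubtraction and a spectral floor.

    noisy_power: per-bin power of one noisy frame, |Y(k)|^2
    noise_power: per-bin noise power estimate, |D(k)|^2
    alpha: oversubtraction factor (hypothetical default)
    beta: floor as a fraction of the noise power (hypothetical default)
    Returns the estimated clean-speech power per bin.
    """
    if len(noisy_power) != len(noise_power):
        raise ValueError("spectra must have the same number of bins")
    clean = []
    for y, d in zip(noisy_power, noise_power):
        residual = y - alpha * d   # subtract more noise than estimated
        floor = beta * d           # never drop below a fraction of the noise
        clean.append(residual if residual > floor else floor)
    return clean


def magnitude_gain(noisy_power, clean_power):
    """Per-bin gain for the noisy magnitude spectrum (noisy phase is kept)."""
    return [math.sqrt(c / y) if y > 0 else 0.0
            for c, y in zip(clean_power, noisy_power)]


# Usage on a toy three-bin frame
noisy = [10.0, 1.0, 0.5]
noise = [1.0, 1.0, 1.0]
clean = oversubtract(noisy, noise, alpha=2.0, beta=0.1)
gain = magnitude_gain(noisy, clean)
```

In a full pipeline the gain would be applied to each frame of a short-time Fourier transform before resynthesis, and the noise power is commonly estimated from frames that contain no speech. The floor keeps isolated bins from being zeroed out, which is what tends to produce "musical noise" in plain spectral subtraction.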
Deep Learning
These methods were more recent and learned to denoise from training data rather than from an analytical noise model. Some popular examples were the Facebook Denoiser, SEGAN, and RNNoise.
Metrics
We implemented and tested the following metrics:
- Perceptual Evaluation of Speech Quality (PESQ, narrow and wide band)
- Short-Time Objective Intelligibility (STOI)
- F0 Frame Error (FFE)
- Gross Pitch Error (GPE)
- Mel Cepstral Distortion (MCD, both versions)
- Voicing Decision Error (VDE)
- Mean Speech Distortion (MSD)
- Word Error Rate (WER)
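To make the pitch-based metrics in the list above concrete, here is a minimal sketch of gross pitch error, voicing decision error, and F0 frame error computed from two frame-aligned F0 tracks. It is not the repository's code: the function name `pitch_errors` is hypothetical, unvoiced frames are assumed to be marked with an F0 of 0, and the 20% tolerance is the commonly used threshold rather than a value taken from this project.

```python
def pitch_errors(f0_ref, f0_est, tol=0.2):
    """Compute GPE, VDE and FFE from two frame-aligned F0 tracks.

    f0_ref, f0_est: F0 per frame in Hz, with 0 marking an unvoiced frame
                    (an assumption of this sketch)
    tol: relative deviation above which a pitch estimate counts as a gross error
    """
    if len(f0_ref) != len(f0_est) or not f0_ref:
        raise ValueError("F0 tracks must be non-empty and of equal length")

    n_frames = len(f0_ref)
    voicing_errors = 0   # frames where the voiced/unvoiced decisions differ
    gross_errors = 0     # frames voiced in both, pitch off by more than tol
    both_voiced = 0

    for ref, est in zip(f0_ref, f0_est):
        ref_voiced, est_voiced = ref > 0, est > 0
        if ref_voiced != est_voiced:
            voicing_errors += 1
        elif ref_voiced:
            both_voiced += 1
            if abs(est - ref) > tol * ref:
                gross_errors += 1

    return {
        # GPE is normalised by frames voiced in both tracks
        "gpe": gross_errors / both_voiced if both_voiced else 0.0,
        # VDE and FFE are normalised by the total number of frames
        "vde": voicing_errors / n_frames,
        "ffe": (gross_errors + voicing_errors) / n_frames,
    }


# Usage on a toy five-frame example
scores = pitch_errors([100, 100, 0, 0, 200], [100, 130, 0, 50, 0])
```

F0 frame error counts a frame as wrong if it has either a voicing error or a gross pitch error, so it summarises the other two in a single number. In practice the F0 tracks would come from a pitch tracker run on the reference and the enhanced audio.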
The signal-processing filters worked well on the NOIZEUS dataset, but they were less effective on the output of TTS models. The results of applying the filters to both the NOIZEUS and the TTS datasets are discussed in detail here. Hence, we needed to resort to deep learning methods!
This repo could be used for real-world speech denoising. Most importantly, it provided implementations of key metrics for measuring the distortion and clarity of speech.
Installation
To install, clone the repository and install the requirements:
git clone https://github.com/skit-ai/woc-tts-enhancement
cd woc-tts-enhancement
pip install -r requirements.txt
Here are various links related to the project if you’re interested