Paper Reading on Tacotron: Towards End to End Speech Synthesis

Prajwol Lamichhane
4 min read · Apr 24, 2020

Abstract

A text-to-speech synthesis system typically consists of multiple stages:

i. a text analysis frontend

ii. an acoustic model

iii. an audio synthesis module

Tacotron is an end-to-end generative text-to-speech model that synthesizes speech directly from characters. The model can be trained from scratch on <text, audio> pairs.

Tacotron achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness. Since Tacotron generates speech at the frame level, it’s substantially faster than sample-level autoregressive methods.

Acronyms: TTS (text-to-speech), NMT (neural machine translation)

Introduction

Modern TTS pipelines are complex because they include the following components:

  • a text front end for extracting various linguistic features
  • a duration model
  • an acoustic feature prediction model
  • a complex signal-processing-based vocoder

(Zen et al.)

Because each component is trained in isolation, errors from each stage may compound. Building and tuning these pipelines also requires substantial human effort.

Thus, an integrated end-to-end TTS model has many advantages, since it can be trained on <text, audio> pairs with minimal human annotation:

  • alleviates the need for feature engineering
  • more easily allows conditioning on high-level attributes such as sentiment
  • adaptation to new data is easier
  • a single-stage model is obtained

TTS outputs are continuous, and output sequences are usually much longer than the input; these properties cause prediction errors to accumulate quickly.

So, in this paper, the TTS model is based on the sequence-to-sequence (seq2seq) framework (Sutskever et al.) with the attention paradigm (Bahdanau et al.).

Inputs: Characters

Outputs: Raw Spectrogram

With a simple waveform synthesis technique, Tacotron produces a 3.82 mean opinion score (MOS) on a US English eval set, outperforming a production parametric system in terms of naturalness.

Model Architecture:

The backbone of Tacotron is a seq2seq model with attention (Bahdanau et al.).

At the surface level, we can see that the model consists of:

I. An encoder

II. An attention based decoder

III. Post processing net

So, the model takes characters as input and produces spectrogram frames, which are then converted to waveforms.

CBHG Module

(1-D convolution bank + highway network + bidirectional GRU)

It is a powerful model for extracting representations from sequences.

Here,

Convolution

  • The input sequence is first convolved with K sets of 1-D convolutional filters, where the k-th set contains Ck filters of width k (i.e. k = 1, 2, . . . , K).
  • The convolution outputs are stacked together and then max pooled along time
  • The pooling stride is kept at 1 to preserve the time resolution
  • The processed sequence is further passed through fixed-width 1-D convolutions, whose outputs are added to the original inputs via residual connections
  • Batch normalization is used for all convolutional layers (see the sketch after this list)
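A minimal sketch of the convolution bank in PyTorch, under assumed sizes (in_dim=128, K=16) rather than the paper's exact hyperparameters:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvBank(nn.Module):
    def __init__(self, in_dim=128, K=16):
        super().__init__()
        # K sets of 1-D convolutions, the k-th with filter width k
        self.convs = nn.ModuleList(
            nn.Conv1d(in_dim, in_dim, kernel_size=k, padding=k // 2)
            for k in range(1, K + 1)
        )
        self.bns = nn.ModuleList(nn.BatchNorm1d(in_dim) for _ in range(K))

    def forward(self, x):                       # x: (batch, in_dim, time)
        T = x.size(-1)
        outs = [F.relu(bn(conv(x)))[:, :, :T]   # trim padding for even widths
                for conv, bn in zip(self.convs, self.bns)]
        stacked = torch.cat(outs, dim=1)        # stack along the channel axis
        # max pooling over time with stride 1 keeps the time resolution
        return F.max_pool1d(stacked, kernel_size=2, stride=1, padding=1)[:, :, :T]
```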

Multi layer Highway

Now, the convolution outputs are fed to a multi-layer highway network (a gated architecture that makes very deep networks easier to optimize) to extract high-level features.
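A single highway layer mixes a transformed version of the input with the untouched input through a learned gate. A minimal sketch, with an illustrative dimension rather than the paper's setting:

```python
import torch
import torch.nn as nn

class Highway(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.H = nn.Linear(dim, dim)   # candidate transform
        self.T = nn.Linear(dim, dim)   # transform gate

    def forward(self, x):
        h = torch.relu(self.H(x))
        t = torch.sigmoid(self.T(x))
        # gated mix: t decides how much of the transform vs. the raw input passes through
        return h * t + x * (1.0 - t)
```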

GRU RNN

Once the features are extracted, a bidirectional GRU RNN is stacked on top to extract sequential features from both the forward and backward context. The CBHG module follows a similar design to the NMT architecture used by Lee et al.
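The bidirectional GRU can be expressed directly with PyTorch's built-in layer; the hidden size below is illustrative:

```python
import torch
import torch.nn as nn

gru = nn.GRU(input_size=128, hidden_size=128,
             batch_first=True, bidirectional=True)

x = torch.randn(2, 50, 128)   # (batch, time, features) from the highway stack
outputs, _ = gru(x)           # (batch, time, 256): forward + backward context concatenated
```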

Differences from the work of Lee et al. include:

  • use of non-causal convolutions
  • batch normalization
  • residual connections
  • stride-1 max pooling

These modifications are reported to improve generalization.

Encoder

The input to the encoder is a character sequence, where each character is represented as a one-hot vector and embedded into a continuous vector.

Then a set of non-linear transformations, collectively called a “pre-net” (a bottleneck of fully-connected layers with dropout), is applied to each embedding; this helps convergence and improves generalization.

Finally, the CBHG module transforms the pre-net outputs into the final encoder representation later used by the attention module.
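A sketch of the encoder front end (character embedding followed by the pre-net), assuming PyTorch and illustrative sizes; the vocabulary size of 70 and the layer widths are assumptions, not the paper's exact values:

```python
import torch
import torch.nn as nn

class PreNet(nn.Module):
    def __init__(self, in_dim=256, sizes=(256, 128), dropout=0.5):
        super().__init__()
        dims = [in_dim, *sizes]
        self.layers = nn.ModuleList(
            nn.Linear(d_in, d_out) for d_in, d_out in zip(dims, dims[1:])
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # each bottleneck layer: linear -> ReLU -> dropout
        for layer in self.layers:
            x = self.dropout(torch.relu(layer(x)))
        return x

embedding = nn.Embedding(num_embeddings=70, embedding_dim=256)  # ~70 characters, assumed
prenet = PreNet()

chars = torch.randint(0, 70, (2, 30))        # (batch, char sequence)
prenet_out = prenet(embedding(chars))        # (batch, 30, 128), then fed to the CBHG module
```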

Decoder

A content-based tanh attention decoder is used, where a stateful recurrent layer produces the attention query at each decoder time step.

The context vector and the attention RNN cell output are concatenated to form the input to the decoder RNNs at each time step.
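A minimal sketch of content-based tanh (Bahdanau-style) attention: the decoder query is scored against every encoder time step and the scores are normalized into weights. Names and dimensions are illustrative, not the authors' code:

```python
import torch
import torch.nn as nn

class TanhAttention(nn.Module):
    def __init__(self, query_dim=256, memory_dim=256, attn_dim=256):
        super().__init__()
        self.q_proj = nn.Linear(query_dim, attn_dim)
        self.m_proj = nn.Linear(memory_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, query, memory):
        # query: (batch, query_dim), memory: (batch, enc_time, memory_dim)
        scores = self.v(torch.tanh(
            self.q_proj(query).unsqueeze(1) + self.m_proj(memory)
        )).squeeze(-1)                                    # (batch, enc_time)
        weights = torch.softmax(scores, dim=-1)
        context = torch.bmm(weights.unsqueeze(1), memory).squeeze(1)
        return context, weights                           # context: (batch, memory_dim)
```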

A stack of GRUs with vertical residual connections (which give gradients a short path through the stack, rather than forcing them through every non-linearity) is used in the decoder; the residual connections speed up convergence.
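A sketch of a decoder GRU stack with vertical residual connections, where each layer's output is added back to its input; layer count and width are illustrative:

```python
import torch
import torch.nn as nn

class ResidualGRUStack(nn.Module):
    def __init__(self, dim=256, num_layers=2):
        super().__init__()
        self.cells = nn.ModuleList(nn.GRUCell(dim, dim) for _ in range(num_layers))

    def forward(self, x, hiddens):
        # x: (batch, dim); hiddens: list of per-layer states, each (batch, dim)
        new_hiddens = []
        for cell, h in zip(self.cells, hiddens):
            h_new = cell(x, h)
            new_hiddens.append(h_new)
            x = x + h_new        # vertical residual connection
        return x, new_hiddens
```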

We use a post-processing network (discussed below) to convert from the seq2seq target to waveform.
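The paper reconstructs the waveform from the predicted linear-scale spectrogram with the Griffin-Lim algorithm. A minimal sketch using librosa; the hop and window lengths here are assumed, not the paper's exact analysis parameters:

```python
import numpy as np
import librosa

def spectrogram_to_wave(magnitudes, n_iter=50, hop_length=200, win_length=800):
    # magnitudes: (1 + n_fft // 2, frames) linear-scale magnitude spectrogram
    return librosa.griffinlim(magnitudes, n_iter=n_iter,
                              hop_length=hop_length, win_length=win_length)

# Random spectrogram standing in for the post-processing net's output
fake_spec = np.abs(np.random.randn(513, 120)).astype(np.float32)
wave = spectrogram_to_wave(fake_spec)
```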

A simple fully-connected output layer is used to predict the decoder targets.

An important trick we discovered was predicting multiple, non-overlapping output frames at each decoder step. Predicting r frames at once divides the total number of decoder steps by r, which reduces model size, training time, and inference time. This trick improved the convergence speed.

This is likely because neighboring speech frames are correlated and each character usually corresponds to multiple frames.

Emitting one frame at a time forces the model to attend to the same input token for multiple time steps; emitting multiple frames allows the attention to move forward early in training.
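A sketch of the r-frame output trick: the output layer is simply widened to predict r frames per decoder step and the result is reshaped back into individual frames. The values of r and the frame dimension below are illustrative:

```python
import torch
import torch.nn as nn

r, n_mels, decoder_dim = 2, 80, 256
out_layer = nn.Linear(decoder_dim, n_mels * r)   # predicts r frames at once

decoder_steps = 100
decoder_out = torch.randn(4, decoder_steps, decoder_dim)
frames = out_layer(decoder_out).view(4, decoder_steps * r, n_mels)
# 100 decoder steps -> 200 spectrogram frames, so the decoder runs r times fewer steps
```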
