Spectrogram Feature prediction network
In this section, we will cover the overall architecture of the feature prediction network and show explicitly how the model works for both training and synthesis times.
The feature prediction network is a sequence-to-sequence model with attention. The encoder, decoder and attention mechanism are all customized compared to a basic sequence-to-sequence architecture composed only of recurrent layers.
In a sequence-to-sequence scenario, where inputs and outputs are both sequences, an encoder-decoder architecture is usually the way to go for neural networks.
In the original encoder-decoder framework, an encoder reads an input sequence (or sentence) and compresses it to a "thought vector" of fixed length. The decoder is then trained to predict the output sequence one step at a time using the hidden thought vector. The most common approach is to use RNNs for both the encoder and the decoder.
Due to the limited amount of compressed information the encoder can store in a fixed-length context vector, the attention mechanism was introduced by Bahdanau et al. Typically, attention is an attempt to free the encoder from the burden of summarizing the entire input sequence in a fixed-length context vector. Instead, the encoder maps the input sequence to a series of annotations with the same length as the input sequence, and the decoder learns to "attend" to different parts of these annotations while generating the output sequence. We will discuss this main concept in depth a bit later, especially in the Attention section.
In Tacotron-2, the encoder maps an input sentence to a series of annotations $h = (h_1, h_2, \dots, h_T)$, also called encoder hidden states. Note that both sequences have the exact same length, i.e. the encoder outputs one annotation per input token, where each annotation $h_j$ contains information about the j-th token of the input sequence with respect to its neighboring tokens.
In T2, each input token is a character. The T2 encoder is a block of 3 convolutional layers followed by a bi-directional LSTM layer. Just like in image processing, convolutional layers are useful for representing local correlations between inputs (local correlations between characters in our case) by extracting feature maps useful for the RNN modeling of sequences, i.e. the use of these convolutional layers gives the model a long-term context similar to N-grams. For example, consider these words: Floor and Flour.
Despite having the same first letters, Floor and Flour are pronounced very differently, and in order to pronounce them correctly, we humans take into consideration the u in Flour. At this point, one might say that using an RNN should be enough to capture such details since RNNs have proved to be effective at capturing time correlations in data. This is probably true for short-range correlations, but in practice it is a little hard for RNNs to capture long-term dependencies. The use of convolutions also makes the model more robust to words with silent characters (e.g. the "k" in "know" or the "d" in "django". I had to do this one, I just had to :) ).
In order to capture a longer-term context, instead of directly feeding the embedded characters to some RNN, we first extract N-gram-like features using convolutional layers and feed the resulting features to an RNN. By feeding n-grams to the RNN, the T2 encoder is able to better capture the long-term dependencies between characters of the input sequence. Tacotron-2 is thus even capable of capturing sentence context to differentiate past from present forms of words ("reads" vs "has read").
From a mathematical point of view, consider an input sequence $x = (x_1, x_2, \dots, x_T)$ from which we extract the encoder outputs $h = (h_1, h_2, \dots, h_T)$,
where $V$ is the vocabulary size (the space of available tokens: characters + symbols) of the input tokens and $T$ is the length of the input sequence.
We first compute the encoder convolutional features $f$ by convolving the embedded character inputs $E(x)$ with the consecutive convolution filters $F_1$, $F_2$ and $F_3$, applying a ReLU non-linearity at each convolutional layer ($E$ stands for embeddings):

$$f = \mathrm{ReLU}\big(F_3 * \mathrm{ReLU}(F_2 * \mathrm{ReLU}(F_1 * E(x)))\big)$$

Next, these features are fed to a bi-directional LSTM to generate the hidden encoder outputs $h$:

$$h = \mathrm{BiLSTM}(f)$$
The use of a bi-directional RNN ensures that the model "reads" the input sequence not only from left to right, but also from right to left, which gives it information about the "past and future" of each input character. In this way, the annotation $h_j$ contains the summaries of both the preceding inputs and the following inputs. Due to the tendency of RNNs to better represent recent inputs, the annotation $h_j$ will be focused on the inputs around $x_j$.
A bidirectional RNN consists of two independent RNNs: a forward RNN which reads the input sequence in order (from $x_1$ to $x_T$) and a backward RNN which reads the input sequence in reverse order (from $x_T$ to $x_1$). Their respective outputs are $\overrightarrow{h_j}$ and $\overleftarrow{h_j}$.
The final encoder outputs are the concatenation of the forward and backward hidden states:

$$h_j = \big[\overrightarrow{h_j};\ \overleftarrow{h_j}\big]$$

The forward and backward functions are LSTM cells, which can be described as follows:
where $h_t$ is the hidden state at time t, $c_t$ is the cell state at time t, $x_t$ is the input at time t (the hidden state of the previous layer, or the embedded input for the first layer), and $i_t$, $f_t$, $g_t$, $o_t$ are the input, forget, cell, and output gates, respectively. $\sigma$ is the sigmoid function and, in the accompanying figure, green squares represent linear transformations followed by the equivalent activation function.
Mathematically speaking, LSTM cells can be expressed as follows:

$$
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
g_t &= \tanh(W_g x_t + U_g h_{t-1} + b_g) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$
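To make the encoder structure concrete, here is a minimal PyTorch sketch of the character embedding, the 3-convolution stack and the bidirectional LSTM. This is not the repository's TensorFlow implementation; the layer sizes (512 embedding/filter channels, kernel size 5, 256 LSTM units per direction) follow the values reported in the T2 paper and are assumptions for this sketch.

```python
import torch
import torch.nn as nn

class EncoderSketch(nn.Module):
    """Sketch of a T2-style encoder: character embedding,
    3 x (Conv1d -> BatchNorm -> ReLU), then a bidirectional LSTM."""

    def __init__(self, vocab_size=70, embed_dim=512, conv_channels=512,
                 kernel_size=5, lstm_units=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        convs = []
        in_channels = embed_dim
        for _ in range(3):
            convs += [
                nn.Conv1d(in_channels, conv_channels, kernel_size,
                          padding=(kernel_size - 1) // 2),
                nn.BatchNorm1d(conv_channels),
                nn.ReLU(),
            ]
            in_channels = conv_channels
        self.convs = nn.Sequential(*convs)
        # Bidirectional LSTM: forward and backward outputs are concatenated,
        # so the encoder outputs have 2 * lstm_units features per time step.
        self.lstm = nn.LSTM(conv_channels, lstm_units,
                            batch_first=True, bidirectional=True)

    def forward(self, char_ids):
        # char_ids: [batch, T] integer character indices
        x = self.embedding(char_ids)       # [batch, T, embed_dim]
        x = self.convs(x.transpose(1, 2))  # Conv1d expects [batch, channels, T]
        outputs, _ = self.lstm(x.transpose(1, 2))
        return outputs                     # [batch, T, 2 * lstm_units]

# Quick shape check
enc = EncoderSketch()
h = enc(torch.randint(0, 70, (2, 17)))
print(h.shape)  # torch.Size([2, 17, 512])
```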
In T2, once the encoder outputs (hidden states) are generated, we feed them to an attention network that consumes these hidden states to generate the context vector.
In contrast to the original Tacotron-1 encoder made of a prenet + CBHG block, the T2 encoder can be thought of as a large-scale **simplification** of its predecessor. If we look closer, the T2 encoder is essentially an over-simplified CBHG block where the convolution bank, projections and highway network are swapped for a stack of 3 simple convolutions with ReLU activations. Experiments on switching the T2 encoder with the T1 encoder and vice versa show very similar results, which suggests that the T2 encoder offers the same performance with less computational complexity.
Let's start by discussing Attention in Encoder-Decoder architectures in general.
When introduced by Bahdanau et al. (2015), attention was mainly explained as a direct link between encoder-decoder inputs and outputs. In other words, attention is a mechanism meant to link decoder outputs to the encoder outputs. Since the output sequence is generated one output at a time, attention allows the decoder to "attend" or "pay attention" to the parts of the input that are most relevant for the present output. It's always easier to understand with diagrams, especially since the attention mechanism introduced by Bahdanau is very intuitive and easy to visualize:
As illustrated in this figure, which is a neural machine translation example, attention (alignments) is a direct link between generated decoder outputs and encoder outputs, which in a sense gives a correlation between the outputs and inputs of the encoder-decoder network. Orange-colored cells represent the highest values of the alignment matrix. In this example, one might notice that "Cat" is linked with "Chat", which is the French word for cat.
By adding a small attention network, which is typically a parallel multilayer feed-forward network, the decoder learns not only to predict outputs one at a time, but also to align outputs with the corresponding inputs. By learning the attention, the model is no longer forced to decode sequences from a small, compressed, fixed-length thought vector, an approach that is very ineffective for long sequences (in our T2 case, without the attention mechanism the model wouldn't read long sentences correctly).
Let's take an in-depth look at these attention mechanisms and their evolution over time, until we reach the custom attention we use in Tacotron-2. Please note that, due to the context of our work, I will only present Bahdanau (additive) style attention. For the interested, you can find Luong (multiplicative) style attention explained here.
In order for the decoder to map each decoding step to an input token, we need to compute a context vector, also called attention vector, at each decoding step. We will denote the context vector computed at step $i$ by $c_i$.
In the original attention introduced by Bahdanau et al., the context vector is computed as a weighted sum of the encoder annotations $h_j$ (the encoder outputs, as previously named).
Mathematically speaking:

$$c_i = \sum_{j=1}^{T} \alpha_{ij}\, h_j$$

where the $\alpha_{ij}$ are called alignments or attention weights.
From the above equation, we can instantly notice that the context vector is computed with respect to all encoder outputs in an attempt to determine the most important ones. The question that remains is: how do we compute these alignments?
Here is where different attention mechanisms start diverging. We sometimes hear about soft or hard attention, or even monotonic attention. In this documentation I will only talk about soft attention.
As suggested by the name, we compute soft alignments between decoder outputs and encoder outputs. With soft alignments comes the softmax that we use in the computation of the attention weights:

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T} \exp(e_{ik})}$$

where $e_{ij}$ is a score usually called the energy. The computation of this energy differs between the Bahdanau and Luong works, and even between different types of attention (content-based, location-based and hybrid attention).
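As a small illustration of the two equations above, here is a toy PyTorch snippet (with made-up energies) showing how the softmax turns energies into alignments, and how the context vector is the resulting weighted sum of the annotations:

```python
import torch
import torch.nn.functional as F

# Toy shapes: one decoding step over an input sequence of length T.
T, enc_dim = 6, 512
h = torch.randn(T, enc_dim)   # encoder annotations h_1..h_T
e = torch.randn(T)            # energies e_{i,1}..e_{i,T} for decoder step i

alpha = F.softmax(e, dim=0)                # soft alignments, sum to 1 over the input
c_i = (alpha.unsqueeze(1) * h).sum(dim=0)  # context vector: weighted sum of annotations

print(alpha.sum().item(), c_i.shape)       # 1.0, torch.Size([512])
```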
Let's start with the most basic one, the content-based attention:

$$e_{ij} = v^\top \tanh\big(W s_{i-1} + V h_j\big)$$

where $s_{i-1}$ is the decoder hidden state (output) from the previous time step, $h_j$ is the j-th encoder hidden state, and $v$, $W$ and $V$ are the weight matrices to be learned. Since $V h_j$ is independent of the decoding step $i$, we can pre-compute it in advance.
The content-based attention proved to give the ability to link different outputs with the corresponding input tokens independently of their location. In our Tacotron-2 example, using content-based attention would teach the model to look at most if not all "s" characters when outputting mel spectrogram frames corresponding to an "s". This is obviously not how we humans read sentences, so it's most probably not the best way to go.
Next comes the location-based attention, used by Alex Graves (2013).
The only difference between location-based attention and content-based attention is the way the network computes the score:

$$e_{ij} = v^\top \tanh\big(W s_{i-1} + U f_{i,j}\big), \qquad f_i = F * \alpha_{i-1}$$

where $f_{i,j}$ are the location features computed by convolving the previous alignments $\alpha_{i-1}$ with the convolution filters $F$ (i.e. passing the previous alignments through a convolutional layer). $v$, $W$, $U$ and $F$ are weights to be learned.
The location-based attention doesn't care at all about the content of the input tokens; it only cares about their locations and the distances between these tokens. With this limitation, one can expect it to be hard to correctly predict the distances between consecutive phonemes and their corresponding input tokens without the decoder hidden states as a source of information. Take the example of a silence in the middle of the output sequence due to a "," token in the input sequence. A location-based attention might skip or shorten that silence because it has no awareness of the content of previous decoder steps, but only keeps track of where it looked before and of the text information.
The hybrid attention, as its name suggests, is a mix of the two previously discussed attention mechanisms. As with both of them, the only difference is in the computation of the energy:

$$e_{ij} = v^\top \tanh\big(W s_{i-1} + V h_j + U f_{i,j}\big)$$

where $s_{i-1}$ is the previous decoder RNN hidden state, $\alpha_{i-1}$ are the previous alignments and $h_j$ is the j-th encoder hidden state. $W$, $V$, $U$ and $v$ are the trainable parameters, and $f_{i,j}$ are the location features:

$$f_i = F * \alpha_{i-1}$$

Usually, we keep all the linear transformations in the score computation bias-free. We can however add biases, which makes the attention computation:

$$e_{ij} = v^\top \tanh\big((W s_{i-1} + b_1) + (V h_j + b_2) + (U f_{i,j} + b_3)\big)$$

Mathematically speaking, it is perfectly possible to combine the 3 bias terms into a single $b$ vector, reducing the number of model parameters; the model will learn to attribute values corresponding to the summed biases. The final energy computation then becomes:

$$e_{ij} = v^\top \tanh\big(W s_{i-1} + V h_j + U f_{i,j} + b\big)$$
The hybrid attention takes into consideration both the content and the location of the input tokens, hopefully getting the best of both previous attention mechanisms while overcoming their limitations. Due to the better outcome of the hybrid attention, we build the Tacotron-2 "Location Sensitive Attention" on top of it.
Finally, let's discuss the attention used in T2, the "Location Sensitive Attention". The Tacotron-2 paper states: "We use the location-sensitive attention from [21], which extends the additive attention mechanism [22] to use cumulative attention weights from previous decoder time steps as an additional feature." From this we can assume that the context vector computation is similar to all previously seen attentions, and that the only difference is in the computation of the energy. Here is our interpretation of the paper, which proved to give reasonable results from the very early stages of training:
We compute our attention as follows:

$$e_{ij} = v^\top \tanh\big(W s_i + V h_j + U f_{i,j} + b\big)$$

where we use $s_i$, the current decoding step's RNN hidden state, instead of the previous one (the reason is explained below in the decoder section). We initialize the bias $b$ to a vector of zeros and expect the model to only change it if needed; otherwise it will automatically converge to values close to 0. The location features are this time calculated using the cumulated alignments as follows:

$$f_i = F * \sum_{k=1}^{i-1} \alpha_k$$
As illustrated mathematically, we suppose that the cumulation of alignments is done additively. One might wonder why we chose an additive cumulation rather than a multiplicative one. At first, it was a choice of pure intuition, but here is a diagram that shows that, for a perfect alignment case, multiplicative cumulation is prone to losing the location information we're seeking:
With such cumulated previous alignments, we give the attention network information about the input characters it has already attended to, which it can use to keep moving forward in the sequence and avoid undesirable repetitions in the generated speech.
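Putting the pieces together, here is a hedged PyTorch sketch of this location-sensitive energy computation with cumulated alignments. The attention dimension (128), the number of location filters (32) and the kernel size (31) are assumed hyper-parameters, and the module is a simplification rather than the repository's actual TensorFlow code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocationSensitiveAttentionSketch(nn.Module):
    """Sketch of e_{ij} = v^T tanh(W s_i + V h_j + U f_{i,j} + b),
    where f_i is obtained by convolving the cumulated alignments with filters F."""

    def __init__(self, query_dim=1024, enc_dim=512, attn_dim=128,
                 n_filters=32, kernel_size=31):
        super().__init__()
        self.W = nn.Linear(query_dim, attn_dim, bias=False)  # on decoder state s_i
        self.V = nn.Linear(enc_dim, attn_dim, bias=False)    # on encoder outputs h_j
        self.U = nn.Linear(n_filters, attn_dim, bias=False)  # on location features f_{i,j}
        self.location_conv = nn.Conv1d(1, n_filters, kernel_size,     # the filters F
                                       padding=(kernel_size - 1) // 2, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)
        self.b = nn.Parameter(torch.zeros(attn_dim))          # single merged bias, init at 0

    def forward(self, s_i, h, cum_alignments):
        # s_i: [batch, query_dim], h: [batch, T, enc_dim],
        # cum_alignments: [batch, T] = sum of all previous alignments
        f = self.location_conv(cum_alignments.unsqueeze(1)).transpose(1, 2)  # [batch, T, n_filters]
        energies = self.v(torch.tanh(
            self.W(s_i).unsqueeze(1) + self.V(h) + self.U(f) + self.b)).squeeze(-1)
        alignments = F.softmax(energies, dim=-1)                    # [batch, T]
        context = torch.bmm(alignments.unsqueeze(1), h).squeeze(1)  # [batch, enc_dim]
        return context, alignments
```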
So, to wrap it up: to link the decoder outputs directly to the input tokens, we add a few linear/convolutional connections that the model uses to learn alignments. The alignments are then applied to the encoder outputs to determine the context vector, which gives good information about the most relevant tokens in the input sequence. Here's a simple diagram to wrap up the process:
In this section, we cover the details of the Tacotron-2 decoder and explain how attention, hidden states and outputs are computed at each decoding step. The actions at each decoding step can be summarized with the following chart:
A decoding step starts with feeding the previous output spectrogram frame (during synthesis) or the ground truth of the previous spectrogram frame (during training) to the Pre-Net layers. In the previous chart, synthesis specific actions are in blue while training specific actions are in red.
The outputs of the prenet are then concatenated with the context vector computed at the previous decoding step, and the whole is fed as input to the decoder RNN. This is usually called "input feeding". We will come back to that in a moment.
The output of the decoder RNN is used as the query vector for the computation of the new context vector (remember when we said that we use the current decoder hidden state for the computation of the context vector?). We will also explain this choice in a second.
Finally, the newly computed context vector is concatenated with the decoder outputs (hidden states) and fed to projection layers to predict the output (spectrogram frame) and the <stop_token> probability (which determines if the decoding should stop or not).
Most of the upcoming explanation is inspired by this TensorFlow tutorial, which also exists on the official TensorFlow website. As stated in the linked tutorial, there are many different attention variants. These variants depend on the form of the scoring function and the attention function, and on whether the previous state $s_{i-1}$ is used instead of $s_i$ in the scoring function as originally suggested in (Bahdanau et al., 2015). Empirically, we found that only certain choices matter. First, the basic form of attention, i.e. direct connections between target (output) and source (input), needs to be present. Second, it's important to feed the attention (context) vector to the next timestep (input feeding) to inform the network about past attention decisions, as demonstrated in (Luong et al., 2015). Lastly, different choices of the scoring function can often result in different performance.
After having an in-depth look at TensorFlow's attention_wrapper.py, I chose to adopt a context vector computation "à la Luong", where the context vector is computed after the prediction of the new hidden state $s_i$ but before generating the output $y_i$. We also feed the previous context vector to the decoder when computing the new hidden states, to ensure the model knows which attention it attributed at the last decoding step. This is the "input feeding" process.
Projection layers are common in decoders to project RNN hidden states to the output space. As for the prenet, the T2 paper authors mentioned that the prenet acting as an information bottleneck was essential for learning attention. The "information bottleneck" term refers to the idea of transferring the previous mel spectrogram frame back to a hidden representation level that can be interpreted by the decoder RNN. Think of this as having two distinct representation spaces, where one of them is the frequency space used in the mel spectrograms and the other is an abstract space that holds the network's vision of these mel frames. The prenet serves to translate the frames from the frequency space back to the hidden abstract space so that the RNN is able to predict the next frames.
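A minimal sketch of such a prenet bottleneck, assuming the paper's two 256-unit fully connected ReLU layers with dropout (the T2 paper keeps the prenet dropout active at synthesis time as well):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PreNetSketch(nn.Module):
    """Sketch of the decoder prenet: two fully connected ReLU layers that map
    the previous mel frame back to a hidden representation."""

    def __init__(self, n_mels=80, units=256, drop_rate=0.5):
        super().__init__()
        self.fc1 = nn.Linear(n_mels, units)
        self.fc2 = nn.Linear(units, units)
        self.drop_rate = drop_rate

    def forward(self, prev_frame):
        # Dropout is applied with training=True on purpose: the T2 paper reports
        # keeping prenet dropout at inference time to add output variation.
        x = F.dropout(F.relu(self.fc1(prev_frame)), self.drop_rate, training=True)
        x = F.dropout(F.relu(self.fc2(x)), self.drop_rate, training=True)
        return x
```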
Some side information to keep in mind: in our implementation, we support the use of a reduction factor that allows the decoder to predict r (reduction factor) mel spectrogram frames in one decoding step (a small shape sketch follows the list below). This proved to speed up computation and reduce memory usage. The T2 paper states using r=1; we also experimented with higher reduction factors and drew the following conclusions:
- Aside from speeding up computation (train/test times) and reducing memory usage, r=2 or r=3 allow for better alignments as they enable each decoding step to approximately model larger portions of a phoneme (75~100 ms), allowing for a more certain link between frames and text.
- Acoustic quality (voice clarity), however, proved to be better with r=1 when using Griffin-Lim. This is due to the fact that spectrograms generated with r=1 tend to have more high-frequency details than those with higher reduction factors.
- For WaveNet purposes, as long as training is made in GTA mode, r=1, r=2 or r=3 makes no difference whatsoever on acoustic quality.
- The reduction factor does not necessarily impact prosody quality much, but we did notice that r=1 gives slightly less flat speech than the other factors.
- For the final pretrained models we provide, we use r=1 to ensure that both our GL and WaveNet T2 models have decent acoustic and prosody quality.
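To make the reduction factor concrete, here is a tiny shape walk-through with hypothetical shapes: each decoding step emits r mel frames at once, which are then unfolded back along the time axis:

```python
import torch

# Hypothetical shapes for a reduction factor r > 1.
batch, decoder_steps, n_mels, r = 2, 100, 80, 2

decoder_out = torch.randn(batch, decoder_steps, n_mels * r)     # r frames per decoding step
mel_frames = decoder_out.reshape(batch, decoder_steps * r, n_mels)
print(mel_frames.shape)  # torch.Size([2, 200, 80])
```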
So, from a mathematical point of view, the i-th decoder step can be seen as the following series of equations:

$$
\begin{aligned}
p_i &= \mathrm{PreNet}(y_{i-1}) \\
s_i &= \mathrm{LSTM}\big([p_i;\ c_{i-1}],\ s_{i-1}\big) \quad (*) \\
e_{ij} &= v^\top \tanh\big(W s_i + V h_j + U f_{i,j} + b\big) \\
\alpha_{ij} &= \frac{\exp(e_{ij})}{\sum_{k=1}^{T} \exp(e_{ik})} \\
c_i &= \sum_{j=1}^{T} \alpha_{ij}\, h_j \\
y_i &= W_y\,[s_i;\ c_i] + b_y \\
t_i &= \sigma\big(W_t\,[s_i;\ c_i] + b_t\big)
\end{aligned}
$$

where $y_{i-1}$ is the previous spectrogram frame (or its ground truth during training) and $t_i$ is the <stop_token> probability.
(*): Refer to the Encoder section for the LSTM output computation.
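The same series of equations can be sketched as a single decoding step in Python. All the modules passed in (prenet, decoder_lstm, attention, projections) are hypothetical callables standing in for the real layers; the point is only the order of operations:

```python
import torch

def decoder_step_sketch(prev_frame, prev_context, prev_states, cum_alignments,
                        encoder_outputs, prenet, decoder_lstm, attention,
                        frame_projection, stop_projection):
    """One decoding step, following the order described above."""
    # 1. Bottleneck the previous frame, then input-feed the previous context.
    p = prenet(prev_frame)                          # [batch, prenet_dim]
    lstm_input = torch.cat([p, prev_context], dim=-1)

    # 2. Advance the decoder RNN to get the new hidden state s_i.
    s_i, new_states = decoder_lstm(lstm_input, prev_states)

    # 3. Compute the new context with the *current* state and cumulated alignments.
    context, alignments = attention(s_i, encoder_outputs, cum_alignments)
    cum_alignments = cum_alignments + alignments    # additive cumulation

    # 4. Project [s_i; c_i] to the mel frame and the <stop_token> probability.
    proj_input = torch.cat([s_i, context], dim=-1)
    frame = frame_projection(proj_input)
    stop_prob = torch.sigmoid(stop_projection(proj_input))
    return frame, stop_prob, context, new_states, cum_alignments
```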
Last but not least, once the decoder is done decoding, the predicted mel spectrogram is passed through a stack of convolutional layers with tanh non-linearities to improve the overall output quality. This works probably because convolutional layers are good at capturing features in images and sequences, especially since they see both past and future context in the already-decoded spectrogram. These convolutions are followed by a final linear projection layer to the output space.
Mathematically, the Post-Net output, called the "residual" $r$, can be computed with:

$$r = W_{post}\, f_{post} + b_{post}$$

where $f_{post}$, the post-net convolutional features, are computed as a stack of 5 convolutions with tanh activations:

$$f_k = \tanh\big(F_k * x\big), \qquad k = 1, \dots, 5$$

with $x$ the output of the previous convolutional layer, or the decoder output for the first Post-Net layer.
Finally, to get the final features of the spectrogram prediction network, we sum the decoder spectrogram outputs $\hat{y}$ with the residual to get the final outputs:

$$\hat{y}_{final} = \hat{y} + r$$
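A hedged PyTorch sketch of such a post-net (5 tanh convolutions, a linear projection, and the residual addition). The 512 channels and kernel size 5 are assumed values:

```python
import torch
import torch.nn as nn

class PostNetSketch(nn.Module):
    """Sketch of the post-net: 5 Conv1d layers with tanh activations followed
    by a linear projection back to the mel dimension; the result is added to
    the decoder output as a residual."""

    def __init__(self, n_mels=80, channels=512, kernel_size=5, n_layers=5):
        super().__init__()
        layers = []
        in_channels = n_mels
        for _ in range(n_layers):
            layers += [nn.Conv1d(in_channels, channels, kernel_size,
                                 padding=(kernel_size - 1) // 2),
                       nn.Tanh()]
            in_channels = channels
        self.convs = nn.Sequential(*layers)
        self.proj = nn.Conv1d(channels, n_mels, kernel_size=1)  # linear projection

    def forward(self, decoder_mels):
        # decoder_mels: [batch, T, n_mels]
        x = decoder_mels.transpose(1, 2)
        residual = self.proj(self.convs(x)).transpose(1, 2)
        return decoder_mels + residual  # final spectrogram prediction
```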
Optionally, we add another post processing network used to predict linear scale spectrograms from their respective mel spectrograms.
This small network is simply a CBHG block like the ones used in Tacotron-1 followed by a linear projection layer.
In order to train the model, we use the summed mean squared error (MSE) from before and after the post-net to aid convergence, i.e. we minimize the following loss:

$$\mathcal{L}_{mel} = \frac{1}{n} \sum_{i=1}^{n} \Big[ \big(y_i - \hat{y}_i\big)^2 + \big(y_i - \hat{y}^{final}_i\big)^2 \Big]$$

where $n$ is the number of samples in the batch, $y_i$ is the target mel spectrogram, $\hat{y}_i$ is the decoder output and $\hat{y}^{final}_i$ is the post-net output.
Optionally, if we choose to make linear spectrogram predictions, we define a linear loss based on the L1 norm:

$$\ell_1 = \big| y^{lin} - \hat{y}^{lin} \big|$$

We thus define our linear loss as:

$$\mathcal{L}_{linear} = \frac{1}{2}\,\overline{\ell_1} + \frac{1}{2}\,\overline{\ell_1^{\,low}}$$

where $\ell_1^{\,low}$ is the same L1 norm but computed only on the low-frequency components of the spectrogram (the overline denotes the mean). The linear loss **encourages the model to optimize** the low-frequency parts of the spectrogram prior to optimizing the high-frequency reconstruction, as low frequencies hold more speech information than high frequencies.
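A small sketch of how such a linear loss could be computed; the equal 0.5/0.5 weighting between the full-band and low-frequency terms is an assumption of this sketch, and n_priority_freq (the number of low-frequency bins) is a hypothetical parameter:

```python
import torch

def linear_loss_sketch(linear_target, linear_pred, n_priority_freq):
    """L1 term over the full linear spectrogram plus an L1 term restricted to
    the low-frequency bins. Shapes: [batch, T, n_freq]."""
    l1 = torch.abs(linear_target - linear_pred)
    # Equal weighting of full-band and low-frequency terms (assumed).
    return 0.5 * l1.mean() + 0.5 * l1[:, :, :n_priority_freq].mean()
```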
We also apply an L2 regularization with a weight $\lambda$, which makes the final loss function:

$$\mathcal{L} = \mathcal{L}_{mel} + \lambda \sum_{k=1}^{p} w_k^2$$

or optionally (when predicting linear spectrograms as well):

$$\mathcal{L} = \mathcal{L}_{mel} + \mathcal{L}_{linear} + \lambda \sum_{k=1}^{p} w_k^2$$

where $\lambda$ is the regularization weight and $p$ is the total number of weights in the network (filters are considered weights). Note that we don't regularize biases.
We minimize this loss using the Adam optimizer to update the network parameters with the same training hyper-parameters described in our hparams file.
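To summarize the objective, here is a hedged sketch of the full loss computation; reg_weight stands for the regularization weight $\lambda$ and its value here is only a placeholder (the actual value is set in the hparams file):

```python
import torch
import torch.nn.functional as F

def tacotron2_loss_sketch(mel_before, mel_after, mel_target, model, reg_weight=1e-6):
    """Summed MSE before and after the post-net, plus L2 regularization over
    the weights (biases excluded). reg_weight is an assumed placeholder."""
    loss = F.mse_loss(mel_before, mel_target) + F.mse_loss(mel_after, mel_target)
    l2 = sum((p ** 2).sum() for name, p in model.named_parameters()
             if 'bias' not in name)
    return loss + reg_weight * l2
```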