End-2-end speech synthesis with recurrent neural networks

Last update: Dec 07, 2022

Overview

Introduction

New: An interactive demo using Google Colaboratory can be found here.

TTS-Cube is an end-2-end speech synthesis system that provides a full processing pipeline to train and deploy TTS models.

It is entirely based on neural networks, requires no pre-aligned data and can be trained to produce audio just by using character or phoneme sequences.
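As a rough illustration of what "just character sequences" means as model input, the sketch below maps raw text to integer IDs. It is a minimal example, not TTS-Cube's actual preprocessing: the function names `build_vocab` and `encode` and the reserved IDs for padding and unknown symbols are assumptions made for this example.

```python
PAD_ID = 0  # assumed ID reserved for padding
UNK_ID = 1  # assumed ID reserved for symbols not seen during training


def build_vocab(texts):
    """Assign an integer ID to every distinct character in the training texts."""
    symbols = sorted({ch for text in texts for ch in text})
    # Real symbols start at 2 because 0 and 1 are reserved above.
    return {ch: idx + 2 for idx, ch in enumerate(symbols)}


def encode(text, vocab):
    """Convert a string into the list of character IDs fed to the encoder."""
    return [vocab.get(ch, UNK_ID) for ch in text]


if __name__ == "__main__":
    vocab = build_vocab(["Arată că interesul utilizatorilor este ridicat."])
    print(encode("interes", vocab))
```

No alignment between characters and audio frames is supplied here. In a system of this kind, the attention module is expected to learn that alignment during training.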

Markdown does not allow embedding of audio files. For a better experience, check out the project's website.

For installation please follow these instructions. Training and usage examples can be found here. A notebook demo can be found here.

Output examples

Encoder outputs:

"Arată că interesul utilizatorilor de internet față de acțiuni ecologiste de genul Earth Hour este unul extrem de ridicat."

"Pentru a contracara proiectul, Rusia a demarat un proiect concurent, South Stream, în care a încercat să atragă inclusiv o parte dintre partenerii Nabucco."

Vocoder output (conditioned on gold-standard data)

Note: The mel-spectrum is computed with a frame-shift of 12.5ms. This means that Griffin-Lim reconstruction produces sloppy results at best (regardless of the number of iterations).
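To make the 12.5ms frame-shift concrete, the sketch below converts it into a hop size in samples and a frame count. The sample rates are assumptions chosen for illustration, since the sampling rate used by TTS-Cube is not stated here, and the helper names are hypothetical.

```python
import math


def frame_shift_to_hop(sample_rate_hz, frame_shift_ms=12.5):
    """Number of audio samples between the starts of two consecutive frames."""
    return int(round(sample_rate_hz * frame_shift_ms / 1000.0))


def num_frames(num_samples, hop):
    """Number of spectrogram frames needed to cover num_samples samples."""
    return math.ceil(num_samples / hop)


if __name__ == "__main__":
    for sample_rate in (16000, 22050, 24000):  # assumed example rates
        hop = frame_shift_to_hop(sample_rate)
        frames = num_frames(sample_rate, hop)  # one second of audio
        print(sample_rate, "Hz:", hop, "samples per frame,", frames, "frames per second")
```

Each mel frame therefore stands in for a few hundred waveform samples, all of which the vocoder has to generate from that single conditioning vector.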

original vocoder

End to end decoding

The encoder model is still converging, so right now the examples are still of low quality. We will update the files as soon as we have a stable Encoder model.

synthesized original (unseen)

Technical details

TTS-Cube is based on concepts described in Tacotron (1 and 2), Char2Wav and WaveRNN, but its architecture does not stick to the exact recipes:

It has a dual architecture, composed of (a) a module (Encoder) that converts sequences of characters or phonemes into a mel-log spectrogram and (b) an RNN-based Vocoder that is conditioned on the spectrogram to produce audio
The Encoder is similar to those proposed in Tacotron (Wang et al., 2017) and Char2Wav (Sotelo et al., 2017), but
- has a lightweight architecture with just a two-layer BDLSTM encoder and a two-layer LSTM decoder
- uses the guided attention trick (Tachibana et al., 2017), which provides incredibly fast convergence of the attention module (in our experiments we were unable to reach an acceptable model without this trick)
- does not employ any CNN/pre-net or post-net
- uses a simple highway connection from the attention to the output of the decoder (which, as we observed, forces the encoder to actually learn how to produce the mean values of the mel-log spectrum for particular phones/characters)
The initial vocoder was similar to WaveRNN (Kalchbrenner et al., 2018), but instead of modifying the RNN cells (as proposed in their paper), we used two coupled neural networks
We are now using ClariNet (Ping et al., 2018)
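The guided attention trick mentioned in the list above can be sketched as a penalty on attention mass that lies far from the diagonal. The code follows the formulation in Tachibana et al. (2017) as we understand it, with weights W[n][t] = 1 - exp(-(n/N - t/T)^2 / (2 g^2)) and g = 0.2. It is a minimal plain-Python sketch; TTS-Cube's own implementation may differ, and the function names are hypothetical.

```python
import math


def guided_attention_weights(num_chars, num_frames, g=0.2):
    """Penalty matrix: near 0 on the diagonal, approaching 1 far away from it."""
    return [
        [
            1.0 - math.exp(-((n / num_chars - t / num_frames) ** 2) / (2.0 * g * g))
            for t in range(num_frames)
        ]
        for n in range(num_chars)
    ]


def guided_attention_loss(attention, g=0.2):
    """Mean of attention * penalty; attention is a num_chars x num_frames matrix."""
    num_chars = len(attention)
    num_frames = len(attention[0])
    weights = guided_attention_weights(num_chars, num_frames, g)
    total = sum(
        a * w
        for att_row, w_row in zip(attention, weights)
        for a, w in zip(att_row, w_row)
    )
    return total / (num_chars * num_frames)


if __name__ == "__main__":
    size = 5
    diagonal = [[1.0 if n == t else 0.0 for t in range(size)] for n in range(size)]
    reversed_alignment = [row[::-1] for row in diagonal]
    print(guided_attention_loss(diagonal))            # monotonic alignment
    print(guided_attention_loss(reversed_alignment))  # implausible alignment
```

Adding this term to the training loss pushes the attention toward a roughly monotonic, left-to-right alignment between input symbols and output frames, which is the usual shape for speech.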

References

The ParallelWavenet/ClariNet code is adapted from this ClariNet repo.


Owner

Tiberiu Boros
