Bidirectional Variational Inference for Non-Autoregressive Text-to-Speech (BVAE-TTS)

Last update: Dec 05, 2022

Overview

Yoonhyung Lee, Joongbo Shin, Kyomin Jung

Abstract: Although early text-to-speech (TTS) models such as Tacotron 2 have succeeded in generating human-like speech, their autoregressive architectures have several limitations: (1) They require a lot of time to generate a mel-spectrogram consisting of hundreds of steps. (2) The autoregressive speech generation shows a lack of robustness due to its error propagation property. In this paper, we propose a novel non-autoregressive TTS model called BVAE-TTS, which eliminates the architectural limitations and generates a mel-spectrogram in parallel. BVAE-TTS adopts a bidirectional-inference variational autoencoder (BVAE) that learns hierarchical latent representations using both bottom-up and top-down paths to increase its expressiveness. To apply BVAE to TTS, we design our model to utilize text information via an attention mechanism. By using attention maps that BVAE-TTS generates, we train a duration predictor so that the model uses the predicted duration of each phoneme at inference. In experiments conducted on the LJSpeech dataset, we show that our model generates a mel-spectrogram 27 times faster than Tacotron 2 with similar speech quality. Furthermore, our BVAE-TTS outperforms Glow-TTS, which is one of the state-of-the-art non-autoregressive TTS models, in terms of both speech quality and inference speed while having 58% fewer parameters.

One-sentence Summary: In this paper, a novel non-autoregressive text-to-speech model based on a bidirectional-inference variational autoencoder, called BVAE-TTS, is proposed.
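The abstract describes training a duration predictor from attention maps and using per-phoneme durations at inference. The sketch below illustrates that general idea in plain Python. It assumes a common approach in which each mel frame is assigned to the phoneme it attends to most. This is an illustration, not necessarily the exact procedure used in this repository, and the function names `durations_from_attention` and `expand_by_duration` are hypothetical.

```python
def durations_from_attention(attn):
    """Count how many mel frames attend most strongly to each phoneme.

    attn[t][n] is the attention weight of mel frame t on phoneme n.
    Assumption: hard argmax assignment per frame (illustrative only).
    """
    num_phonemes = len(attn[0])
    durations = [0] * num_phonemes
    for frame in attn:
        # Index of the phoneme with the largest weight for this frame.
        best = max(range(num_phonemes), key=lambda n: frame[n])
        durations[best] += 1
    return durations


def expand_by_duration(phoneme_states, durations):
    """Repeat each phoneme state by its duration to reach mel-frame length."""
    expanded = []
    for state, d in zip(phoneme_states, durations):
        expanded.extend([state] * d)
    return expanded


# Toy attention map: 5 mel frames over 3 phonemes.
attn = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.1, 0.7, 0.2],
    [0.0, 0.2, 0.8],
    [0.0, 0.1, 0.9],
]
durations = durations_from_attention(attn)
frames = expand_by_duration(["a", "b", "c"], durations)
```

During training, durations extracted this way could serve as targets for the duration predictor. At inference, the predicted durations would take their place in the expansion step, so no attention map is needed to decide the output length.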

Training

1. Download and extract the LJ Speech dataset.
2. Make a preprocessed folder in the LJSpeech directory and preprocess the data using prepare_data.ipynb.
3. Set the data_path in hparams.py to the preprocessed folder.
4. Train your own BVAE-TTS model:

python train.py --gpu=0 --logdir=baseline
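The first three preparation steps above can be partly checked with a small helper before opening prepare_data.ipynb. This is a minimal sketch and not part of the repository: the function name `make_preprocessed_dir` is hypothetical, and the folder name `preprocessed` is an assumption taken from the wording of the step. The only layout it relies on is that the extracted LJ Speech archive contains `metadata.csv` and a `wavs` directory.

```python
from pathlib import Path


def make_preprocessed_dir(ljspeech_root, name="preprocessed"):
    """Verify the extracted LJ Speech layout and create the output folder.

    Returns the path that data_path in hparams.py should point to.
    The folder name is an assumption; adjust it to match your setup.
    """
    root = Path(ljspeech_root)
    # The extracted archive is expected to hold metadata.csv and wavs/.
    missing = [p for p in ("metadata.csv", "wavs") if not (root / p).exists()]
    if missing:
        raise FileNotFoundError(
            f"{root} does not look like an extracted LJ Speech dataset; "
            f"missing: {', '.join(missing)}"
        )
    out = root / name
    out.mkdir(exist_ok=True)  # safe to call again on later runs
    return out
```

For example, `make_preprocessed_dir("/data/LJSpeech-1.1")` would return `/data/LJSpeech-1.1/preprocessed`, which is the value to use for data_path (the `/data` location here is only a placeholder).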

Pre-trained models

We provide a pre-trained BVAE-TTS model, which is the model you would obtain by training with the current settings (e.g. hyperparameters, dataset split). We also provide the pre-trained WaveGlow model that was used to obtain the audio samples. After downloading the models, you can generate audio samples using inference.ipynb.

Audio Samples

You can hear the audio samples here

Reference

1. NVIDIA/tacotron2: https://github.com/NVIDIA/tacotron2
2. NVIDIA/waveglow: https://github.com/NVIDIA/waveglow
3. pclucas14/iaf-vae: https://github.com/pclucas14/iaf-vae
