Unofficial Pytorch Implementation of WaveGrad2

Overview

WaveGrad 2 — Unofficial PyTorch Implementation

WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis
Unofficial PyTorch+Lightning Implementation of Chen et al.(JHU, Google Brain), WaveGrad2.
Audio Samples: https://mindslab-ai.github.io/wavegrad2/

TODO

  • More training for WaveGrad-Base setup
  • Checkpoint release
  • WaveGrad-Large Decoder
  • Inference by reduced sampling steps

Requirements

Datasets

The supported datasets are

  • LJSpeech: a single-speaker English dataset consists of 13100 short audio clips of a female speaker reading passages from 7 non-fiction books, approximately 24 hours in total.
  • AISHELL-3: a Mandarin TTS dataset with 218 male and female speakers, roughly 85 hours in total.
  • etc.

We take LJSpeech as an example hereafter.

Preprocessing

  • Adjust preprocess.yaml, especially path section.
path:
  corpus_path: '/DATA1/LJSpeech-1.1' # LJSpeech corpus path
  lexicon_path: 'lexicon/librispeech-lexicon.txt'
  raw_path: './raw_data/LJSpeech'
  preprocessed_path: './preprocessed_data/LJSpeech'
  • run prepare_align.py for some preparations.
python prepare_align.py -c preprocess.yaml
  • Montreal Forced Aligner (MFA) is used to obtain the alignments between the utterances and the phoneme sequences. Alignments for the LJSpeech and AISHELL-3 datasets are provided here. You have to unzip the files in preprocessed_data/LJSpeech/TextGrid/.

  • After that, run preprocess.py.

python preprocess.py -c preprocess.yaml
  • Alternately, you can align the corpus by yourself.
  • Download the official MFA package and run it to align the corpus.
./montreal-forced-aligner/bin/mfa_align raw_data/LJSpeech/ lexicon/librispeech-lexicon.txt english preprocessed_data/LJSpeech

or

./montreal-forced-aligner/bin/mfa_train_and_align raw_data/LJSpeech/ lexicon/librispeech-lexicon.txt preprocessed_data/LJSpeech
  • And then run preprocess.py.
python preprocess.py -c preprocess.yaml

Training

  • Adjust hparameter.yaml, especially train section.
train:
  batch_size: 12 # Dependent on GPU memory size
  adam:
    lr: 3e-4
    weight_decay: 1e-6
  decay:
    rate: 0.05
    start: 25000
    end: 100000
  num_workers: 16 # Dependent on CPU cores
  gpus: 2 # number of GPUs
  loss_rate:
    dur: 1.0
  • If you want to train with other dataset, adjust data section in hparameter.yaml
data:
  lang: 'eng'
  text_cleaners: ['english_cleaners'] # korean_cleaners, english_cleaners, chinese_cleaners
  speakers: ['LJSpeech']
  train_dir: 'preprocessed_data/LJSpeech'
  train_meta: 'train.txt'  # relative path of metadata file from train_dir
  val_dir: 'preprocessed_data/LJSpeech'
  val_meta: 'val.txt'  # relative path of metadata file from val_dir'
  lexicon_path: 'lexicon/librispeech-lexicon.txt'
  • run trainer.py
python trainer.py
  • If you want to resume training from checkpoint, check parser.
parser = argparse.ArgumentParser()
parser.add_argument('-r', '--resume_from', type =int,\
	required = False, help = "Resume Checkpoint epoch number")
parser.add_argument('-s', '--restart', action = "store_true",\
	required = False, help = "Significant change occured, use this")
parser.add_argument('-e', '--ema', action = "store_true",
	required = False, help = "Start from ema checkpoint")
args = parser.parse_args()
  • During training, tensorboard logger is logging loss, spectrogram and audio.
tensorboard --logdir=./tensorboard --bind_all

Inference

  • run inference.py
python inference.py -c <checkpoint_path> --text <'text'>

Checkpoint file will be released!

Note

Since this repo is unofficial implementation and WaveGrad2 paper do not provide several details, a slight differences between paper could exist.
We listed modifications or arbitrary setups

  • Normal LSTM without ZoneOut is applied for encoder.
  • g2p_en is applied instead of Google's unknown G2P.
  • Trained with LJSpeech datasdet instead of Google's proprietary dataset.
    • Due to dataset replacement, output audio's sampling rate becomes 22.05kHz instead of 24kHz.
  • MT + SpecAug are not implemented.
  • hyperparameters
    • train.batch_size: 12 for 2 A100 (40GB) GPUs
    • train.adam.lr: 3e-4 and train.adam.weight_decay: 1e-6
    • train.decay learning rate decay is applied during training
    • train.loss_rate: 1 as total_loss = 1 * L1_loss + 1 * duration_loss
    • ddpm.ddpm_noise_schedule: torch.linspace(1e-6, 0.01, hparams.ddpm.max_step)
    • encoder.channel is reduced to 512 from 1024 or 2048
  • Current sample page only contains samples from WaveGrad-Base decoder.
  • TODO things.

Tree

.
├── Dockerfile
├── README.md
├── dataloader.py
├── docs
│   ├── spec.png
│   ├── tb.png
│   └── tblogger.png
├── hparameter.yaml
├── inference.py
├── lexicon
│   ├── librispeech-lexicon.txt
│   └── pinyin-lexicon-r.txt
├── lightning_model.py
├── model
│   ├── base.py
│   ├── downsampling.py
│   ├── encoder.py
│   ├── gaussian_upsampling.py
│   ├── interpolation.py
│   ├── layers.py
│   ├── linear_modulation.py
│   ├── nn.py
│   ├── resampling.py
│   ├── upsampling.py
│   └── window.py
├── prepare_align.py
├── preprocess.py
├── preprocess.yaml
├── preprocessor
│   ├── ljspeech.py
│   └── preprocessor.py
├── text
│   ├── __init__.py
│   ├── cleaners.py
│   ├── cmudict.py
│   ├── numbers.py
│   └── symbols.py
├── trainer.py
├── utils
│   ├── mel.py
│   ├── stft.py
│   ├── tblogger.py
│   └── utils.py
└── wavegrad2_tester.ipynb

Author

This code is implemented by

Special thanks to

References

This implementation uses code from following repositories:

The webpage for the audio samples uses a template from:

The audio samples on our webpage(TBD) are partially derived from:

  • LJSpeech: a single-speaker English dataset consists of 13100 short audio clips of a female speaker reading passages from 7 non-fiction books, approximately 24 hours in total.
  • WaveGrad2 Official Github.io
Owner
MINDs Lab
MINDsLab provides AI platform and various AI engines based on deep machine learning.
MINDs Lab
PyTorch implementation of ICLR 2022 paper PiCO: Contrastive Label Disambiguation for Partial Label Learning

PiCO: Contrastive Label Disambiguation for Partial Label Learning This is a PyTorch implementation of ICLR 2022 Oral paper PiCO; also see our Project

王皓波 147 Jan 07, 2023
Pytoydl: A toy deep learning framework built upon numpy.

Documents: https://pytoydl.readthedocs.io/zh/latest/ Pytoydl A toy deep learning framework built upon numpy. You can star this repository to keep trac

28 Dec 10, 2022
Generating Digital Painting Lighting Effects via RGB-space Geometry (SIGGRAPH2020/TOG2020)

Project PaintingLight PaintingLight is a project conducted by the Style2Paints team, aimed at finding a method to manipulate the illumination in digit

651 Dec 29, 2022
Fast, modular reference implementation and easy training of Semantic Segmentation algorithms in PyTorch.

TorchSeg This project aims at providing a fast, modular reference implementation for semantic segmentation models using PyTorch. Highlights Modular De

ycszen 1.4k Jan 02, 2023
Predicting 10 different clothing types using Xception pre-trained model.

Predicting-Clothing-Types Predicting 10 different clothing types using Xception pre-trained model from Keras library. It is reimplemented version from

AbdAssalam Ahmad 3 Dec 29, 2021
This repository contains the code for "SBEVNet: End-to-End Deep Stereo Layout Estimation" paper by Divam Gupta, Wei Pu, Trenton Tabor, Jeff Schneider

SBEVNet: End-to-End Deep Stereo Layout Estimation This repository contains the code for "SBEVNet: End-to-End Deep Stereo Layout Estimation" paper by D

Divam Gupta 19 Dec 17, 2022
Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations.

Pyserini Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations. Retrieval using sparse re

Castorini 706 Dec 29, 2022
Resilience from Diversity: Population-based approach to harden models against adversarial attacks

Resilience from Diversity: Population-based approach to harden models against adversarial attacks Requirements To install requirements: pip install -r

0 Nov 23, 2021
Evaluating saliency methods on artificial data with different background types

Evaluating saliency methods on artificial data with different background types This repository contains the relevant code for the MedNeurips 2021 subm

2 Jul 05, 2022
Piotr - IoT firmware emulation instrumentation for training and research

Piotr: Pythonic IoT exploitation and Research Introduction to Piotr Piotr is an emulation helper for Qemu that provides a convenient way to create, sh

Damien Cauquil 51 Nov 09, 2022
Python PID Tuner - Based on a FOPDT model obtained using a Open Loop Process Reaction Curve

PythonPID_Tuner Step 1: Takes a Process Reaction Curve in csv format - assumes data at 100ms interval (column names CV and PV) Step 2: Makes a rough e

6 Jan 14, 2022
Official implementation of the NeurIPS 2021 paper Online Learning Of Neural Computations From Sparse Temporal Feedback

Online Learning Of Neural Computations From Sparse Temporal Feedback This repository is the official implementation of the NeurIPS 2021 paper Online L

Lukas Braun 3 Dec 15, 2021
🥈78th place in Riiid Solution🥈

Riiid Answer Correctness Prediction Introduction This repository is the code that placed 78th in Riiid Answer Correctness Prediction competition. Requ

ds wook 14 Apr 26, 2022
Modeling Category-Selective Cortical Regions with Topographic Variational Autoencoders

Modeling Category-Selective Cortical Regions with Topographic Variational Autoencoders

1 Oct 11, 2021
Code for One-shot Talking Face Generation from Single-speaker Audio-Visual Correlation Learning (AAAI 2022)

One-shot Talking Face Generation from Single-speaker Audio-Visual Correlation Learning (AAAI 2022) Paper | Demo Requirements Python = 3.6 , Pytorch

FuxiVirtualHuman 84 Jan 03, 2023
This repository contains the code for the CVPR 2021 paper "GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields"

GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields Project Page | Paper | Supplementary | Video | Slides | Blog | Talk If

1.1k Dec 30, 2022
Implementation for the "Surface Reconstruction from 3D Line Segments" paper.

Surface Reconstruction from 3D Line Segments Surface reconstruction from 3d line segments. Langlois, P. A., Boulch, A., & Marlet, R. In 2019 Internati

85 Jan 04, 2023
Implementation of algorithms for continuous control (DDPG and NAF).

DEPRECATION This repository is deprecated and is no longer maintaned. Please see a more recent implementation of RL for continuous control at jax-sac.

Ilya Kostrikov 288 Dec 31, 2022
M3DSSD: Monocular 3D Single Stage Object Detector

M3DSSD: Monocular 3D Single Stage Object Detector Setup pytorch 0.4.1 Preparation Download the full KITTI detection dataset. Then place a softlink (or

mumianyuxin 64 Dec 27, 2022