Neural Lexicon Reader: Reduce Pronunciation Errors in End-to-end TTS by Leveraging External Textual Knowledge

Last update: Oct 14, 2022

Overview

Neural Lexicon Reader: Reduce Pronunciation Errors in End-to-end TTS by Leveraging External Textual Knowledge

This is an implementation of the paper, along with the pipeline and pretrained model using an open dataset. Audio samples of the paper is available here.

Recipe

This open pipeline uses the Databaker dataset. Please refer to our previous pipeline for dataset preprocessing, while only the Databaker dataset is used. Besides, you need to run lexicon/build_databaker.py to build the vocabulary, download the lexicon from zdic.net, and encode them with XLM-R. Feel free to change the target directory to save the data, which is specified in build_databaker.py and lexicon_utils.py.

Below are the commands to train and evaluate. Default target directories specified in the preprocessing scripts are used, so please substitute them with your own. The evaluation script can be run simultaneously with the training script. You may also use the evaluation script to synthesize samples from pretrained models. Please refer to the help of the arguments for their meanings.

python -m torch.distributed.launch --nproc_per_node=NGPU --model-dir=MODEL_DIR --log-dir=LOG_DIR --data-dir=D:\free_corpus\packed\ --training_languages=zh-cn --eval_languages=zh-cn --training_speakers=databaker --eval_steps=100000:150000 --hparams="input_method=char,multi_speaker=True,use_knowledge_attention=True,remove_space=True,data_format=nlti" --external_embed=D:\free_corpus\packed\embed.zip --vocab=D:\free_corpus\packed\db_vocab.json

python eval.py --model-dir=MODEL_DIR --log-dir=LOG_DIR --data-dir=D:\free_corpus\packed\ --eval_languages=zh-cn --eval_meta=D:\free_corpus\packed\metadata.eval.txt --hparams="input_method=char,multi_speaker=True,use_knowledge_attention=True,remove_space=True,data_format=nlti" --start_step=100000 --vocab=D:\free_corpus\packed\db_vocab.json --external_embed=D:\free_corpus\packed\embed.zip --eval_speakers=databaker

Besides, to report CER, you need to create azure_key.json with your own Azure STT subscription, with content of {"subscription": "YOUR_KEY", "region": "YOUR_REGION"}, see utils/transcribe.py. Due to significant differences of the datasets used, the implementation is for demonstration only and could not fully reproduce the results in the paper.

Pretrained Model

The pretrained models on Databaker are available at OneDrive Link, which reaches a CER of 4.19%. Relevant files necessary for generation of speeches including lexicon texts, lexicon embeddings, the vocabulary file, and evaluation scripts are also included to aid fast reproduction.

Neural Lexicon Reader: Reduce Pronunciation Errors in End-to-end TTS by Leveraging External Textual Knowledge

Related tags

Overview

Neural Lexicon Reader: Reduce Pronunciation Errors in End-to-end TTS by Leveraging External Textual Knowledge

Recipe

Pretrained Model

Owner

Mutian He

Material del curso IIC2233 Programación Avanzada 📚

Research code of ICCV 2021 paper "Mesh Graphormer"

Official Tensorflow implementation of "M-LSD: Towards Light-weight and Real-time Line Segment Detection"

Object detection and instance segmentation toolkit based on PaddlePaddle.

[NeurIPS 2020] Code for the paper "Balanced Meta-Softmax for Long-Tailed Visual Recognition"

Binary Stochastic Neurons in PyTorch

Code for KDD'20 "An Efficient Neighborhood-based Interaction Model for Recommendation on Heterogeneous Graph"

Construct a neural network frame by Numpy

Official repo of the paper "Surface Form Competition: Why the Highest Probability Answer Isn't Always Right"

Official PyTorch implementation of the paper: DeepSIM: Image Shape Manipulation from a Single Augmented Training Sample

PyTorch implementation of "ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context" (INTERSPEECH 2020)

A video scene detection algorithm is designed to detect a variety of different scenes within a video

The official implementation of CSG-Stump: A Learning Friendly CSG-Like Representation for Interpretable Shape Parsing

A python package to perform same transformation to coco-annotation as performed on the image.

E2e music remastering system - End-to-end Music Remastering System Using Self-supervised and Adversarial Training

Restricted Boltzmann Machines in Python.

An end-to-end library for editing and rendering motion of 3D characters with deep learning [SIGGRAPH 2020]

Omnidirectional Scene Text Detection with Sequential-free Box Discretization (IJCAI 2019). Including competition model, online demo, etc.

Evaluation Pipeline for our ECCV2020: Journey Towards Tiny Perceptual Super-Resolution.

Repository for the NeurIPS 2021 paper: "Exploiting Domain-Specific Features to Enhance Domain Generalization".