YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone

Overview

In our recent paper, we propose YourTTS, a model that brings the power of a multilingual approach to the task of zero-shot multi-speaker TTS. Our method builds upon the VITS model and adds several novel modifications for zero-shot multi-speaker and multilingual training. We achieved state-of-the-art (SOTA) results in zero-shot multi-speaker TTS and results comparable to SOTA in zero-shot voice conversion on the VCTK dataset. Additionally, our approach achieves promising results in a target language with only a single-speaker dataset, opening possibilities for zero-shot multi-speaker TTS and zero-shot voice conversion systems in low-resource languages. Finally, it is possible to fine-tune the YourTTS model with less than one minute of speech and achieve state-of-the-art results in voice similarity with reasonable quality. This is important for enabling synthesis for speakers whose voice or recording characteristics differ greatly from those seen during training.

Audio samples

Visit our website for audio samples.

Implementation

All of our experiments were implemented in the Coqui TTS repo (still an open PR at the time of writing).
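
Once the PR is merged into a Coqui TTS release, loading YourTTS through the Python API should look roughly like the sketch below. The model name follows Coqui's released-model naming scheme and is an assumption until the release ships; verify with `tts --list_models`.

    # Sketch: zero-shot multi-speaker synthesis with the Coqui TTS Python API.
    from TTS.api import TTS

    tts = TTS(model_name="tts_models/multilingual/multi-dataset/your_tts")
    tts.tts_to_file(
        text="This is zero-shot voice cloning.",
        speaker_wav="reference_speaker.wav",  # a few seconds of the target voice
        language="en",
        file_path="output.wav",
    )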

Colab Demos

Demo            URL
Zero-Shot TTS   link
Zero-Shot VC    link

Checkpoints

All the released checkpoints are licensed under CC BY-NC-ND 4.0.

Model                                       URL
Speaker Encoder                             link
Exp 1. YourTTS-EN(VCTK)                     link
Exp 1. YourTTS-EN(VCTK) + SCL               link
Exp 2. YourTTS-EN(VCTK)-PT                  link
Exp 2. YourTTS-EN(VCTK)-PT + SCL            link
Exp 3. YourTTS-EN(VCTK)-PT-FR               link
Exp 3. YourTTS-EN(VCTK)-PT-FR SCL           link
Exp 4. YourTTS-EN(VCTK+LibriTTS)-PT-FR SCL  link

Results replicability

To ensure replicability, we make the audio samples used to compute the MOS available here. In addition, we provide the MOS score for each audio sample here.

To reproduce our MOS results, follow the instructions here. To synthesize the test sentences and compute the SECS, please use the Jupyter notebooks available here.
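
For reference, SECS (Speaker Encoder Cosine Similarity) is the cosine similarity between the speaker-encoder embeddings of the reference and synthesized audio; it ranges from -1 to 1, with higher values meaning more similar voices. A minimal sketch of the metric itself (embedding extraction is done with the speaker encoder from the notebooks):

    # Sketch: SECS between two speaker embeddings.
    import numpy as np

    def secs(ref_emb, gen_emb):
        ref = ref_emb / np.linalg.norm(ref_emb)
        gen = gen_emb / np.linalg.norm(gen_emb)
        return float(np.dot(ref, gen))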

Comments
  • Languages other than PT, FR, EN

    Since YourTTS is a multilingual TTS model, it seems that other languages could be supported by training on additional datasets. However, YourTTS's checkpoint structure seems distinctive. Are there any training procedures I can refer to?

    opened by papercore-dev 7
  • Issue with Input type and weight type should be the same

    Hi,

    I am trying to train YourTTS on my own dataset. So I followed your helpful guide with the latest stable version of Coqui TTS (0.8.0).

    After computing the embeddings (on GPU) without issue, I run into this RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same.

    I have already trained a VITS model with this dataset, so everything is already set up. I understand that the input tensor resides on the GPU whereas the weight tensor resides on the CPU, but how can I solve this? Should I downgrade to Coqui TTS 0.6.2? (A possible workaround is sketched after this issue.)

    Here is the full traceback:

    File "/home/caraduf/YourTTS/yourtts_env/lib/python3.10/site-packages/trainer/trainer.py", line 1533, in fit
        self._fit()
      File "/home/caraduf/YourTTS/yourtts_env/lib/python3.10/site-packages/trainer/trainer.py", line 1517, in _fit
        self.train_epoch()
      File "/home/caraduf/YourTTS/yourtts_env/lib/python3.10/site-packages/trainer/trainer.py", line 1282, in train_epoch
        _, _ = self.train_step(batch, batch_num_steps, cur_step, loader_start_time)
      File "/home/caraduf/YourTTS/yourtts_env/lib/python3.10/site-packages/trainer/trainer.py", line 1135, in train_step
        outputs, loss_dict_new, step_time = self._optimize(
      File "/home/caraduf/YourTTS/yourtts_env/lib/python3.10/site-packages/trainer/trainer.py", line 996, in _optimize
        outputs, loss_dict = self._model_train_step(batch, model, criterion, optimizer_idx=optimizer_idx)
      File "/home/caraduf/YourTTS/yourtts_env/lib/python3.10/site-packages/trainer/trainer.py", line 954, in _model_train_step
        return model.train_step(*input_args)
      File "/home/caraduf/YourTTS/TTS/TTS/tts/models/vits.py", line 1250, in train_step
        outputs = self.forward(
      File "/home/caraduf/YourTTS/TTS/TTS/tts/models/vits.py", line 1049, in forward
        pred_embs = self.speaker_manager.encoder.forward(wavs_batch, l2_norm=True)
      File "/home/caraduf/YourTTS/TTS/TTS/encoder/models/resnet.py", line 169, in forward
        x = self.torch_spec(x)
      File "/home/caraduf/YourTTS/yourtts_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/caraduf/YourTTS/yourtts_env/lib/python3.10/site-packages/torch/nn/modules/container.py", line 139, in forward
        input = module(input)
      File "/home/caraduf/YourTTS/yourtts_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/caraduf/YourTTS/TTS/TTS/encoder/models/base_encoder.py", line 22, in forward
        return torch.nn.functional.conv1d(x, self.filter).squeeze(1)
    RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same
    

    Thanks for helping me out!

    opened by Ca-ressemble-a-du-fake 6
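
    A plausible cause of the error above is that the speaker encoder used for the speaker consistency loss stays on the CPU while the input batch lives on the GPU. Below is a minimal workaround sketch, using the attribute names from the traceback; it is a hypothetical patch, not an official Coqui TTS fix.

        # Sketch: move the speaker encoder onto the training device before fitting.
        # `model` is the VITS/YourTTS instance from the traceback above; treat this
        # as a hypothetical workaround, not an upstream fix.
        import torch

        def move_speaker_encoder(model, device=None):
            device = device or torch.device("cuda" if torch.cuda.is_available() else "cpu")
            model.speaker_manager.encoder.to(device)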
  • Speaker Encoder training on a new language

    Hi, can you elaborate on where you got the Speaker Encoder and how you trained it with additional languages? How do you use a Wav2Vec model trained with fairseq? In config_se.json the run description reads: "resnet speaker encoder trained with commonvoice all languages dev and train, Voxceleb 1 dev and Voxceleb 2 dev". Which languages are included in this Common Voice set, and which version of Common Voice was used for this training? Thanks.

    opened by ikcla 5
  • YourTTS_zeroshot_VC_demo.ipynb

    Hi! I am trying to run YourTTS_zeroshot_VC_demo.ipynb, and there seem to have been access changes to the file best_model.pth.tar. I am downloading it right now and will manually upload it so that I can run the notebook, but could you kindly fix the access rights so that others can easily run it as before? Thank you in advance!

    opened by stalevna 5
  • train our own voice model

    Hi,

    I have found your repo very interesting, so I am trying it out. I am curious about training on our own voice files to create a checkpoint without involving text (as I have seen in previous issues referencing Coqui model training) and without altering config.json. Can you please guide us on how to proceed?

    opened by chandrakanthlns 4
  • Train YourTTS on another language

    Good day!

    I have several questions, could you please help?

    Do I understand correctly that if I want to train the model on another language, it is better to fine-tune this model (YourTTS-EN(VCTK+LibriTTS)-PT-FR SCL): https://drive.google.com/drive/folders/15G-QS5tYQPkqiXfAdialJjmuqZV0azQV? Or is it better to use other checkpoints?

    How many hours of audio are needed to achieve adequate quality?

    I planned to use the Common Voice Corpus to fine-tune the model on a new language; however, the audio format is mp3, not wav. Do I need to convert all the audio files, or can I use the mp3 format directly? If conversion is needed, how? (See the sketch after this issue.)

    Thank you for your time in advance!

    opened by annaklyueva 4
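
    On the mp3 question above: Coqui TTS recipes generally expect wav input, so one option is to batch-convert the Common Voice clips before training. A sketch using librosa and soundfile (both common audio dependencies; pick the sample rate your training config expects, 16 kHz here is only an example):

        # Sketch: convert a folder of Common Voice mp3 clips to mono wav files.
        # librosa decodes mp3 via its audioread/ffmpeg backends.
        from pathlib import Path

        import librosa
        import soundfile as sf

        def mp3_dir_to_wav(src_dir, dst_dir, sample_rate=16000):
            dst = Path(dst_dir)
            dst.mkdir(parents=True, exist_ok=True)
            for mp3 in Path(src_dir).glob("*.mp3"):
                audio, _ = librosa.load(mp3, sr=sample_rate, mono=True)  # decode + resample
                sf.write(dst / (mp3.stem + ".wav"), audio, sample_rate)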
  • Select Speakers for Zero Shot TTS

    Hi,

    Firstly, great work on the project; with time I am understanding the repo more clearly. I wanted to know how I can select different speakers for different sections of text (see the sketch after this issue).

    Thanks in advance.

    opened by dipanjannC 4
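
    On selecting speakers per section: with the Coqui TTS Python API, one way is to synthesize each section with a different speaker and join the files afterwards. A sketch, assuming the released multilingual YourTTS checkpoint; the speaker names below are placeholders, so inspect tts.speakers for the real ones:

        # Sketch: different speakers for different sections of text.
        from TTS.api import TTS

        tts = TTS(model_name="tts_models/multilingual/multi-dataset/your_tts")
        print(tts.speakers)  # list the available speaker names first

        sections = [
            ("speaker_a", "First section, read by one voice."),   # placeholder names
            ("speaker_b", "Second section, read by another voice."),
        ]
        for i, (speaker, text) in enumerate(sections):
            tts.tts_to_file(text=text, speaker=speaker, language="en",
                            file_path=f"section_{i}.wav")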
  • From which version does Coqui TTS start supporting voice conversion and cloning?

    Hi @Edresson, I am fairly new to the field, so please forgive the naive question. I am trying to use the voice cloning feature. I trained a model on Coqui TTS version 0.6 and, in that installed environment, I am using the command below to perform the cloning, but it gives an error that the tts command does not expect "reference_wav":

        tts --model_path trained_model/best_model.pth.tar --config_path trained_model/config.json --speaker_idx "icici" --out_path output.wav --reference_wav target_content/asura_10secs.wav

    This might be because voice conversion was not supported in that version. Can you please confirm? Also, the model trained on version 0.6 doesn't run with the latest version and ends up in a dimension mismatch error, which I assume is due to a model structure change. Please shed some light on this; it'll be really helpful. (A CLI sketch for recent versions follows this issue.)

    opened by tieincred 3
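
    On the version question above: recent Coqui TTS releases expose voice conversion with the released YourTTS model through the tts CLI, roughly as below (flag names assumed from the current `tts --help`; verify against your installed version):

        tts --model_name "tts_models/multilingual/multi-dataset/your_tts" \
            --speaker_wav target_voice.wav \
            --reference_wav source_content.wav \
            --language_idx "en" \
            --out_path converted.wav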
  • finetune VC on my voice

    I would like to fine-tune YourTTS voice conversion on my own voice and compare it to the zero-shot model. Could you provide the fine-tuning procedure for VC?

    opened by odeliazavlianovSC 3
  • Exp 1. YourTTS-EN(VCTK) + SCL (speaker encoder layers are not initialized)

    I tried to run an experiment similar to Exp 1. YourTTS-EN(VCTK) + SCL, setting use_speaker_encoder_as_loss=true, speaker_encoder_loss_alpha=9.0, speaker_encoder_config_path, and speaker_encoder_model_path (downloaded from your Google Drive).

    So my config file is almost identical to the one you have for the experiment (I don't have fine_tuning_mode=0, but I checked and 0 means disabled, so it shouldn't affect anything; also use_speaker_embedding=false, otherwise it complains that vectors are initialized).

    My problem is that when I print out the model weight keys of your model and mine, the speaker encoder layers are missing from mine. They are not initialized for some reason. Unfortunately, I have no idea why this could be happening :( Could you maybe point out a direction and what I could check? (A key-inspection sketch follows this issue.)

      "use_sdp": true,
        "noise_scale": 1.0,
        "inference_noise_scale": 0.667,
        "length_scale": 1,
        "noise_scale_dp": 1.0,
        "inference_noise_scale_dp": 0.8,
        "max_inference_len": null,
        "init_discriminator": true,
        "use_spectral_norm_disriminator": false,
        "use_speaker_embedding": true,
        "num_speakers": 97,
        "speakers_file": null,
        "d_vector_file": "../speaker_embeddings/new-SE/VCTK+TTS-PT+MAILABS-FR/speakers.json",
        "speaker_embedding_channels": 512,
        "use_d_vector_file": true,
        "d_vector_dim": 512,
        "detach_dp_input": true,
        "use_language_embedding": false,
        "embedded_language_dim": 4,
        "num_languages": 0,
        "use_speaker_encoder_as_loss": true,
        "speaker_encoder_config_path": "../checkpoints/Speaker_Encoder/Resnet-original-paper/config.json",
        "speaker_encoder_model_path": "../checkpoints/Speaker_Encoder/Resnet-original-paper/converted_checkpoint.pth.tar",
        "fine_tuning_mode": 0,
        "freeze_encoder": false,
        "freeze_DP": false,
        "freeze_PE": false,
        "freeze_flow_decoder": false,
        "freeze_waveform_decoder": false
    
    opened by stalevna 3
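
    For the issue above, a quick check is to diff the checkpoint keys directly and see whether any speaker-encoder parameters were saved at all. A sketch, assuming the trainer checkpoint stores the state dict under a "model" key as Coqui's trainer does; adjust the substring to the layer names you saw in the working checkpoint:

        # Sketch: inspect a checkpoint for speaker-encoder parameters.
        import torch

        ckpt = torch.load("best_model.pth.tar", map_location="cpu")
        enc_keys = [k for k in ckpt["model"] if "speaker_encoder" in k]
        print(f"{len(enc_keys)} speaker-encoder parameter tensors found")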
  • Zeroshot TTS notebook no longer working

    Hi @Edresson @WeberJulian

    the demo notebook is no longer working with the current TTS master repo.

    I'm having a hard time executing things.

    Do you intend to adjust it? Thanks.

    opened by vince62s 3
Owner
Edresson Casanova
Computer Science PhD Student