Speech Resynthesis from Discrete Disentangled Self-Supervised Representations

Implementation of the method described in the paper Speech Resynthesis from Discrete Disentangled Self-Supervised Representations.

Abstract: We propose using self-supervised discrete representations for the task of speech resynthesis. To generate disentangled representation, we separately extract low-bitrate representations for speech content, prosodic information, and speaker identity. This allows to synthesize speech in a controllable manner. We analyze various state-of-the-art, self-supervised representation learning methods and shed light on the advantages of each method while considering reconstruction quality and disentanglement properties. Specifically, we evaluate the F0 reconstruction, speaker identification performance (for both resynthesis and voice conversion), recordings' intelligibility, and overall quality using subjective human evaluation. Lastly, we demonstrate how these representations can be used for an ultra-lightweight speech codec. Using the obtained representations, we can get to a rate of 365 bits per second while providing better speech quality than the baseline methods.

Setup

Software

Requirements:

  • Python >= 3.6
  • PyTorch v1.8
  • Install dependencies
    git clone https://github.com/facebookresearch/speech-resynthesis.git
    cd speech-resynthesis
    pip install -r requirements.txt

Data

For LJSpeech:

  1. Download the LJSpeech dataset into the data/LJSpeech-1.1 folder.
  2. Downsample audio from 22.05 kHz to 16 kHz and pad:
    python ./scripts/preprocess.py \
    --srcdir data/LJSpeech-1.1/wavs \
    --outdir data/LJSpeech-1.1/wavs_16khz \
    --pad
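
To make the preprocessing step concrete, the following is a minimal sketch of what "downsample and pad" means on a list of samples. It is not the code in scripts/preprocess.py: the function names, the linear-interpolation resampler (which has no anti-aliasing filter), and the block size of 1280 samples are all assumptions chosen for illustration, and the actual script may resample and pad differently.

```python
from typing import List


def resample_linear(samples: List[float], src_rate: int, dst_rate: int) -> List[float]:
    """Resample by linear interpolation (illustrative only, no anti-aliasing)."""
    if not samples:
        return []
    n_out = len(samples) * dst_rate // src_rate
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # fractional index into the source
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)     # clamp at the last sample
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


def pad_to_multiple(samples: List[float], multiple: int) -> List[float]:
    """Zero-pad the end so the length becomes a multiple of `multiple`."""
    remainder = len(samples) % multiple
    if remainder == 0:
        return list(samples)
    return list(samples) + [0.0] * (multiple - remainder)


# One second of a 22.05 kHz ramp, resampled to 16 kHz and then padded.
one_second = [i / 22050 for i in range(22050)]
resampled = resample_linear(one_second, 22050, 16000)
padded = pad_to_multiple(resampled, 1280)      # 1280 is a hypothetical block size
print(len(resampled), len(padded))
```

For real audio, a band-limited resampler from an audio library is the safer choice, since plain interpolation lets high frequencies alias into the downsampled signal.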
    

For VCTK:

  1. Download the VCTK dataset into the data/VCTK-Corpus folder.
  2. Downsample audio from 48 kHz to 16 kHz, trim trailing silences, and pad:
    python ./scripts/preprocess.py \
    --srcdir data/VCTK-Corpus/wav48_silence_trimmed \
    --outdir data/VCTK-Corpus/wav16_silence_trimmed_padded \
    --pad --postfix mic2.flac

Training

F0 Quantizer Model

To train the F0 quantizer model, use the following command:

python -m torch.distributed.launch --nproc_per_node 8 train_f0_vq.py \
--checkpoint_path checkpoints/lj_f0_vq \
--config configs/LJSpeech/f0_vqvae.json

Set --nproc_per_node to the number of available GPUs on your machine. The example above uses 8; later commands show this value as <NUM_GPUS>.

Resynthesis Model

To train a resynthesis model, use the following command:

python -m torch.distributed.launch --nproc_per_node <NUM_GPUS> train.py \
--checkpoint_path checkpoints/lj_vqvae \
--config configs/LJSpeech/vqvae256_lut.json
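
The training commands in this section differ only in the script, the checkpoint path, the config, and the GPU count. As a convenience, they could be assembled with a small helper like the one below. The helper name `build_launch_command` is hypothetical and not part of this repository; only the command-line arguments shown above are taken from this README.

```python
import shlex
from typing import List


def build_launch_command(script: str, checkpoint_path: str,
                         config: str, num_gpus: int) -> List[str]:
    """Assemble the distributed launch command as an argument list."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return [
        "python", "-m", "torch.distributed.launch",
        "--nproc_per_node", str(num_gpus),
        script,
        "--checkpoint_path", checkpoint_path,
        "--config", config,
    ]


# Resynthesis model on LJSpeech with two GPUs.
cmd = build_launch_command(
    "train.py",
    "checkpoints/lj_vqvae",
    "configs/LJSpeech/vqvae256_lut.json",
    num_gpus=2,
)
command_line = " ".join(shlex.quote(arg) for arg in cmd)
print(command_line)
```

The resulting list can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting issues in paths.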

Supported Configurations

Currently, we support the following training schemes:

| Dataset  | SSL Method | Dictionary Size | Config Path                         |
|----------|------------|-----------------|-------------------------------------|
| LJSpeech | HuBERT     | 100             | configs/LJSpeech/hubert100_lut.json |
| LJSpeech | CPC        | 100             | configs/LJSpeech/cpc100_lut.json    |
| LJSpeech | VQVAE      | 256             | configs/LJSpeech/vqvae256_lut.json  |
| VCTK     | HuBERT     | 100             | configs/VCTK/hubert100_lut.json     |
| VCTK     | CPC        | 100             | configs/VCTK/cpc100_lut.json        |
| VCTK     | VQVAE      | 256             | configs/VCTK/vqvae256_lut.json      |
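
When scripting experiments, the supported combinations can be kept in a lookup table so that an unsupported pairing fails early. The sketch below copies the config paths from the list above; the names `SUPPORTED_CONFIGS` and `config_path` are hypothetical and do not exist in this repository.

```python
from typing import Dict, Tuple

# (dataset, SSL method) -> config path, as listed in this README.
SUPPORTED_CONFIGS: Dict[Tuple[str, str], str] = {
    ("LJSpeech", "HuBERT"): "configs/LJSpeech/hubert100_lut.json",
    ("LJSpeech", "CPC"): "configs/LJSpeech/cpc100_lut.json",
    ("LJSpeech", "VQVAE"): "configs/LJSpeech/vqvae256_lut.json",
    ("VCTK", "HuBERT"): "configs/VCTK/hubert100_lut.json",
    ("VCTK", "CPC"): "configs/VCTK/cpc100_lut.json",
    ("VCTK", "VQVAE"): "configs/VCTK/vqvae256_lut.json",
}


def config_path(dataset: str, ssl_method: str) -> str:
    """Return the config path for a supported dataset/method pair."""
    key = (dataset, ssl_method)
    if key not in SUPPORTED_CONFIGS:
        supported = ", ".join(f"{d}/{m}" for d, m in sorted(SUPPORTED_CONFIGS))
        raise KeyError(
            f"Unsupported combination {dataset}/{ssl_method}; supported: {supported}"
        )
    return SUPPORTED_CONFIGS[key]


print(config_path("VCTK", "CPC"))
```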

Inference

To generate, simply run:

python inference.py \
--checkpoint_file checkpoints/vctk_cpc100 \
-n 10 \
--output_dir generations

To synthesize multiple speakers:

python inference.py \
--checkpoint_file checkpoints/vctk_cpc100 \
-n 10 \
--vc \
--input_code_file datasets/VCTK/cpc100/test.txt \
--output_dir generations_multispkr

You can also generate with codes from a different dataset:

python inference.py \
--checkpoint_file checkpoints/lj_cpc100 \
-n 10 \
--input_code_file datasets/VCTK/cpc100/test.txt \
--output_dir generations_vctk_to_lj

Preprocessing New Datasets

CPC / HuBERT Coding

To quantize new datasets with CPC or HuBERT, follow the instructions in the GSLM code.

To parse CPC output:

python scripts/parse_cpc_codes.py \
--manifest cpc_output_file \
--wav-root wav_root_dir \
--outdir parsed_cpc

To parse HuBERT output:

python parse_hubert_codes.py \
--codes hubert_output_file \
--manifest hubert_tsv_file \
--outdir parsed_hubert 

VQVAE Coding

First, download the LibriLight dataset and move it to data/LibriLight.

For VQVAE, train a VQVAE model using the following command:

python -m torch.distributed.launch --nproc_per_node <NUM_GPUS> train.py \
--checkpoint_path checkpoints/ll_vq \
--config configs/LibriLight/vqvae256.json

To extract VQVAE codes:

python infer_vqvae_codes.py \
--input_dir folder_with_wavs_to_code \
--output_dir vqvae_output_folder \
--checkpoint_file checkpoints/ll_vq

To parse VQVAE output:

python parse_vqvae_codes.py \
--manifest vqvae_output_file \
--outdir parsed_vqvae

License

See the license file in the repository for details.

Citation

@inproceedings{polyak21_interspeech,
  author={Adam Polyak and Yossi Adi and Jade Copet and 
          Eugene Kharitonov and Kushal Lakhotia and 
          Wei-Ning Hsu and Abdelrahman Mohamed and Emmanuel Dupoux},
  title={{Speech Resynthesis from Discrete Disentangled Self-Supervised Representations}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
}

Acknowledgements

This implementation uses code from the following repos: HiFi-GAN and Jukebox, as described in our code.
