An implementation of the Contrastive Predictive Coding (CPC) method to train audio features in an unsupervised fashion.

CPC_audio

Overview

This code implements the Contrastive Predictive Coding algorithm on audio data, as described in the paper Unsupervised Pretraining Transfers well Across Languages. This is an unsupervised method to train audio features directly from the raw waveform.

Moreover, this code also implements all the evaluation metrics used in the paper: linear separability (speaker and phone), ABX score, and phone error rate (PER).

Setup instructions

The installation is a tiny bit involved due to the torch-audio dependency.

0/ Clone the repo: git clone git@github.com:facebookresearch/CPC_audio.git && cd CPC_audio

1/ Install the libraries required by torch-audio (https://github.com/pytorch/audio):

  • MacOS: brew install sox
  • Linux: sudo apt-get install sox libsox-dev libsox-fmt-all

2/ conda env create -f environment.yml && conda activate cpc37

3/ Run setup.py: python setup.py develop

You can test your installation with: nosetests -d

CUDA driver

This setup is given for CUDA 9.2. If you use a different version of CUDA, please change the version of cudatoolkit in environment.yml. For more information on which cudatoolkit version to use, please check https://pytorch.org/

Standard datasets

We suggest training the model on either Librispeech or Libri-light.

How to run a session

To run a new training session, use:

python cpc/train.py --pathDB $PATH_AUDIO_FILES --pathCheckpoint $PATH_CHECKPOINT_DIR --pathTrain $TRAINING_SET --pathVal $VAL_SET --file_extension $EXTENSION

Where:

  • $PATH_AUDIO_FILES is the directory containing the audio files. The files should be arranged as below:
PATH_AUDIO_FILES  
│
└───speaker1
│   └───...
│         │   seq_11.{$EXTENSION}
│         │   seq_12.{$EXTENSION}
│         │   ...
│   
└───speaker2
    └───...
          │   seq_21.{$EXTENSION}
          │   seq_22.{$EXTENSION}

Please note that each speaker directory can contain an arbitrary number of subdirectories: the speaker label will always be retrieved from the top one. The names of the files aren't relevant. For a concrete example, you can look at the organization of the Librispeech dataset.

  • $PATH_CHECKPOINT_DIR is the directory where the checkpoints will be saved
  • $TRAINING_SET is a path to a .txt file containing the list of the training sequences (see here for an example, and the sketch after this list)
  • $VAL_SET is a path to a .txt file containing the list of the validation sequences
  • $EXTENSION is the extension of each audio file
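
If you need to build these split files yourself, here is a minimal sketch (not part of the repository). It assumes one sequence name per line, i.e. the file name without its extension; please check this against the example split files linked above.

import os
import random

# Hypothetical helper: collect every sequence under path_db and write
# random train / validation lists, one sequence name per line (assumed format).
def write_seq_lists(path_db, extension, train_ratio=0.9):
    seqs = [os.path.splitext(f)[0]
            for _, _, files in os.walk(path_db)
            for f in files if f.endswith(extension)]
    random.shuffle(seqs)
    n_train = int(train_ratio * len(seqs))
    with open('train_split.txt', 'w') as out:
        out.write('\n'.join(seqs[:n_train]))
    with open('val_split.txt', 'w') as out:
        out.write('\n'.join(seqs[n_train:]))

write_seq_lists('path/to/audio_files', '.flac')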

Custom architectures

The code allows you to train a wide range of architectures. For example, to train the CPC method as described in van den Oord's paper, just run:

python cpc/train.py --pathDB $PATH_AUDIO_FILES --pathCheckpoint $PATH_CHECKPOINT_DIR --pathTrain $TRAINING_SET --pathVal $VAL_SET --file_extension $EXTENSION --normMode batchNorm --rnnMode linear

Or, if you want to train a model with an FFD prediction network instead of a transformer:

python cpc/train.py --pathDB $PATH_AUDIO_FILES --pathCheckpoint $PATH_CHECKPOINT_DIR --pathTrain $TRAINING_SET --pathVal $VAL_SET --file_extension $EXTENSION --rnnMode ffd --schedulerRamp 10

The --schedulerRamp option adds a learning rate ramp at the beginning of the training: it barely affects the performance of a model with a transformer predictor, but is necessary with other models.
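
Conceptually, the ramp is a linear learning-rate warmup. The sketch below illustrates in plain PyTorch what a ramp of 10 epochs does; it is not the repository's exact scheduler.

import torch

# Illustrative warmup only (assumed behaviour): scale the learning rate
# linearly from 0 to its nominal value over the first `ramp` epochs,
# then keep it constant.
model = torch.nn.Linear(256, 256)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
ramp = 10
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda epoch: min(1.0, (epoch + 1) / ramp))

for epoch in range(20):
    # ... run one training epoch here ...
    scheduler.step()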

Launch cpc/train.py -h to see all the possible options.

How to restart a session

To restart a session from the last saved checkpoint just run

python cpc/train.py --pathCheckpoint $PATH_CHECKPOINT_DIR

How to run an evaluation session

All evaluation scripts can be found in cpc/eval/.

Linear separability:

After training, the CPC model can output high-level features for a variety of tasks. For an input audio file sampled at 16kHz, the provided baseline model will output 256-dimensional features every 10ms. We provide two linear separability tests, one for speakers and one for phonemes, in which a linear classifier is trained on top of the CPC features with aligned labels and evaluated on a held-out test set.
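
As a quick sanity check of that feature geometry (a sketch using only the numbers quoted above):

# 16kHz input, one 256-dimensional feature vector every 10ms.
SAMPLE_RATE = 16000      # Hz
FRAME_PERIOD_MS = 10
FEATURE_DIM = 256

samples_per_frame = SAMPLE_RATE * FRAME_PERIOD_MS // 1000  # 160 samples
frames_per_second = 1000 // FRAME_PERIOD_MS                # 100 frames

# A 4-second utterance therefore yields a 400 x 256 feature matrix.
print(4 * frames_per_second, FEATURE_DIM)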

Train / Val splits as well as phone alignments for librispeech-100h can be found here.

Speaker separability:

python cpc/eval/linear_separability.py $PATH_DB $TRAINING_SET $VAL_SET $CHECKPOINT_TO_LOAD --pathCheckpoint $PATH_CHECKPOINT

Phone separability:

python cpc/eval/linear_separability.py $PATH_DB $TRAINING_SET $VAL_SET $CHECKPOINT_TO_LOAD --pathCheckpoint $PATH_CHECKPOINT --pathPhone $PATH_TO_PHONE_LABELS

You can also concatenate the output features of several models by providing several checkpoints to the --load option. For example, the following command line:

python cpc/eval/linear_separability.py $PATH_DB $TRAINING_SET $VAL_SET model1.pt model2.pt --pathCheckpoint $PATH_CHECKPOINT

will evaluate the speaker separability of the concatenation of the features from model1 and model2.

--gru_level controls from which layer of the autoregressive part of CPC the features are extracted. By default, it is the last one.

Nullspaces:

To conduct the nullspace experiment, first classify speakers using two factorized matrices A (DIM_EMBEDDING x DIM_INBETWEEN) and B (DIM_INBETWEEN x SPEAKERS). You'll want to extract A', the nullspace of matrix A (of size DIM_EMBEDDING x (DIM_EMBEDDING - DIM_INBETWEEN)), to make the embeddings less sensitive to speakers.

python cpc/eval/linear_separability.py $PATH_DB $TRAINING_SET $VAL_SET $CHECKPOINT_TO_LOAD --pathCheckpoint $PATH_CHECKPOINT --mode speakers_factorized  --model cpc --dim_inter $DIM_INBETWEEN --gru_level 2

Next, you evaluate the phone and speaker separabilities of the embeddings from CPC projected into the nullspace A'.

python cpc/eval/linear_separability.py $PATH_DB $TRAINING_SET $VAL_SET $CHECKPOINT_TO_LOAD --pathCheckpoint $PATH_CHECKPOINT --mode phonemes_nullspace --model cpc --pathPhone $PATH_TO_PHONE_LABELS --path_speakers_factorized $PATH_CHECKPOINT_SPEAKERS_FACTORIZED --dim_inter $DIM_INBETWEEN --gru_level 2
python cpc/eval/linear_separability.py $PATH_DB $TRAINING_SET $VAL_SET $CHECKPOINT_TO_LOAD --pathCheckpoint $PATH_CHECKPOINT --mode speakers_nullspace --model cpc --path_speakers_factorized $PATH_CHECKPOINT_SPEAKERS_FACTORIZED --dim_inter $DIM_INBETWEEN --gru_level 2
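
For intuition about the projection step, here is a minimal NumPy sketch with random stand-ins for the trained matrix A and the embeddings; it is not the repository's implementation.

import numpy as np
from scipy.linalg import null_space

DIM_EMBEDDING, DIM_INBETWEEN = 256, 64   # hypothetical dimensions

# Stand-in for the first factor A of the trained speaker classifier.
A = np.random.randn(DIM_EMBEDDING, DIM_INBETWEEN)

# A' spans the left nullspace of A: every direction v with v @ A == 0.
A_prime = null_space(A.T)   # shape: (DIM_EMBEDDING, DIM_EMBEDDING - DIM_INBETWEEN)

# Project a batch of embeddings into the nullspace to suppress speaker information.
x = np.random.randn(10, DIM_EMBEDDING)
x_null = x @ A_prime        # shape: (10, 192)

# Sanity check: the nullspace component of x carries nothing along A.
assert np.allclose((x_null @ A_prime.T) @ A, 0, atol=1e-8)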

ABX score:

You can run the ABX score on the Zerospeech2017 dataset. To begin, download the dataset here. Then run the ABX evaluation on a given checkpoint with:

python ABX.py from_checkpoint $PATH_CHECKPOINT $PATH_ITEM_FILE $DATASET_PATH --seq_norm --strict --file_extension .wav --out $PATH_OUT

Where:

  • $PATH_CHECKPOINT is the path pointing to the checkpoint to evaluate
  • $PATH_ITEM_FILE is the path to the .item file containing the triplet annotations
  • $DATASET_PATH is the path to the directory containing the audio files
  • $PATH_OUT is the path to the directory into which the results should be dumped
  • --seq_norm normalizes each batch of features across the time channel before computing ABX
  • --strict forces each batch of features to contain exactly the same number of frames

Cross-lingual transfer

To begin, download the Common Voice datasets here; you will also need to download our phoneme annotations and our train / val / test splits for each language here. Then unzip your data at PATH_COMMON_VOICES. Unfortunately, the audio files in Common Voice don't have the same sampling rate as in Librispeech, so you'll need to convert them into 16kHz audio using the command:

DIR_CC=$PATH_COMMON_VOICES
for x in fr zh it ru nl sv es tr tt ky; do
    python cpc/eval/utils/adjust_sample_rate.py ${DIR_CC}/${x}/clips ${DIR_CC}/${x}/validated_phones_reduced.txt ${DIR_CC}/${x}/clips_16k
done

You can now run the experiments described in the paper. To begin, you must train the linear classifier. You will find below the instructions for the Spanish dataset: you can run the experiments on any other dataset in the same fashion.

Frozen features

To run the training on frozen features with the one-hour dataset, just run:

python cpc/eval/common_voices_eval.py train $PATH_COMMON_VOICES/es/clips_16k $PATH_COMMON_VOICES/es/validated_phones_reduced.txt $CHECKPOINT_TO_TEST --pathTrain $PATH_COMMON_VOICES/es/trainSeqs_1.0_uniform_new_version.txt  --pathVal $PATH_COMMON_VOICES/es/trainSeqs_1.0_uniform_new_version.txt --freeze -o $OUTPUT_DIR

Fine tuning

The command to run the fine-tuning experiments on the 5-hour dataset is quite similar. For example, for Spanish you need to run:

python cpc/eval/common_voices_eval.py train $PATH_COMMON_VOICES/es/clips_16k $PATH_COMMON_VOICES/es/validated_phones_reduced.txt $CHECKPOINT_TO_TEST --pathTrain $PATH_COMMON_VOICES/es/trainSeqs_5.0_uniform_new_version.txt --pathVal $PATH_COMMON_VOICES/es/trainSeqs_5.0_uniform_new_version.txt -o $OUTPUT_DIR

PER

Once the training is done, you can compute the associated phone error rate (PER) on the test subset. To do so, just run:

python cpc/eval/common_voices_eval.py per $OUTPUT_DIR --pathVal $PATH_COMMON_VOICES/es/testSeqs_uniform_new_version.txt --pathPhone $PATH_COMMON_VOICES/es/validated_phones_reduced.txt
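
For reference, the PER is the Levenshtein (edit) distance between the predicted and reference phone sequences, normalized by the length of the reference. A minimal sketch with made-up phone sequences:

# Levenshtein distance with a single-row dynamic-programming table.
def edit_distance(a, b):
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

ref = ['a', 'b', 'e', 'r', 't', 'o']   # made-up reference phones
hyp = ['a', 'b', 'r', 't', 'u', 'o']   # made-up prediction
per = edit_distance(hyp, ref) / len(ref)
print(f"PER = {per:.2f}")              # 2 edits / 6 phones = 0.33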

torch hub

This model is also available via torch.hub. For more details, have a look at hubconf.py.
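
For instance, loading the pretrained model might look like the sketch below; the entry-point name and the forward signature are assumptions to verify against hubconf.py.

import torch

# Assumed entry point 'CPC_audio'; check hubconf.py for the exact name.
model = torch.hub.load('facebookresearch/CPC_audio', 'CPC_audio', pretrained=True)
model.eval()

# One second of 16kHz audio: (batch, channel, samples).
x = torch.randn(1, 1, 16000)
with torch.no_grad():
    c_features, encoded, _ = model(x, None)   # assumed output layout
print(c_features.shape)                       # expected: (1, 100, 256)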

Citations

Please consider citing this project in your publications if it helps your research.

@misc{rivire2020unsupervised,
    title={Unsupervised pretraining transfers well across languages},
    author={Morgane Rivière and Armand Joulin and Pierre-Emmanuel Mazaré and Emmanuel Dupoux},
    year={2020},
    eprint={2002.02848},
    archivePrefix={arXiv},
    primaryClass={eess.AS}
}

License

CPC_audio is MIT licensed, as found in the LICENSE file.
