Unofficial PyTorch implementation of Google AI's VoiceFilter system

Overview

VoiceFilter

Note from Seung-won (2020.10.25)

Hi everyone! It's Seung-won from MINDs Lab, Inc. It's been a long time since I've released this open-source, and I didn't expect this repository to grab such a great amount of attention for a long time. I would like to thank everyone for giving such attention, and also Mr. Quan Wang (the first author of the VoiceFilter paper) for referring this project in his paper.

Actually, this project was done by me when it was only 3 months after I started studying deep learning & speech separation without a supervisor in the relevant field. Back then, I didn't know what is a power-law compression, and the correct way to validate/test the models. Now that I've spent more time on deep learning & speech since then (I also wrote a paper published at Interspeech 2020 😊 ), I can observe some obvious mistakes that I've made. Those issues were kindly raised by GitHub users; please refer to the Issues and Pull Requests for that. That being said, this repository can be quite unreliable, and I would like to remind everyone to use this code at their own risk (as specified in LICENSE).

Unfortunately, I can't afford extra time on revising this project or reviewing the Issues / Pull Requests. Instead, I would like to offer some pointers to newer, more reliable resources:

  • VoiceFilter-Lite: This is a newer version of VoiceFilter presented at Interspeech 2020, which is also written by Mr. Quan Wang (and his colleagues at Google). I highly recommend checking this paper, since it focused on a more realistic situation where VoiceFilter is needed.
  • List of VoiceFilter implementation available on GitHub: In March 2019, this repository was the only available open-source implementation of VoiceFilter. However, much better implementations that deserve more attention became available across GitHub. Please check them, and choose the one that meets your demand.
  • PyTorch Lightning: Back in 2019, I could not find a great deep-learning project template for myself, so I and my colleagues had used this project as a template for other new projects. For people who are searching for such project template, I would like to strongly recommend PyTorch Lightning. Even though I had done a lot of effort into developing my own template during 2019 (VoiceFilter -> RandWireNN -> MelNet -> MelGAN), I found PyTorch Lightning much better than my own template.

Thanks for reading, and I wish everyone good health during the global pandemic situation.

Best regards, Seung-won Park


Unofficial PyTorch implementation of Google AI's: VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking.

Result

  • Training took about 20 hours on AWS p3.2xlarge(NVIDIA V100).

Audio Sample

Metric

Median SDR Paper Ours
before VoiceFilter 2.5 1.9
after VoiceFilter 12.6 10.2

  • SDR converged at 10, which is slightly lower than paper's.

Dependencies

  1. Python and packages

    This code was tested on Python 3.6 with PyTorch 1.0.1. Other packages can be installed by:

    pip install -r requirements.txt
  2. Miscellaneous

    ffmpeg-normalize is used for resampling and normalizing wav files. See README.md of ffmpeg-normalize for installation.

Prepare Dataset

  1. Download LibriSpeech dataset

    To replicate VoiceFilter paper, get LibriSpeech dataset at http://www.openslr.org/12/. train-clear-100.tar.gz(6.3G) contains speech of 252 speakers, and train-clear-360.tar.gz(23G) contains 922 speakers. You may use either, but the more speakers you have in dataset, the more better VoiceFilter will be.

  2. Resample & Normalize wav files

    First, unzip tar.gz file to desired folder:

    tar -xvzf train-clear-360.tar.gz

    Next, copy utils/normalize-resample.sh to root directory of unzipped data folder. Then:

    vim normalize-resample.sh # set "N" as your CPU core number.
    chmod a+x normalize-resample.sh
    ./normalize-resample.sh # this may take long
  3. Edit config.yaml

    cd config
    cp default.yaml config.yaml
    vim config.yaml
  4. Preprocess wav files

    In order to boost training speed, perform STFT for each files before training by:

    python generator.py -c [config yaml] -d [data directory] -o [output directory] -p [processes to run]

    This will create 100,000(train) + 1000(test) data. (About 160G)

Train VoiceFilter

  1. Get pretrained model for speaker recognition system

    VoiceFilter utilizes speaker recognition system (d-vector embeddings). Here, we provide pretrained model for obtaining d-vector embeddings.

    This model was trained with VoxCeleb2 dataset, where utterances are randomly fit to time length [70, 90] frames. Tests are done with window 80 / hop 40 and have shown equal error rate about 1%. Data used for test were selected from first 8 speakers of VoxCeleb1 test dataset, where 10 utterances per each speakers are randomly selected.

    Update: Evaluation on VoxCeleb1 selected pair showed 7.4% EER.

    The model can be downloaded at this GDrive link.

  2. Run

    After specifying train_dir, test_dir at config.yaml, run:

    python trainer.py -c [config yaml] -e [path of embedder pt file] -m [name]

    This will create chkpt/name and logs/name at base directory(-b option, . in default)

  3. View tensorboardX

    tensorboard --logdir ./logs

  4. Resuming from checkpoint

    python trainer.py -c [config yaml] --checkpoint_path [chkpt/name/chkpt_{step}.pt] -e [path of embedder pt file] -m name

Evaluate

python inference.py -c [config yaml] -e [path of embedder pt file] --checkpoint_path [path of chkpt pt file] -m [path of mixed wav file] -r [path of reference wav file] -o [output directory]

Possible improvments

  • Try power-law compressed reconstruction error as loss function, instead of MSE. (See #14)

Author

Seungwon Park at MINDsLab ([email protected], [email protected])

License

Apache License 2.0

This repository contains codes adapted/copied from the followings:

Owner
MINDs Lab
MINDsLab provides AI platform and various AI engines based on deep machine learning.
MINDs Lab
Leaderboard and Visualization for RLCard

RLCard Showdown This is the GUI support for the RLCard project and DouZero project. RLCard-Showdown provides evaluation and visualization tools to hel

Data Analytics Lab at Texas A&M University 246 Dec 26, 2022
这是一个yolox-keras的源码,可以用于训练自己的模型。

YOLOX:You Only Look Once目标检测模型在Keras当中的实现 目录 性能情况 Performance 实现的内容 Achievement 所需环境 Environment 小技巧的设置 TricksSet 文件下载 Download 训练步骤 How2train 预测步骤 Ho

Bubbliiiing 64 Nov 10, 2022
KwaiRec: A Fully-observed Dataset for Recommender Systems (Density: Almost 100%)

KuaiRec: A Fully-observed Dataset for Recommender Systems (Density: Almost 100%) KuaiRec is a real-world dataset collected from the recommendation log

Chongming GAO (高崇铭) 70 Dec 28, 2022
This repository contains a Ruby API for utilizing TensorFlow.

tensorflow.rb Description This repository contains a Ruby API for utilizing TensorFlow. Linux CPU Linux GPU PIP Mac OS CPU Not Configured Not Configur

somatic labs 825 Dec 26, 2022
A disassembler for the RP2040 Programmable I/O State-machine!

piodisasm A disassembler for the RP2040 Programmable I/O State-machine! Usage Just run piodisasm.py on a file that contains the PIO code as hex! (Such

Ghidra Ninja 29 Dec 06, 2022
The official code of Anisotropic Stroke Control for Multiple Artists Style Transfer

ASMA-GAN Anisotropic Stroke Control for Multiple Artists Style Transfer Proceedings of the 28th ACM International Conference on Multimedia The officia

Six_God 146 Nov 21, 2022
Public Models considered for emotion estimation from EEG

Emotion-EEG Set of models for emotion estimation from EEG. Composed by the combination of two deep-learing models learning together (RNN and CNN) with

Victor Delvigne 21 Dec 23, 2022
An atmospheric growth and evolution model based on the EVo degassing model and FastChem 2.0

EVolve Linking planetary mantles to atmospheric chemistry through volcanism using EVo and FastChem. Overview EVolve is a linked mantle degassing and a

Pip Liggins 2 Jan 17, 2022
Bald-to-Hairy Translation Using CycleGAN

GANiry: Bald-to-Hairy Translation Using CycleGAN Official PyTorch implementation of GANiry. GANiry: Bald-to-Hairy Translation Using CycleGAN, Fidan Sa

Fidan Samet 10 Oct 27, 2022
Fast SHAP value computation for interpreting tree-based models

FastTreeSHAP FastTreeSHAP package is built based on the paper Fast TreeSHAP: Accelerating SHAP Value Computation for Trees published in NeurIPS 2021 X

LinkedIn 369 Jan 04, 2023
Self-training for Few-shot Transfer Across Extreme Task Differences

Self-training for Few-shot Transfer Across Extreme Task Differences (STARTUP) Introduction This repo contains the official implementation of the follo

Cheng Perng Phoo 33 Oct 31, 2022
Physics-Aware Training (PAT) is a method to train real physical systems with backpropagation.

Physics-Aware Training (PAT) is a method to train real physical systems with backpropagation. It was introduced in Wright, Logan G. & Onodera, Tatsuhiro et al. (2021)1 to train Physical Neural Networ

McMahon Lab 230 Jan 05, 2023
Some bravo or inspiring research works on the topic of curriculum learning.

Towards Scalable Unpaired Virtual Try-On via Patch-Routed Spatially-Adaptive GAN Official code for NeurIPS 2021 paper "Towards Scalable Unpaired Virtu

131 Jan 07, 2023
The official repository for "Score Transformer: Generating Musical Scores from Note-level Representation" (MMAsia '21)

Score Transformer This is the official repository for "Score Transformer": Score Transformer: Generating Musical Scores from Note-level Representation

22 Dec 22, 2022
Style transfer, deep learning, feature transform

FastPhotoStyle License Copyright (C) 2018 NVIDIA Corporation. All rights reserved. Licensed under the CC BY-NC-SA 4.0 license (https://creativecommons

NVIDIA Corporation 10.9k Jan 02, 2023
The code for MM2021 paper "Multi-Level Counterfactual Contrast for Visual Commonsense Reasoning"

The Code for MM2021 paper "Multi-Level Counterfactual Contrast for Visual Commonsense Reasoning" Setting up and using the repo Get the dataset. Follow

4 Apr 20, 2022
Neural style in TensorFlow! 🎨

neural-style An implementation of neural style in TensorFlow. This implementation is a lot simpler than a lot of the other ones out there, thanks to T

Anish Athalye 5.5k Dec 29, 2022
Tool which allow you to detect and translate text.

Text detection and recognition This repository contains tool which allow to detect region with text and translate it one by one. Description Two pretr

Damian Panek 176 Nov 28, 2022
BRNet - code for Automated assessment of BI-RADS categories for ultrasound images using multi-scale neural networks with an order-constrained loss function

BRNet code for "Automated assessment of BI-RADS categories for ultrasound images using multi-scale neural networks with an order-constrained loss func

Yong Pi 2 Mar 09, 2022
Spatial Single-Cell Analysis Toolkit

Single-Cell Image Analysis Package Scimap is a scalable toolkit for analyzing spatial molecular data. The underlying framework is generalizable to spa

Laboratory of Systems Pharmacology @ Harvard 30 Nov 08, 2022