Pseudo-Visual Speech Denoising

Last update: Oct 22, 2022

This code accompanies our paper, Visual Speech Enhancement Without A Real Visual Stream, published at WACV 2021.
Authors: Sindhu Hegde*, K R Prajwal*, Rudrabha Mukhopadhyay*, Vinay Namboodiri, C.V. Jawahar

📝 Paper	📑 Project Page	🛠 Demo Video	🗃 Real-World Test Set
Paper	Website	Video	Real-World Test Set (coming soon)

Features

Denoise any real-world audio/video and obtain the clean speech.
Works in unconstrained settings for any speaker in any language.
Takes only audio as input, but still benefits from lip movements by generating a synthetic visual stream.
Complete training and inference code is available.

Prerequisites

Python 3.7.4 (Code has been tested with this version)
ffmpeg: sudo apt-get install ffmpeg
Install necessary packages using pip install -r requirements.txt
The pre-trained face detection model should be downloaded to face_detection/detection/sfd/s3fd.pth
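
The prerequisites above can be verified with a small script before running anything else. This is a minimal sketch and not part of the repository: the function name check_prerequisites is hypothetical, and the only repository detail it assumes is the weight path listed above, taken relative to the repository root.

```python
import shutil
import sys
from pathlib import Path

# Path taken from the prerequisites list above, relative to the repository root.
S3FD_WEIGHTS = Path("face_detection/detection/sfd/s3fd.pth")


def check_prerequisites(repo_root="."):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    # The code has been tested with Python 3.7.4; other versions are untested.
    if sys.version_info[:2] != (3, 7):
        problems.append(
            "Python %d.%d detected; the code was tested with 3.7.4" % sys.version_info[:2]
        )
    # ffmpeg must be discoverable on PATH.
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH")
    # The face detection weights must sit at the expected location.
    weights = Path(repo_root) / S3FD_WEIGHTS
    if not weights.is_file():
        problems.append("missing face detection weights: %s" % weights)
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("WARNING:", problem)
```

Run it from the repository root; no output means every check passed. A Python version other than 3.7 is reported only as a warning, since the list above states what was tested rather than what is strictly required.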

Getting the weights

Model	Description	Link to the model
Denoising model	Weights of the denoising model (needed for inference)	Link
Lipsync student	Weights of the student lipsync model to generate the visual stream for noisy audio inputs (needed for inference)	Link
Wav2Lip teacher	Weights of the teacher lipsync model (only needed if you want to train the network from scratch)	Link

Denoising any audio/video using the pre-trained model (Inference)

You can denoise any noisy audio/video and obtain the clean speech of the target speaker using:

python inference.py --lipsync_student_model_path= --checkpoint_path= --input=

The result is saved (by default) in results/result.mp4. The result directory, like several other options, can be set through command-line arguments. The input can be any audio file (*.wav, *.mp3) or a video file, from which the code automatically extracts the audio before generating the clean speech. Note that the noise should not be human speech: this work tackles only denoising, not speaker separation.
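
When calling the inference script from other Python code, it can help to assemble the command programmatically. The sketch below uses only the three flags shown above; the helper name build_denoise_command and all file paths are hypothetical placeholders, not names from the repository.

```python
import shlex
import sys


def build_denoise_command(student_ckpt, denoiser_ckpt, input_file):
    """Assemble the inference.py call using only the three flags shown above."""
    return [
        sys.executable, "inference.py",
        "--lipsync_student_model_path=" + student_ckpt,
        "--checkpoint_path=" + denoiser_ckpt,
        "--input=" + input_file,
    ]


if __name__ == "__main__":
    # All three paths below are hypothetical placeholders.
    cmd = build_denoise_command(
        "checkpoints/lipsync_student.pth",
        "checkpoints/denoising.pth",
        "samples/noisy.wav",
    )
    # Print a shell-safe version of the command for inspection.
    print(" ".join(shlex.quote(part) for part in cmd))
```

To actually run the command, pass the returned list to subprocess.run(cmd, check=True) from the repository root. Passing a list rather than a single string avoids shell quoting problems with paths that contain spaces.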

Generating only the lip-movements for any given noisy audio/video

The synthetic visual stream (lip-movements) can be generated for any noisy audio/video using:

cd lipsync
python inference.py --checkpoint_path= --audio=

The result is saved (by default) in results/result_voice.mp4. The result directory, like several other options, can be set through command-line arguments. The input can be any audio file (*.wav, *.mp3) or a video file, from which the code automatically extracts the audio before generating the visual stream.

Training

We illustrate the training process using the LRS3 and VGGSound datasets. Adapting it to other datasets would involve small modifications to the code.

Preprocess the dataset

LRS3 train-val/pre-train dataset folder structure

data_root (we use both the train-val and pre-train sets of the LRS3 dataset in this work)
├── list of folders
│   ├── five-digit numbered video IDs ending with (.mp4)

Preprocess the dataset

python preprocess.py --data_root= --preprocessed_root=

Additional options, such as batch_size and the number of GPUs to use in parallel, can also be set.

Preprocessed LRS3 folder structure

preprocessed_root (lrs3_preprocessed)
├── list of folders
│   ├── Folders with five-digit numbered video IDs
│   │   ├── *.jpg (extracted face crops from each frame)
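
After preprocessing, it can be worth confirming that every video folder actually received face crops. The following is a minimal sketch that assumes exactly the two-level layout shown above; the function name find_empty_videos is hypothetical and not part of the repository.

```python
from pathlib import Path


def find_empty_videos(preprocessed_root):
    """Return video folders (two levels below the root) that contain no *.jpg face crops."""
    empty = []
    root = Path(preprocessed_root)
    # First level: the "list of folders" entries.
    for group in sorted(p for p in root.iterdir() if p.is_dir()):
        # Second level: one folder per video ID.
        for video in sorted(p for p in group.iterdir() if p.is_dir()):
            if not any(video.glob("*.jpg")):
                empty.append(video)
    return empty


if __name__ == "__main__":
    root = Path("lrs3_preprocessed")  # example name from the tree above
    if root.is_dir():
        for video in find_empty_videos(root):
            print("no face crops:", video)
```

Folders reported here had no *.jpg files written, for example because no frames were extracted or no face was detected; they may need to be re-processed or excluded before training.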

VGGSound folder structure

We use the VGGSound dataset as the noise source; its audio is mixed with the clean speech from the LRS3 dataset. We download the audio files (*.wav files) from here.

data_root (vgg_sound)
├── *.wav (audio files)
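
The README does not spell out how the noise is mixed with the clean speech. One common approach, shown here purely as an illustration, is to scale the noise to reach a target signal-to-noise ratio (SNR) before adding it. This sketch is an assumption, not the repository's mixing code; the function name mix_at_snr and the sample values are hypothetical.

```python
import math
import random


def mix_at_snr(speech, noise, snr_db):
    """Mix equal-length, non-empty sample lists so the speech-to-noise power ratio is snr_db."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if noise_power == 0:
        # Silent noise: nothing to scale, return the speech unchanged.
        return list(speech)
    # Choose gain so that speech_power / (gain**2 * noise_power) == 10**(snr_db / 10).
    gain = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


if __name__ == "__main__":
    rng = random.Random(0)
    speech = [math.sin(0.05 * i) for i in range(16000)]   # stand-in for clean speech
    noise = [rng.uniform(-1.0, 1.0) for _ in range(16000)]  # stand-in for a noise clip
    mixed = mix_at_snr(speech, noise, snr_db=0.0)
    print("mixed", len(mixed), "samples")
```

Lower snr_db values give noisier mixtures. With real data, the speech and noise would be loaded from the LRS3 and VGGSound audio and cropped to the same length first.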

Train!

There are two major steps: (i) Train the student-lipsync model, (ii) Train the Denoising model.

Train the Student-Lipsync model

Navigate to the lipsync folder: cd lipsync

The lipsync model can be trained using:

python train_student.py --data_root_lrs3_pretrain= --data_root_lrs3_train= --noise_data_root= --wav2lip_checkpoint_path= --checkpoint_dir=

Note: The pre-trained Wav2Lip teacher model must be downloaded (wav2lip weights) before training the student model.

Train the Denoising model!

Navigate to the main directory: cd ..

The denoising model can be trained using:

python train.py --data_root_lrs3_pretrain= --data_root_lrs3_train= --noise_data_root= --lipsync_student_model_path= --checkpoint_dir=

Training can also be resumed; see python train.py --help for details. Additional, less commonly used hyper-parameters can be set at the bottom of the audio/hparams.py file.
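
The two training steps can be scripted back to back. The sketch below only assembles the commands, using the flags shown above; the helper name build_training_commands and every path are hypothetical, and the name of the student checkpoint produced by step (i) is not stated here, so it is passed in explicitly.

```python
import shlex
import sys


def build_training_commands(pretrain_root, train_root, noise_root,
                            wav2lip_ckpt, student_ckpt_dir, student_ckpt,
                            denoiser_ckpt_dir):
    """Return (working_directory, argv) pairs for the two training steps, in order."""
    # Dataset flags shared by both training scripts.
    shared = [
        "--data_root_lrs3_pretrain=" + pretrain_root,
        "--data_root_lrs3_train=" + train_root,
        "--noise_data_root=" + noise_root,
    ]
    student = [sys.executable, "train_student.py"] + shared + [
        "--wav2lip_checkpoint_path=" + wav2lip_ckpt,
        "--checkpoint_dir=" + student_ckpt_dir,
    ]
    denoiser = [sys.executable, "train.py"] + shared + [
        "--lipsync_student_model_path=" + student_ckpt,
        "--checkpoint_dir=" + denoiser_ckpt_dir,
    ]
    # Step (i) runs inside lipsync/, step (ii) from the repository root.
    return [("lipsync", student), (".", denoiser)]


if __name__ == "__main__":
    # Hypothetical absolute paths; absolute paths avoid confusion between
    # the two working directories.
    steps = build_training_commands(
        "/data/lrs3_preprocessed/pretrain", "/data/lrs3_preprocessed/trainval",
        "/data/vgg_sound", "/ckpt/wav2lip.pth",
        "/ckpt/student", "/ckpt/student/student.pth", "/ckpt/denoiser",
    )
    for cwd, argv in steps:
        print("(cd %s && %s)" % (cwd, " ".join(shlex.quote(a) for a in argv)))
```

Each pair can be executed with subprocess.run(argv, cwd=cwd, check=True). Step (ii) should only start once step (i) has produced the student checkpoint it refers to.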

Evaluation

To be updated soon!

Licence and Citation

The software is licensed under the MIT License. Please cite the following paper if you have used this code:

@InProceedings{Hegde_2021_WACV,
    author    = {Hegde, Sindhu B. and Prajwal, K.R. and Mukhopadhyay, Rudrabha and Namboodiri, Vinay P. and Jawahar, C.V.},
    title     = {Visual Speech Enhancement Without a Real Visual Stream},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month     = {January},
    year      = {2021},
    pages     = {1926-1935}
}

Acknowledgements

Parts of the lipsync code have been adapted from our Wav2Lip repository. The audio functions and parameters are taken from this TTS repository. We thank the authors for this wonderful code. The code for face detection has been taken from the face_alignment repository. We thank the authors for releasing their code and models.
