PyTorch implementation of Lip to Speech Synthesis with Visual Context Attentional GAN (NeurIPS2021)

Overview

Lip to Speech Synthesis with Visual Context Attentional GAN

This repository contains the PyTorch implementation of the following paper:

Lip to Speech Synthesis with Visual Context Attentional GAN
Minsu Kim, Joanna Hong, and Yong Man Ro
[Paper] [Demo Video]

Preparation

Requirements

  • python 3.7
  • pytorch 1.6 ~ 1.8
  • torchvision
  • torchaudio
  • ffmpeg
  • av
  • tensorboard
  • scikit-image
  • pillow
  • librosa
  • pystoi
  • pesq
  • scipy

Datasets

Download

GRID dataset (video normal) can be downloaded from the below link.

For data preprocessing, download the face landmark of GRID from the below link.

Preprocessing

After download the dataset, preprocess the dataset with the following scripts in ./preprocess.
It supposes the data directory is constructed as

Data_dir
├── subject
|   ├── video
|   |   └── xxx.mpg
  1. Extract frames
    Extract_frames.py extract images and audio from the video.
python Extract_frames.py --Grid_dir "Data dir of GRID_corpus" --Out_dir "Output dir of images and audio of GRID_corpus"
  1. Align faces and audio processing
    Preprocess.py aligns faces and generates videos, which enables cropping the video lip-centered during training.
python Preprocess.py \
--Data_dir "Data dir of extracted images and audio of GRID_corpus" \
--Landmark "Downloaded landmark dir of GRID" \
--Output_dir "Output dir of processed data"

Training the Model

The speaker setting (different subject) can be selected by subject argument. Please refer to below examples.
To train the model, run following command:

# Data Parallel training example using 4 GPUs for multi-speaker setting in GRID
python train.py \
--grid 'enter_the_processed_data_path' \
--checkpoint_dir 'enter_the_path_to_save' \
--batch_size 88 \
--epochs 500 \
--subject 'overlap' \
--eval_step 720 \
--dataparallel \
--gpu 0,1,2,3
# 1 GPU training example for GRID for unseen-speaker setting in GRID
python train.py \
--grid 'enter_the_processed_data_path' \
--checkpoint_dir 'enter_the_path_to_save' \
--batch_size 22 \
--epochs 500 \
--subject 'unseen' \
--eval_step 1000 \
--gpu 0

Descriptions of training parameters are as follows:

  • --grid: Dataset location (grid)
  • --checkpoint_dir: directory for saving checkpoints
  • --checkpoint : saved checkpoint where the training is resumed from
  • --batch_size: batch size
  • --epochs: number of epochs
  • --augmentations: whether performing augmentation
  • --dataparallel: Use DataParallel
  • --subject: different speaker settings, s# is speaker specific training, overlap for multi-speaker setting, unseen for unseen-speaker setting, four for four speaker training
  • --gpu: gpu number for training
  • --lr: learning rate
  • --eval_step: steps for performing evaluation
  • --window_size: number of frames to be used for training
  • Refer to train.py for the other training parameters

The evaluation during training is performed for a subset of the validation dataset due to the heavy time costs of waveform conversion (griffin-lim).
In order to evaluate the entire performance of the trained model run the test code (refer to "Testing the Model" section).

check the training logs

tensorboard --logdir='./runs/logs to watch' --host='ip address of the server'

The tensorboard shows the training and validation loss, evaluation metrics, generated mel-spectrogram, and audio

Testing the Model

To test the model, run following command:

# Dataparallel test example for multi-speaker setting in GRID
python test.py \
--grid 'enter_the_processed_data_path' \
--checkpoint 'enter_the_checkpoint_path' \
--batch_size 100 \
--subject 'overlap' \
--save_mel \
--save_wav \
--dataparallel \
--gpu 0,1

Descriptions of training parameters are as follows:

  • --grid: Dataset location (grid)
  • --checkpoint : saved checkpoint where the training is resumed from
  • --batch_size: batch size
  • --dataparallel: Use DataParallel
  • --subject: different speaker settings, s# is speaker specific training, overlap for multi-speaker setting, unseen for unseen-speaker setting, four for four speaker training
  • --save_mel: whether to save the 'mel_spectrogram' and 'spectrogram' in .npz format
  • --save_wav: whether to save the 'waveform' in .wav format
  • --gpu: gpu number for training
  • Refer to test.py for the other parameters

Test Automatic Speech Recognition (ASR) results of generated results: WER

Transcription (Ground-truth) of GRID dataset can be downloaded from the below link.

move to the ASR_model directory

cd ASR_model/GRID

To evaluate the WER, run following command:

# test example for multi-speaker setting in GRID
python test.py \
--data 'enter_the_generated_data_dir (mel or wav) (ex. ./../../test/spec_mel)' \
--gtpath 'enter_the_downloaded_transcription_path' \
--subject 'overlap' \
--gpu 0

Descriptions of training parameters are as follows:

  • --data: Data for evaluation (wav or mel(.npz))
  • --wav : whether the data is waveform or not
  • --batch_size: batch size
  • --subject: different speaker settings, s# is speaker specific training, overlap for multi-speaker setting, unseen for unseen-speaker setting, four for four speaker training
  • --gpu: gpu number for training
  • Refer to ./ASR_model/GRID/test.py for the other parameters

Pre-trained ASR model checkpoint

Below lists are the pre-trained ASR model to evaluate the generated speech.
WER shows the original performances of the model on ground-truth audio.

Setting WER
GRID (constrained-speaker) 0.83 %
GRID (multi-speaker) 1.67 %
GRID (unseen-speaker) 0.37 %
LRW 1.54 %

Put the checkpoints in ./ASR_model/GRID/data for GRID, and in ./ASR_model/LRW/data for LRW.

Citation

If you find this work useful in your research, please cite the paper:

@article{kim2021vcagan,
  title={Lip to Speech Synthesis with Visual Context Attentional GAN},
  author={Kim, Minsu and Hong, Joanna and Ro, Yong Man},
  journal={Advances in Neural Information Processing Systems},
  volume={34},
  year={2021}
}
Simply enable or disable your Nvidia dGPU

EnvyControl (WIP) Simply enable or disable your Nvidia dGPU Usage First clone this repo and install envycontrol with sudo pip install . CLI Turn off y

Victor Bayas 292 Jan 03, 2023
Graph Self-Supervised Learning for Optoelectronic Properties of Organic Semiconductors

SSL_OSC Graph Self-Supervised Learning for Optoelectronic Properties of Organic Semiconductors

zaixizhang 2 May 14, 2022
People log into different sites every day to get information and browse through these sites one by one

HyperLink People log into different sites every day to get information and browse through these sites one by one. And they are exposed to advertisemen

0 Feb 17, 2022
Pytorch library for seismic data augmentation

Pytorch library for seismic data augmentation

Artemii Novoselov 27 Nov 22, 2022
Exploring Image Deblurring via Blur Kernel Space (CVPR'21)

Exploring Image Deblurring via Encoded Blur Kernel Space About the project We introduce a method to encode the blur operators of an arbitrary dataset

VinAI Research 118 Dec 19, 2022
A PyTorch implementation of the Relational Graph Convolutional Network (RGCN).

Torch-RGCN Torch-RGCN is a PyTorch implementation of the RGCN, originally proposed by Schlichtkrull et al. in Modeling Relational Data with Graph Conv

Thiviyan Singam 66 Nov 30, 2022
UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation

UnivNet UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation. Training python train.py --c

Rishikesh (ऋषिकेश) 55 Dec 26, 2022
[ICCV 2021] Our work presents a novel neural rendering approach that can efficiently reconstruct geometric and neural radiance fields for view synthesis.

MVSNeRF Project page | Paper This repository contains a pytorch lightning implementation for the ICCV 2021 paper: MVSNeRF: Fast Generalizable Radiance

Anpei Chen 529 Dec 30, 2022
New approach to benchmark VQA models

VQA Benchmarking This repository contains the web application & the python interface to evaluate VQA models. Documentation Please see the documentatio

4 Jul 25, 2022
PyTorch implementation of Munchausen Reinforcement Learning based on DQN and SAC. Handles discrete and continuous action spaces

Exploring Munchausen Reinforcement Learning This is the project repository of my team in the "Advanced Deep Learning for Robotics" course at TUM. Our

Mohamed Amine Ketata 10 Mar 10, 2022
GeneGAN: Learning Object Transfiguration and Attribute Subspace from Unpaired Data

GeneGAN: Learning Object Transfiguration and Attribute Subspace from Unpaired Data By Shuchang Zhou, Taihong Xiao, Yi Yang, Dieqiao Feng, Qinyao He, W

Taihong Xiao 141 Apr 16, 2021
Facial detection, landmark tracking and expression transfer library for Windows, Linux and Mac

Welcome to the CSIRO Face Analysis SDK. Documentation for the SDK can be found in doc/documentation.html. All code in this SDK is provided according t

Luiz Carlos Vieira 7 Jul 16, 2020
Dynamic Capacity Networks using Tensorflow

Dynamic Capacity Networks using Tensorflow Dynamic Capacity Networks (DCN; http://arxiv.org/abs/1511.07838) implementation using Tensorflow. DCN reduc

Taeksoo Kim 8 Feb 23, 2021
Official code for "EagerMOT: 3D Multi-Object Tracking via Sensor Fusion" [ICRA 2021]

EagerMOT: 3D Multi-Object Tracking via Sensor Fusion Read our ICRA 2021 paper here. Check out the 3 minute video for the quick intro or the full prese

Aleksandr Kim 276 Dec 30, 2022
这是一个利用facenet和retinaface实现人脸识别的库,可以进行在线的人脸识别。

Facenet+Retinaface:人脸识别模型在Pytorch当中的实现 目录 注意事项 Attention 所需环境 Environment 文件下载 Download 预测步骤 How2predict 参考资料 Reference 注意事项 该库中包含了两个网络,分别是retinaface和

Bubbliiiing 102 Dec 30, 2022
It's a powerful version of linebot

CTPS-FINAL Linbot-sever.py 主程式 Algorithm.py 推薦演算法,媒合餐廳端資料與顧客端資料 config.ini 儲存 channel-access-token、channel-secret 資料 Preface 生活在成大將近4年,我們每天的午餐時間看著形形色色

1 Oct 17, 2022
MVFNet: Multi-View Fusion Network for Efficient Video Recognition (AAAI 2021)

MVFNet: Multi-View Fusion Network for Efficient Video Recognition (AAAI 2021) Overview We release the code of the MVFNet (Multi-View Fusion Network).

2 Jan 29, 2022
Junction Tree Variational Autoencoder for Molecular Graph Generation (ICML 2018)

Junction Tree Variational Autoencoder for Molecular Graph Generation Official implementation of our Junction Tree Variational Autoencoder https://arxi

Wengong Jin 418 Jan 07, 2023
Some code of the implements of Geological Modeling Using 3D Pixel-Adaptive and Deformable Convolutional Neural Network

3D-GMPDCNN Geological Modeling Using 3D Pixel-Adaptive and Deformable Convolutional Neural Network PyTorch implementation of "Geological Modeling Usin

5 Nov 21, 2022
E2EC: An End-to-End Contour-based Method for High-Quality High-Speed Instance Segmentation

E2EC: An End-to-End Contour-based Method for High-Quality High-Speed Instance Segmentation E2EC: An End-to-End Contour-based Method for High-Quality H

zhangtao 146 Dec 29, 2022