Code for Talking Face Generation by Adversarially Disentangled Audio-Visual Representation (AAAI 2019)

Last update: Dec 23, 2022

Overview

Talking Face Generation by Adversarially Disentangled Audio-Visual Representation (AAAI 2019)

We propose the Disentangled Audio-Visual System (DAVS) for arbitrary-subject talking face generation: synthesizing a sequence of face images that matches given speech semantics, conditioned on either unconstrained speech audio or video.

[Project] [Paper] [Demo]

Recommendation of our CVPR21 repo

This repo is barely maintained, since this version of the code is out of date. If you are interested in Talking Face Generation, feel free to try the CODE of our CVPR2021 PAPER!

Requirements

python 2.7
PyTorch (we use version 0.2.0)
opencv2

Generating test results

Download the pre-trained model checkpoint

Create the default folder "checkpoints" and put the checkpoint in it, or note the checkpoint's location as CHECKPOINT_PATH.

Samples for testing can be found in the folder named 0572_0019_0003, a pre-processed sample from the VoxCeleb dataset.
Run the testing script to generate videos from video:

python test_all.py  --test_root ./0572_0019_0003/video --test_type video --test_audio_video_length 99 --test_resume_path CHECKPOINT_PATH

Run the testing script to generate videos from audio:

python test_all.py  --test_root ./0572_0019_0003/audio --test_type audio --test_audio_video_length 99 --test_resume_path CHECKPOINT_PATH
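
The two commands above differ only in the test type and the matching sample subfolder, so they can be assembled from one helper. The following is a minimal sketch, not part of the repo: the helper names `build_test_command` and `run_both` are hypothetical, and only the flags shown in the commands above are used.

```python
import subprocess


def build_test_command(test_type, sample_root, checkpoint_path, length=99):
    """Assemble the test_all.py command line for one test type.

    Only the flags that appear in the README commands are used here.
    """
    if test_type not in ("video", "audio"):
        raise ValueError("test_type must be 'video' or 'audio'")
    return [
        "python", "test_all.py",
        "--test_root", "{}/{}".format(sample_root, test_type),
        "--test_type", test_type,
        "--test_audio_video_length", str(length),
        "--test_resume_path", checkpoint_path,
    ]


def run_both(sample_root, checkpoint_path):
    """Run the video-driven test, then the audio-driven test (hypothetical helper)."""
    for test_type in ("video", "audio"):
        subprocess.check_call(
            build_test_command(test_type, sample_root, checkpoint_path))
```

Calling `run_both("./0572_0019_0003", CHECKPOINT_PATH)` from the repo root would issue the same two commands in sequence, stopping at the first one that exits with an error.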

Sample Results

Talking Effect on Human Characters

Talking Effect on Non-human Characters (Trained on Human Faces Only)

Create more samples

The face detection tool used in the demo videos can be found at RSA. It returns a MAT-file with 5 key point locations in a row for each image. Other face alignment methods, such as dlib, are also applicable. The key points we use for face alignment are the two eye centers and the average point of the mouth corners. Given each image's PATH and its face POINTS, you can find our way of face alignment in preprocess/face_align.py.
Our audio preprocessing is the same as, and borrowed from, the MATLAB code of SyncNet. We then save the MFCC features into bin files.
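
To illustrate the key point selection described above, here is a minimal sketch that reduces five landmarks to the three alignment points and fits an affine transform to a target layout. It is not the code in preprocess/face_align.py. The landmark order (left eye, right eye, nose, left mouth corner, right mouth corner), the `TEMPLATE` coordinates, and the choice of a three-point affine fit are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical target positions (x, y) in a 256x256 crop; not the repo's values.
TEMPLATE = np.array([[89.0, 100.0], [167.0, 100.0], [128.0, 180.0]])


def alignment_points(five_points):
    """Reduce 5 landmarks, given as a (5, 2) array of (x, y), to 3 alignment points.

    Assumed order: left eye, right eye, nose, left mouth corner, right mouth corner.
    Returns the two eye centers and the average of the two mouth corners.
    """
    pts = np.asarray(five_points, dtype=np.float64)
    if pts.shape != (5, 2):
        raise ValueError("expected a (5, 2) array of landmarks")
    mouth_center = (pts[3] + pts[4]) / 2.0
    return np.stack([pts[0], pts[1], mouth_center])


def estimate_affine(src, dst):
    """Return the 2x3 affine matrix that maps 3 source points onto 3 target points."""
    src = np.asarray(src, dtype=np.float64)
    dst = np.asarray(dst, dtype=np.float64)
    homogeneous = np.hstack([src, np.ones((3, 1))])  # rows are [x, y, 1]
    # Solve homogeneous . M^T = dst for the 3x2 matrix M^T, then transpose.
    return np.linalg.solve(homogeneous, dst).T
```

The resulting 2x3 matrix has the layout that OpenCV's `cv2.warpAffine` expects, so it could be applied to an image to produce an aligned crop. The three source points must not be collinear, otherwise the linear system has no unique solution.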

Preparing Training Data

We used the LRW dataset for training.
The directories are arranged like this:

data
├── train, val, test
│   ├── 0, 1, 2 ... 499 (one folder for each class)
│   │   ├── 0, 1, 2 ... #videos per class
│   │   │   ├── align_face256
│   │   │   │   ├── 0, 1, ... 28.jpg
│   │   │   ├── mfcc20
│   │   │   │   ├── 2, 3 ... 26.bin

where each video is extracted to frames and aligned using our protocol, and each audio is processed and saved using Matlab.
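
Before training, it can help to check that every sample folder follows this layout. The sketch below walks one split and reports missing files. It is not part of the repo: the function names are hypothetical, and the index ranges (frames 0 to 28, MFCC files 2 to 26) are read off the tree above.

```python
import os


def list_samples(split_dir):
    """Yield (class_id, video_id, sample_dir) for each sample folder in one split."""
    for class_id in sorted(os.listdir(split_dir)):
        class_dir = os.path.join(split_dir, class_id)
        if not os.path.isdir(class_dir):
            continue
        for video_id in sorted(os.listdir(class_dir)):
            sample_dir = os.path.join(class_dir, video_id)
            if os.path.isdir(sample_dir):
                yield class_id, video_id, sample_dir


def missing_files(sample_dir, frame_ids=range(0, 29), mfcc_ids=range(2, 27)):
    """Return the expected frame and MFCC files that are absent from a sample folder."""
    expected = [os.path.join("align_face256", "{}.jpg".format(i)) for i in frame_ids]
    expected += [os.path.join("mfcc20", "{}.bin".format(i)) for i in mfcc_ids]
    return [rel for rel in expected
            if not os.path.isfile(os.path.join(sample_dir, rel))]
```

For example, looping over `list_samples("data/train")` and printing every sample whose `missing_files` result is non-empty gives a quick list of folders that need to be re-processed. Folder names are sorted as strings, so the iteration order is lexicographic rather than numeric.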

Training

python train.py

This is still a beta version of the training code, which only disentangles word-ID (wid) information from the person-ID (pid) space. Running train.py alone might not fully reproduce the paper. However, it can serve as a reference for how we implement the whole training process.
In our own implementation, the classification part (without generation and disentanglement) is pretrained first. The pretraining code is temporarily not provided.

Postprocessing Details (Optional)

The directly generated results may suffer from a "zoom-in-and-out" effect, which we assume is caused by our alignment of the training set. In the demos, we remove this instability using Subspace Video Stabilization.

License and Citation

The use of this software is RESTRICTED to non-commercial research and educational purposes.

@inproceedings{zhou2019talking,
  title     = {Talking Face Generation by Adversarially Disentangled Audio-Visual Representation},
  author    = {Zhou, Hang and Liu, Yu and Liu, Ziwei and Luo, Ping and Wang, Xiaogang},
  booktitle = {AAAI Conference on Artificial Intelligence (AAAI)},
  year      = {2019},
}

Acknowledgement

The structure of this codebase is borrowed from pix2pix.
