Stitch it in Time: GAN-Based Facial Editing of Real Videos

Related tags

Deep LearningSTIT
Overview

STIT - Stitch it in Time

arXiv CGP WAI

[Project Page]

Stitch it in Time: GAN-Based Facial Editing of Real Videos
Rotem Tzaban, Ron Mokady, Rinon Gal, Amit Bermano, Daniel Cohen-Or

Abstract:
The ability of Generative Adversarial Networks to encode rich semantics within their latent space has been widely adopted for facial image editing. However, replicating their success with videos has proven challenging. Sets of high-quality facial videos are lacking, and working with videos introduces a fundamental barrier to overcome - temporal coherency. We propose that this barrier is largely artificial. The source video is already temporally coherent, and deviations from this state arise in part due to careless treatment of individual components in the editing pipeline. We leverage the natural alignment of StyleGAN and the tendency of neural networks to learn low frequency functions, and demonstrate that they provide a strongly consistent prior. We draw on these insights and propose a framework for semantic editing of faces in videos, demonstrating significant improvements over the current state-of-the-art. Our method produces meaningful face manipulations, maintains a higher degree of temporal consistency, and can be applied to challenging, high quality, talking head videos which current methods struggle with.

Requirements

Pytorch(tested with 1.10, should work with 1.8/1.9 as well) + torchvision

For the rest of the requirements, run:

pip install Pillow imageio imageio-ffmpeg dlib face-alignment opencv-python click wandb tqdm scipy matplotlib clip lpips 

Pretrained models

In order to use this project you need to download pretrained models from the following Link.

Unzip it inside the project's main directory.

You can use the download_models.sh script (requires installing gdown with pip install gdown)

Alternatively, you can unzip the models to a location of your choice and update configs/path_config.py accordingly.

Splitting videos into frames

Our code expects videos in the form of a directory with individual frame images. To produce such a directory from an existing video, we recommend using ffmpeg:

ffmpeg -i "video.mp4" "video_frames/out%04d.png"

Example Videos

The videos used to produce our results can be downloaded from the following Link.

Inversion

To invert a video run:

python train.py --input_folder /path/to/images_dir \ 
 --output_folder /path/to/experiment_dir \
 --run_name RUN_NAME \
 --num_pti_steps NUM_STEPS

This includes aligning, cropping, e4e encoding and PTI

For example:

python train.py --input_folder /data/obama \ 
 --output_folder training_results/obama \
 --run_name obama \
 --num_pti_steps 80

Weights and biases logging is disabled by default. to enable, add --use_wandb

Naive Editing

To run edits without stitching tuning:

python edit_video.py --input_folder /path/to/images_dir \ 
 --output_folder /path/to/experiment_dir \
 --run_name RUN_NAME \
 --edit_name EDIT_NAME \
 --edit_range EDIT_RANGE \  

edit_range determines the strength of the edits applied. It should be in the format RANGE_START RANGE_END RANGE_STEPS.
for example, if we use --edit_range 1 5 2, we will apply edits with strength 1, 3 and 5.

For young Obama use:

python edit_video.py --input_folder /data/obama \ 
 --output_folder edits/obama/ \
 --run_name obama \
 --edit_name age \
 --edit_range -8 -8 1 \  

Editing + Stitching Tuning

To run edits with stitching tuning:

python edit_video_stitching_tuning.py --input_folder /path/to/images_dir \ 
 --output_folder /path/to/experiment_dir \
 --run_name RUN_NAME \
 --edit_name EDIT_NAME \
 --edit_range EDIT_RANGE \
 --outer_mask_dilation MASK_DILATION

We support early breaking the stitching tuning process, when the loss reaches a specified threshold.
This enables us to perform more iterations for difficult frames while maintaining a reasonable running time.
To use this feature, add --border_loss_threshold THRESHOLD to the command(Shown in the Jim and Kamala Harris examples below).
For videos with a simple background to reconstruct (e.g Obama, Jim, Emma Watson, Kamala Harris), we use THRESHOLD=0.005.
For videos where a more exact reconstruction of the background is required (e.g Michael Scott), we use THRESHOLD=0.002.
Early breaking is disabled by default.

For young Obama use:

python edit_video_stitching_tuning.py --input_folder /data/obama \ 
 --output_folder edits/obama/ \
 --run_name obama \
 --edit_name age \
 --edit_range -8 -8 1 \  
 --outer_mask_dilation 50

For gender editing on Obama use:

python edit_video_stitching_tuning.py --input_folder /data/obama \ 
 --output_folder edits/obama/ \
 --run_name obama \
 --edit_name gender \
 --edit_range -6 -6 1 \  
 --outer_mask_dilation 50

For young Emma Watson use:

python edit_video_stitching_tuning.py --input_folder /data/emma_watson \ 
 --output_folder edits/emma_watson/ \
 --run_name emma_watson \
 --edit_name age \
 --edit_range -8 -8 1 \  
 --outer_mask_dilation 50

For smile removal on Emma Watson use:

python edit_video_stitching_tuning.py --input_folder /data/emma_watson \ 
 --output_folder edits/emma_watson/ \
 --run_name emma_watson \
 --edit_name smile \
 --edit_range -3 -3 1 \  
 --outer_mask_dilation 50

For Emma Watson lipstick editing use: (done with styleclip global direction)

python edit_video_stitching_tuning.py --input_folder /data/emma_watson \ 
 --output_folder edits/emma_watson/ \
 --run_name emma_watson \
 --edit_type styleclip_global \
 --edit_name lipstick \
 --neutral_class "Face" \
 --target_class "Face with lipstick" \
 --beta 0.2 \
 --edit_range 10 10 1 \  
 --outer_mask_dilation 50

For Old + Young Jim use (with early breaking):

python edit_video_stitching_tuning.py --input_folder datasets/jim/ \
 --output_folder edits/jim \
 --run_name jim \
 --edit_name age \
 --edit_range -8 8 2 \
 --outer_mask_dilation 50 \ 
 --border_loss_threshold 0.005

For smiling Kamala Harris:

python edit_video_stitching_tuning.py \
 --input_folder datasets/kamala/ \ 
 --output_folder edits/kamala \
 --run_name kamala \
 --edit_name smile \
 --edit_range 2 2 1 \
 --outer_mask_dilation 50 \
 --border_loss_threshold 0.005

Example Results

With stitching tuning:

out.mp4

Without stitching tuning:

out.mp4

Gender editing:

out.mp4

Young Emma Watson:

out.mp4

Emma Watson with lipstick:

out.mp4

Emma Watson smile removal:

out.mp4

Old Jim:

out.mp4

Young Jim:

out.mp4

Smiling Kamala Harris:

out.mp4

Out of domain video editing (Animations)

For editing out of domain videos, Some different parameters are required while training. First, dlib's face detector doesn't detect all animated faces, so we use a different face detector provided by the face_alignment package. Second, we reduce the smoothing of the alignment parameters with --center_sigma 0.0 Third, OOD videos require more training steps, as they are more difficult to invert.

To train, we use:

python train.py --input_folder datasets/ood_spiderverse_gwen/ \
 --output_folder training_results/ood \
 --run_name ood \
 --num_pti_steps 240 \
 --use_fa \
 --center_sigma 0.0

Afterwards, editing is performed the same way:

python edit_video.py --input_folder datasets/ood_spiderverse_gwen/ \
 --output_folder edits/ood --run_name ood \
 --edit_name smile --edit_range 2 2 1

out.mp4

python edit_video.py --input_folder datasets/ood_spiderverse_gwen/ \
 --output_folder edits/ood \
 --run_name ood \
 --edit_type styleclip_global
 --edit_range 10 10 1
 --edit_name lipstick
 --target_class 'Face with lipstick'

out.mp4

Credits:

StyleGAN2-ada model and implementation:
https://github.com/NVlabs/stylegan2-ada-pytorch Copyright © 2021, NVIDIA Corporation.
Nvidia Source Code License https://nvlabs.github.io/stylegan2-ada-pytorch/license.html

PTI implementation:
https://github.com/danielroich/PTI
Copyright (c) 2021 Daniel Roich
License (MIT) https://github.com/danielroich/PTI/blob/main/LICENSE

LPIPS model and implementation:
https://github.com/richzhang/PerceptualSimilarity
Copyright (c) 2020, Sou Uchida
License (BSD 2-Clause) https://github.com/richzhang/PerceptualSimilarity/blob/master/LICENSE

e4e model and implementation:
https://github.com/omertov/encoder4editing Copyright (c) 2021 omertov
License (MIT) https://github.com/omertov/encoder4editing/blob/main/LICENSE

StyleCLIP model and implementation:
https://github.com/orpatashnik/StyleCLIP Copyright (c) 2021 orpatashnik
License (MIT) https://github.com/orpatashnik/StyleCLIP/blob/main/LICENSE

StyleGAN2 Distillation for Feed-forward Image Manipulation - for editing directions:
https://github.com/EvgenyKashin/stylegan2-distillation
Copyright (c) 2019, Yandex LLC
License (Creative Commons NonCommercial) https://github.com/EvgenyKashin/stylegan2-distillation/blob/master/LICENSE

face-alignment Library:
https://github.com/1adrianb/face-alignment
Copyright (c) 2017, Adrian Bulat
License (BSD 3-Clause License) https://github.com/1adrianb/face-alignment/blob/master/LICENSE

face-parsing.PyTorch:
https://github.com/zllrunning/face-parsing.PyTorch
Copyright (c) 2019 zll
License (MIT) https://github.com/zllrunning/face-parsing.PyTorch/blob/master/LICENSE

Citation

If you make use of our work, please cite our paper:

@misc{tzaban2022stitch,
      title={Stitch it in Time: GAN-Based Facial Editing of Real Videos},
      author={Rotem Tzaban and Ron Mokady and Rinon Gal and Amit H. Bermano and Daniel Cohen-Or},
      year={2022},
      eprint={2201.08361},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
Official Pytorch implementation for video neural representation (NeRV)

NeRV: Neural Representations for Videos (NeurIPS 2021) Project Page | Paper | UVG Data Hao Chen, Bo He, Hanyu Wang, Yixuan Ren, Ser-Nam Lim, Abhinav S

hao 214 Dec 28, 2022
In this tutorial, you will perform inference across 10 well-known pre-trained object detectors and fine-tune on a custom dataset. Design and train your own object detector.

Object Detection Object detection is a computer vision task for locating instances of predefined objects in images or videos. In this tutorial, you wi

Ibrahim Sobh 62 Dec 25, 2022
Pytorch implementation for "Large-Scale Long-Tailed Recognition in an Open World" (CVPR 2019 ORAL)

Large-Scale Long-Tailed Recognition in an Open World [Project] [Paper] [Blog] Overview Open Long-Tailed Recognition (OLTR) is the author's re-implemen

Zhongqi Miao 761 Dec 26, 2022
[ICLR 2021] Is Attention Better Than Matrix Decomposition?

Enjoy-Hamburger 🍔 Official implementation of Hamburger, Is Attention Better Than Matrix Decomposition? (ICLR 2021) Under construction. Introduction T

Gsunshine 271 Dec 29, 2022
[ICCV'21] UNISURF: Unifying Neural Implicit Surfaces and Radiance Fields for Multi-View Reconstruction

UNISURF: Unifying Neural Implicit Surfaces and Radiance Fields for Multi-View Reconstruction Project Page | Paper | Supplementary | Video This reposit

331 Dec 28, 2022
An implementation of the Contrast Predictive Coding (CPC) method to train audio features in an unsupervised fashion.

CPC_audio This code implements the Contrast Predictive Coding algorithm on audio data, as described in the paper Unsupervised Pretraining Transfers we

Meta Research 283 Dec 30, 2022
This is the official Pytorch implementation of "Lung Segmentation from Chest X-rays using Variational Data Imputation", Raghavendra Selvan et al. 2020

README This is the official Pytorch implementation of "Lung Segmentation from Chest X-rays using Variational Data Imputation", Raghavendra Selvan et a

Raghav 42 Dec 15, 2022
Code Repo for the ACL21 paper "Common Sense Beyond English: Evaluating and Improving Multilingual LMs for Commonsense Reasoning"

Common Sense Beyond English: Evaluating and Improving Multilingual LMs for Commonsense Reasoning This is the Github repository of our paper, "Common S

INK Lab @ USC 19 Nov 30, 2022
Neural-PIL: Neural Pre-Integrated Lighting for Reflectance Decomposition - NeurIPS2021

Neural-PIL: Neural Pre-Integrated Lighting for Reflectance Decomposition Project Page | Video | Paper Implementation for Neural-PIL. A novel method wh

Computergraphics (University of Tübingen) 64 Dec 29, 2022
Source code for Transformer-based Multi-task Learning for Disaster Tweet Categorisation (UCD's participation in TREC-IS 2020A, 2020B and 2021A).

Source code for "UCD participation in TREC-IS 2020A, 2020B and 2021A". *** update at: 2021/05/25 This repo so far relates to the following work: Trans

Congcong Wang 4 Oct 19, 2021
Winners of the Facebook Image Similarity Challenge

Winners of the Facebook Image Similarity Challenge

DrivenData 111 Jan 05, 2023
Training vision models with full-batch gradient descent and regularization

Stochastic Training is Not Necessary for Generalization -- Training competitive vision models without stochasticity This repository implements trainin

Jonas Geiping 32 Jan 06, 2023
Alignment Attention Fusion framework for Few-Shot Object Detection

AAF framework Framework generalities This repository contains the code of the AAF framework proposed in this paper. The main idea behind this work is

Pierre Le Jeune 20 Dec 16, 2022
Deep deconfounded recommender (Deep-Deconf) for paper "Deep causal reasoning for recommendations"

Deep Causal Reasoning for Recommender Systems The codes are associated with the following paper: Deep Causal Reasoning for Recommendations, Yaochen Zh

Yaochen Zhu 22 Oct 15, 2022
Pun Detection and Location

Pun Detection and Location “The Boating Store Had Its Best Sail Ever”: Pronunciation-attentive Contextualized Pun Recognition Yichao Zhou, Jyun-yu Jia

lawson 3 May 13, 2022
Benchmarking Pipeline for Prediction of Protein-Protein Interactions

B4PPI Benchmarking Pipeline for the Prediction of Protein-Protein Interactions How this benchmarking pipeline has been built, and how to use it, is de

Loïc Lannelongue 4 Jun 27, 2022
LSTMs (Long Short Term Memory) RNN for prediction of price trends

Price Prediction with Recurrent Neural Networks LSTMs BTC-USD price prediction with deep learning algorithm. Artificial Neural Networks specifically L

5 Nov 12, 2021
This is the official implement of paper "ActionCLIP: A New Paradigm for Action Recognition"

This is an official pytorch implementation of ActionCLIP: A New Paradigm for Video Action Recognition [arXiv] Overview Content Prerequisites Data Prep

268 Jan 09, 2023
Easy to use and customizable SOTA Semantic Segmentation models with abundant datasets in PyTorch

Semantic Segmentation Easy to use and customizable SOTA Semantic Segmentation models with abundant datasets in PyTorch Features Applicable to followin

sithu3 530 Jan 05, 2023
This project contains an implemented version of Face Detection using OpenCV and Mediapipe. This is a code snippet and can be used in projects.

Live-Face-Detection Project Description: In this project, we will be using the live video feed from the camera to detect Faces. It will also detect so

Hassan Shahzad 3 Oct 02, 2021