Video Autoencoder: self-supervised disentanglement of 3D structure and motion

Overview

Video Autoencoder: self-supervised disentanglement of 3D structure and motion

This repository contains the code (in PyTorch) for the model introduced in the following paper:

Video Autoencoder: self-supervised disentanglement of 3D structure and motion
Zihang Lai, Sifei Liu, Alexi A. Efros, Xiaolong Wang
ICCV, 2021
[Paper] [Project Page] [12-min oral pres. video] [3-min supplemental video]

Figure

Citation

@inproceedings{Lai21a,
        title={Video Autoencoder: self-supervised disentanglement of 3D structure and motion},
        author={Lai, Zihang and Liu, Sifei and Efros, Alexei A and Wang, Xiaolong},
        booktitle={ICCV},
        year={2021}
}

Contents

  1. Introduction
  2. Data preparation
  3. Training
  4. Evaluation
  5. Pretrained model

Introduction

Figure We present Video Autoencoder for learning disentangled representations of 3D structure and camera pose from videos in a self-supervised manner. Relying on temporal continuity in videos, our work assumes that the 3D scene structure in nearby video frames remains static. Given a sequence of video frames as input, the Video Autoencoder extracts a disentangled representation of the scene including: (i) a temporally-consistent deep voxel feature to represent the 3D structure and (ii) a 3D trajectory of camera poses for each frame. These two representations will then be re-entangled for rendering the input video frames. Video Autoencoder can be trained directly using a pixel reconstruction loss, without any ground truth 3D or camera pose annotations. The disentangled representation can be applied to a range of tasks, including novel view synthesis, camera pose estimation, and video generation by motion following. We evaluate our method on several large-scale natural video datasets, and show generalization results on out-of-domain images.

Dependencies

The following dependencies are not strict - they are the versions that we use.

Data preparation

RealEstate10K:

  1. Download the dataset from RealEstate10K.
  2. Download videos from RealEstate10K dataset, decode videos into frames. You might find the RealEstate10K_Downloader written by cashiwamochi helpful. Organize the data files into the following structure:
RealEstate10K/
    train/
        0000cc6d8b108390.txt
        00028da87cc5a4c4.txt
        ...
    test/
        000c3ab189999a83.txt
        000db54a47bd43fe.txt
        ...
dataset/
    train/
        0000cc6d8b108390/
            52553000.jpg
            52586000.jpg
            ...
        00028da87cc5a4c4/
            ...
    test/
        000c3ab189999a83/
        ...
  1. Subsample the training set at one-third of the original frame-rate (so that the motion is sufficiently large). You can use scripts/subsample_dataset.py.
  2. A list of videos ids that we used (10K for training and 5K for testing) is provided here:
    1. Training video ids and testing video ids.
    2. Note: as time changes, the availability of videos could change.

Matterport 3D (this could be tricky):

  1. Install habitat-api and habitat-sim. You need to use the following repo version (see this SynSin issue for details):

    1. habitat-sim: d383c2011bf1baab2ce7b3cd40aea573ad2ddf71
    2. habitat-api: e94e6f3953fcfba4c29ee30f65baa52d6cea716e
  2. Download the models from the Matterport3D dataset and the point nav datasets. You should have a dataset folder with the following data structure:

    root_folder/
         mp3d/
             17DRP5sb8fy/
                 17DRP5sb8fy.glb  
                 17DRP5sb8fy.house  
                 17DRP5sb8fy.navmesh  
                 17DRP5sb8fy_semantic.ply
             1LXtFkjw3qL/
                 ...
             1pXnuDYAj8r/
                 ...
             ...
         pointnav/
             mp3d/
                 ...
    
  3. Walk-through videos for pretraining: We use a ShortestPathFollower function provided by the Habitat navigation package to generate episodes of tours of the rooms. See scripts/generate_matterport3d_videos.py for details.

  4. Training and testing view synthesis pairs: we generally follow the same steps as the SynSin data instruction. The main difference is that we precompute all the image pairs. See scripts/generate_matterport3d_train_image_pairs.py and scripts/generate_matterport3d_test_image_pairs.py for details.

###Replica:

  1. Testing view synthesis pairs: This procedure is similar to step 4 in Matterport3D - with only the specific dataset changed. See scripts/generate_replica_test_image_pairs.py for details.

Configurations

Finally, change the data paths in configs/dataset.yaml to your data location.

Pre-trained models

  • Pre-trained model (RealEstate10K): Link
  • Pre-trained model (Matterport3D): Link

Training:

Use this script:

CUDA_VISIBLE_DEVICES=0,1 python train.py --savepath log/train --dataset RealEstate10K

Some optional commands (w/ default value in square bracket):

  • Select dataset: --dataset [RealEstate10K]
  • Interval between clip frames: --interval [1]
  • Change clip length: --clip_length [6]
  • Increase/decrease lr step: --lr_adj [1.0]
  • For Matterport3D finetuning, you need to set --clip_length 2, because the data are pairs of images.

Evaluation:

1. Generate test results:

Use this script (for testing RealEstate10K):

CUDA_VISIBLE_DEVICES=0 python test_re10k.py --savepath log/model --resume log/model/checkpoint.tar --dataset RealEstate10K

or this script (for testing Matterport3D/Replica):

CUDA_VISIBLE_DEVICES=0 python test_mp3d.py --savepath log/model --resume log/model/checkpoint.tar --dataset Matterport3D

Some optional commands:

  • Select dataset: --dataset [RealEstate10K]
  • Max number of frames: --frame_limit [30]
  • Max number of sequences: --video_limit [100]
  • Use training set to evaluate: --train_set

Running this will generate a output folder where the results (videos and poses) save. If you want to visualize the pose, use packages for evaluation of odometry, such as evo. If you want to quantitatively evaluate the results, see 2.1, 2.2.

2.1 Quantitative Evaluation of synthesis results:

Use this script:

python eval_syn_re10k.py [OUTPUT_DIR] (for RealEstate10K)
python eval_syn_mp3d.py [OUTPUT_DIR] (for Matterport3D)

Optional commands:

  • Evaluate LPIPS: --lpips

2.2 Quantitative Evaluation of pose prediction results:

Use this script:

python eval_pose.py [POSE_DIR]

Contact

For any questions about the code or the paper, you can contact zihang.lai at gmail.com.

Owner
Working from home
This project provides the code and datasets for 'CapSal: Leveraging Captioning to Boost Semantics for Salient Object Detection', CVPR 2019.

Code-and-Dataset-for-CapSal This project provides the code and datasets for 'CapSal: Leveraging Captioning to Boost Semantics for Salient Object Detec

lu zhang 48 Aug 19, 2022
Official code repository for the EMNLP 2021 paper

Integrating Visuospatial, Linguistic and Commonsense Structure into Story Visualization PyTorch code for the EMNLP 2021 paper "Integrating Visuospatia

Adyasha Maharana 23 Dec 19, 2022
A project to build an AI voice assistant using Python . The Voice assistant interacts with the humans to perform basic tasks.

AI_Personal_Voice_Assistant_Using_Python A project to build an AI voice assistant using Python . The Voice assistant interacts with the humans to perf

Chumui Tripura 1 Oct 30, 2021
Implementation of "Meta-rPPG: Remote Heart Rate Estimation Using a Transductive Meta-Learner"

Meta-rPPG: Remote Heart Rate Estimation Using a Transductive Meta-Learner This repository is the official implementation of Meta-rPPG: Remote Heart Ra

Eugene Lee 137 Dec 13, 2022
phylotorch-bito is a package providing an interface to BITO for phylotorch

phylotorch-bito phylotorch-bito is a package providing an interface to BITO for phylotorch Dependencies phylotorch BITO Installation Get the source co

Mathieu Fourment 2 Sep 01, 2022
Unofficial Implementation of RobustSTL: A Robust Seasonal-Trend Decomposition Algorithm for Long Time Series (AAAI 2019)

RobustSTL: A Robust Seasonal-Trend Decomposition Algorithm for Long Time Series (AAAI 2019) This repository contains python (3.5.2) implementation of

Doyup Lee 222 Dec 21, 2022
A library for building and serving multi-node distributed faiss indices.

About Distributed faiss index service. A lightweight library that lets you work with FAISS indexes which don't fit into a single server memory. It fol

Meta Research 170 Dec 30, 2022
Photographic Image Synthesis with Cascaded Refinement Networks - Pytorch Implementation

Photographic Image Synthesis with Cascaded Refinement Networks-Pytorch (https://arxiv.org/abs/1707.09405) This is a Pytorch implementation of cascaded

Soumya Tripathy 63 Mar 27, 2022
EasyMocap is an open-source toolbox for markerless human motion capture from RGB videos.

EasyMocap is an open-source toolbox for markerless human motion capture from RGB videos. In this project, we provide the basic code for fitt

ZJU3DV 2.2k Jan 05, 2023
Personalized Transfer of User Preferences for Cross-domain Recommendation (PTUPCDR)

Personalized Transfer of User Preferences for Cross-domain Recommendation (PTUPCDR) This is the official implementation of our paper Personalized Tran

Yongchun Zhu 81 Dec 29, 2022
Check out the StyleGAN repo and place it in the same directory hierarchy as the present repo

Variational Model Inversion Attacks Kuan-Chieh Wang, Yan Fu, Ke Li, Ashish Khisti, Richard Zemel, Alireza Makhzani Most commands are in run_scripts. W

Jackson Wang 15 Dec 26, 2022
Unofficial pytorch implementation of paper "One-Shot Free-View Neural Talking-Head Synthesis for Video Conferencing"

One-Shot Free-View Neural Talking Head Synthesis Unofficial pytorch implementation of paper "One-Shot Free-View Neural Talking-Head Synthesis for Vide

ZLH 406 Dec 23, 2022
An official PyTorch implementation of the TKDE paper "Self-Supervised Graph Representation Learning via Topology Transformations".

Self-Supervised Graph Representation Learning via Topology Transformations This repository is the official PyTorch implementation of the following pap

Hsiang Gao 2 Oct 31, 2022
Hcaptcha-challenger - Gracefully face hCaptcha challenge with Yolov5(ONNX) embedded solution

hCaptcha Challenger 🚀 Gracefully face hCaptcha challenge with Yolov5(ONNX) embe

593 Jan 03, 2023
【ACMMM 2021】DSANet: Dynamic Segment Aggregation Network for Video-Level Representation Learning

DSANet: Dynamic Segment Aggregation Network for Video-Level Representation Learning (ACMMM 2021) Overview We release the code of the DSANet (Dynamic S

Wenhao Wu 46 Dec 27, 2022
Parametric Contrastive Learning (ICCV2021)

Parametric-Contrastive-Learning This repository contains the implementation code for ICCV2021 paper: Parametric Contrastive Learning (https://arxiv.or

DV Lab 156 Dec 21, 2022
The official code repository for examples in the O'Reilly book 'Generative Deep Learning'

Generative Deep Learning Teaching Machines to paint, write, compose and play The official code repository for examples in the O'Reilly book 'Generativ

David Foster 1.3k Dec 29, 2022
Object detection GUI based on PaddleDetection

PP-Tracking GUI界面测试版 本项目是基于飞桨开源的实时跟踪系统PP-Tracking开发的可视化界面 在PaddlePaddle中加入pyqt进行GUI页面研发,可使得整个训练过程可视化,并通过GUI界面进行调参,模型预测,视频输出等,通过多种类型的识别,简化整体预测流程。 GUI界面

杨毓栋 68 Jan 02, 2023
Official implementation of Self-supervised Image-to-text and Text-to-image Synthesis

Self-supervised Image-to-text and Text-to-image Synthesis This is the official implementation of Self-supervised Image-to-text and Text-to-image Synth

6 Jul 31, 2022
UV matrix decompostion using movielens dataset

UV-matrix-decompostion-with-kfold UV matrix decompostion using movielens dataset upload the 'ratings.dat' file install the following python libraries

2 Oct 18, 2022