VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning

Related tags

Deep Learningvimpac
Overview

VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning

This is a release of our VIMPAC paper to illustrate the implementations. The pretrained checkpoints and scripts will be soon open-sourced in HuggingFace transformers.

Authors: Hao Tan, Jie Lei, Thomas Wolf, Mohit Bansal

Data Preprocessing

Please refer to video2token folder for the detailed README file.

For pre-training, the dataset is usually large, and we suggest to use FPS=2 during extraction. For downstream tasks, we suggest using FPS=16 that enables a higher frame rate for short videos.

We recommend to store the data locally at data/video_tokens. If different paths are used, please specify the path of VIDEO_CODE_PATHS and VIDEO_ANNO_PATHS in vimpac/data.py.

Pre-Trained Weights

We provide the pre-trained weights with their links. Please download the pre-trained weight and extract them under snap/.

Pre-Training

The default pre-training uses the HowTo100M dataset. The pre-training data could be switched to Kinetics-700 and other datasets by specifying the --dataset-name argument. We have validated that the mask-then-predict task works reasonablely well on Kinetics-700 datasets. However, the average length of video clips inside K-700 is 10 seconds thus not sure supporting the long-range contrastive learning.

Small Model

We first provide the script to pre-train a small model (6 layers, 512 dimensions, 256 frame-size, and 5 clip length):

bash scripts/pretrain/small.sh 0,1,2,3

We here annotate some essential arguments inside the pre-training scripts. For a full descriptions for all the arguments, please check param.py

We also provide two debugging options:

# bash scripts/pretrain/small.sh 0,1,2,3 --tqdm        # Show progress bar.
# bash scripts/pretrain/small.sh 0,1,2,3 --debug       # Only run a few steps per epoch.

Large Model

We follow BERT to pre-train our large model in two stages. The first stage pretrains for 90 epochs using frame-size 128 and clip-length 5. The second stage pretrains for 10 epochs using frame-size 256 and clip-length 5.

Scripts for the first stage:

bash scripts/pretrain/large.sh 0,1,2,3

Then we could directly run the script for the second stage without any further changes. It will load the last snapshot from the first stage, do interpolation for larger spatial size, and continue pre-training.

bash scripts/pretrain/large_frame256cont.sh 0,1,2,3

Fine-Tuning

After run the pre-training in pre-training or download the pre-trained weights from pre-trained-weights, we fine-tune the models on several downstream tasks. The arguments in these scripts are consistent with the hyperparameters in the paper. Please refer to Table 11 and Table 12 of our paper for a detailed list of all these hyperparameters.

SSV2

bash scripts/finetune/small_ssv2.sh 0,1,2,3

Diving48

bash scripts/finetune/small_diving48.sh 0,1,2,3

UCF101

bash scripts/finetune/small_ucf101.sh 0,1,2,3

HMDB51

bash scripts/finetune/small_hmdb51.sh 0,1,2,3

Change the Input Shape

Following ViT, we support the use of different input sizes from pre-training by interpolating the positional embedding. This is done by passing the --different-shape option. Otherwise, an error will pop up if the fine-tuning input shape is different from the pre-training. A larger input shape generally improves the results. We here take SSV2 as an example.

Longer clip length (10; default 5):

bash scripts/finetune/small_ssv2.sh 0,1,2,3 --different-shape --clip-len 10 --bs-per-gpu 4

Long clip length (10; default 5) + higher frame rate (4; default 2)

bash scripts/finetune/small_ssv2.sh 0,1,2,3 --different-shape --clip-len 10 --frame-rate 4 --bs-per-gpu 4

Long clip length (10; default 5) + higher frame rate (4; default 2) + larger input size (256; default 128). Please also make sure that VQ-VAE code with input-size 256 has been extracted as in Pre-processing.

bash scripts/finetune/small_ssv2.sh 0,1,2,3 --different-shape --clip-len 10 --frame-rate 4 --frame-size 256 --bs-per-gpu 2

Large Models

We provide scripts to run large models. Frame 128:

bash scripts/finetune/large_frame128_ucf101.sh 0,1,2,3

Frame 256:

bash scripts/finetune/large_frame256_ucf101.sh 0,1,2,3

The input shape could be changed as in change input shape. Our final model use the scripts of:

bash scripts/finetune/large_frame256_ucf101.sh 0,1,2,3 --different-shape --clip-len 10 --frame-rate 4 --frame-size 256 --bs-per-gpu 2

Acknowledgement

This work was granted access to the HPC resources of IDRIS under the allocation 20XX-AD011011621R1 made by GENCI. We thank Teven Le Scao and Victor Sanh for their help on the way.

Owner
Hao Tan
NLP @ UNC Chapel Hill
Hao Tan
Unpaired Caricature Generation with Multiple Exaggerations

CariMe-pytorch The official pytorch implementation of the paper "CariMe: Unpaired Caricature Generation with Multiple Exaggerations" CariMe: Unpaired

Gu Zheng 37 Dec 30, 2022
Implementations of orthogonal and semi-orthogonal convolutions in the Fourier domain with applications to adversarial robustness

Orthogonalizing Convolutional Layers with the Cayley Transform This repository contains implementations and source code to reproduce experiments for t

CMU Locus Lab 36 Dec 30, 2022
Face Identity Disentanglement via Latent Space Mapping [SIGGRAPH ASIA 2020]

Face Identity Disentanglement via Latent Space Mapping Description Official Implementation of the paper Face Identity Disentanglement via Latent Space

150 Dec 07, 2022
This is a Keras-based Python implementation of DeepMask- a complex deep neural network for learning object segmentation masks

NNProject - DeepMask This is a Keras-based Python implementation of DeepMask- a complex deep neural network for learning object segmentation masks. Th

189 Nov 16, 2022
GANsformer: Generative Adversarial Transformers Drew A

GANformer: Generative Adversarial Transformers Drew A. Hudson* & C. Lawrence Zitnick Update: We released the new GANformer2 paper! *I wish to thank Ch

Drew Arad Hudson 1.2k Jan 02, 2023
Progressive Coordinate Transforms for Monocular 3D Object Detection

Progressive Coordinate Transforms for Monocular 3D Object Detection This repository is the official implementation of PCT. Introduction In this paper,

58 Nov 06, 2022
Code accompanying our paper Feature Learning in Infinite-Width Neural Networks

Empirical Experiments in "Feature Learning in Infinite-width Neural Networks" This repo contains code to replicate our experiments (Word2Vec, MAML) in

Edward Hu 37 Dec 14, 2022
Full body anonymization - Realistic Full-Body Anonymization with Surface-Guided GANs

Realistic Full-Body Anonymization with Surface-Guided GANs This is the official

Håkon Hukkelås 30 Nov 18, 2022
DilatedNet in Keras for image segmentation

Keras implementation of DilatedNet for semantic segmentation A native Keras implementation of semantic segmentation according to Multi-Scale Context A

303 Mar 15, 2022
Async API for controlling Hue Lights

Hue API Async API for controlling Hue Lights Documentation: hue-api.nirantak.com Source: github.com/nirantak/hue-api Installation This is an async cli

Nirantak Raghav 4 Nov 16, 2022
This is the solution for 2nd rank in Kaggle competition: Feedback Prize - Evaluating Student Writing.

Feedback Prize - Evaluating Student Writing This is the solution for 2nd rank in Kaggle competition: Feedback Prize - Evaluating Student Writing. The

Udbhav Bamba 41 Dec 14, 2022
Source code for CAST - Crisis Domain Adaptation Using Sequence-to-sequence Transformers (Accepted to ISCRAM 2021, CorePaper).

Source code for CAST: Crisis Domain Adaptation UsingSequence-to-sequenceTransformers (Paper, BibTeX, Accepted to ISCRAM 2021, CorePaper) Quick start D

Congcong Wang 0 Jul 14, 2021
efficient neural audio synthesis in the waveform domain

neural waveshaping synthesis real-time neural audio synthesis in the waveform domain paper • website • colab • audio by Ben Hayes, Charalampos Saitis,

Ben Hayes 169 Dec 23, 2022
On Size-Oriented Long-Tailed Graph Classification of Graph Neural Networks

On Size-Oriented Long-Tailed Graph Classification of Graph Neural Networks We provide the code (in PyTorch) and datasets for our paper "On Size-Orient

Zemin Liu 4 Jun 18, 2022
High-quality single file implementation of Deep Reinforcement Learning algorithms with research-friendly features

CleanRL (Clean Implementation of RL Algorithms) CleanRL is a Deep Reinforcement Learning library that provides high-quality single-file implementation

Costa Huang 1.8k Jan 01, 2023
Official Pytorch implementation of ICLR 2018 paper Deep Learning for Physical Processes: Integrating Prior Scientific Knowledge.

Deep Learning for Physical Processes: Integrating Prior Scientific Knowledge: Official Pytorch implementation of ICLR 2018 paper Deep Learning for Phy

emmanuel 47 Nov 06, 2022
Rapid experimentation and scaling of deep learning models on molecular and crystal graphs.

LitMatter A template for rapid experimentation and scaling deep learning models on molecular and crystal graphs. How to use Clone this repository and

Nathan Frey 32 Dec 06, 2022
Learning Time-Critical Responses for Interactive Character Control

Learning Time-Critical Responses for Interactive Character Control Abstract This code implements the paper Learning Time-Critical Responses for Intera

Movement Research Lab 227 Dec 31, 2022
A Python library that enables ML teams to share, load, and transform data in a collaborative, flexible, and efficient way :chestnut:

Squirrel Core Share, load, and transform data in a collaborative, flexible, and efficient way What is Squirrel? Squirrel is a Python library that enab

Merantix Momentum 249 Dec 07, 2022
Numenta Platform for Intelligent Computing is an implementation of Hierarchical Temporal Memory (HTM), a theory of intelligence based strictly on the neuroscience of the neocortex.

NuPIC Numenta Platform for Intelligent Computing The Numenta Platform for Intelligent Computing (NuPIC) is a machine intelligence platform that implem

Numenta 6.3k Dec 30, 2022