This is the official implementation of "Video Swin Transformer".

Last update: Jan 03, 2023

Overview

Video Swin Transformer

By Ze Liu*, Jia Ning*, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin and Han Hu.

This repo is the official implementation of "Video Swin Transformer". It is based on mmaction2.

Updates

06/25/2021 Initial commits

Introduction

Video Swin Transformer was first described in "Video Swin Transformer", which advocates an inductive bias of locality in video Transformers. This leads to a better speed-accuracy trade-off than previous approaches, which compute self-attention globally even with spatial-temporal factorization. The locality of the proposed video architecture is realized by adapting the Swin Transformer designed for the image domain, while continuing to leverage the power of pre-trained image models. Our approach achieves state-of-the-art accuracy on a broad range of video recognition benchmarks, including action recognition (84.9 top-1 accuracy on Kinetics-400 and 86.1 top-1 accuracy on Kinetics-600 with ~20x less pre-training data and ~3x smaller model size) and temporal modeling (69.6 top-1 accuracy on Something-Something v2).
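
To make the locality argument concrete, the following back-of-the-envelope sketch counts query-key pairs for global versus window-based self-attention on the first-stage token grid. It reads the `patch244` and `window877` parts of the Swin-T config name as a patch size of (2, 4, 4) and a window size of (8, 7, 7), which is an interpretation rather than something stated here. The 32-frame 224x224 input clip is also an assumption, and the helper names are hypothetical. Shifted windows, later stages, and attention heads are ignored.

```python
from math import prod


def token_grid(clip_shape, patch_size):
    """Tokens along (time, height, width) after non-overlapping patch embedding."""
    return tuple(c // p for c, p in zip(clip_shape, patch_size))


def attention_pairs_global(grid):
    """Every token attends to every token."""
    n = prod(grid)
    return n * n


def attention_pairs_windowed(grid, window_size):
    """Tokens attend only within their own 3D window (grid assumed divisible)."""
    num_windows = prod(g // w for g, w in zip(grid, window_size))
    tokens_per_window = prod(window_size)
    return num_windows * tokens_per_window ** 2


clip_shape = (32, 224, 224)  # assumed input: frames, height, width
patch_size = (2, 4, 4)       # assumed reading of "patch244"
window_size = (8, 7, 7)      # assumed reading of "window877"

grid = token_grid(clip_shape, patch_size)
global_pairs = attention_pairs_global(grid)
window_pairs = attention_pairs_windowed(grid, window_size)

print("token grid:", grid)
print("global pairs:", global_pairs)
print("windowed pairs:", window_pairs)
print("reduction factor:", global_pairs // window_pairs)
```

Under these assumptions the token grid is 16x56x56, and windowed attention evaluates 128 times fewer query-key pairs than global attention over the same tokens.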

Results and Models

Kinetics 400

Backbone	Pretrain	Lr Schd	spatial crop	acc@1	acc@5	#params	FLOPs	config	model
Swin-T	ImageNet-1K	30ep	224	78.8	93.6	28M	87.9G	config	github/baidu
Swin-S	ImageNet-1K	30ep	224	80.6	94.5	50M	165.9G	config	github/baidu
Swin-B	ImageNet-1K	30ep	224	80.6	94.6	88M	281.6G	config	github/baidu
Swin-B	ImageNet-22K	30ep	224	82.7	95.5	88M	281.6G	config	github/baidu

Kinetics 600

Backbone	Pretrain	Lr Schd	spatial crop	acc@1	acc@5	#params	FLOPs	config	model
Swin-B	ImageNet-22K	30ep	224	84.0	96.5	88M	281.6G	config	github/baidu

Something-Something V2

Backbone	Pretrain	Lr Schd	spatial crop	acc@1	acc@5	#params	FLOPs	config	model
Swin-B	Kinetics 400	60ep	224	69.6	92.7	89M	320.6G	config	github/baidu

Notes:

Pre-trained image models can be downloaded from Swin Transformer for ImageNet Classification.
The pre-trained model for SSv2 can be downloaded at github/baidu.
The access code for baidu is swin.

Usage

Installation

Please refer to install.md for installation.

We also provide Dockerfiles for cuda10.1 (image url) and cuda11.0 (image url) for convenience.

Data Preparation

Please refer to data_preparation.md for a general overview of data preparation. The supported datasets are listed in supported_datasets.md.

Inference

# single-gpu testing
python tools/test.py <CONFIG_FILE> <CHECKPOINT_FILE> --eval top_k_accuracy

# multi-gpu testing
bash tools/dist_test.sh <CONFIG_FILE> <CHECKPOINT_FILE> <GPU_NUM> --eval top_k_accuracy
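
As a rough illustration of what a top-k accuracy metric measures, here is a minimal pure-Python sketch. It is not mmaction2's implementation; the function name and the toy scores are made up for illustration.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for sample_scores, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(
            range(len(sample_scores)),
            key=lambda c: sample_scores[c],
            reverse=True,
        )
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Toy per-class scores for four clips and three classes.
scores = [
    [0.10, 0.70, 0.20],
    [0.50, 0.30, 0.20],
    [0.20, 0.30, 0.50],
    [0.40, 0.35, 0.25],
]
labels = [1, 1, 0, 2]

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
print(top1, top2)
```

Only the first clip has its true label ranked first, so top-1 accuracy is 0.25; the second clip's label is ranked second, which raises top-2 accuracy to 0.5.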

Training

To train a video recognition model with pre-trained image models (for the Kinetics-400 and Kinetics-600 datasets), run:

# single-gpu training
python tools/train.py <CONFIG_FILE> --cfg-options model.backbone.pretrained=<PRETRAIN_MODEL> [model.backbone.use_checkpoint=True] [other optional arguments]

# multi-gpu training
bash tools/dist_train.sh <CONFIG_FILE> <GPU_NUM> --cfg-options model.backbone.pretrained=<PRETRAIN_MODEL> [model.backbone.use_checkpoint=True] [other optional arguments]

For example, to train a Swin-T model on the Kinetics-400 dataset with 8 GPUs, run:

bash tools/dist_train.sh configs/recognition/swin/swin_tiny_patch244_window877_kinetics400_1k.py 8 --cfg-options model.backbone.pretrained=<PRETRAIN_MODEL>

To train a video recognizer with pre-trained video models (for the Something-Something v2 dataset), run:

# single-gpu training
python tools/train.py <CONFIG_FILE> --cfg-options load_from=<PRETRAIN_MODEL> [model.backbone.use_checkpoint=True] [other optional arguments]

# multi-gpu training
bash tools/dist_train.sh <CONFIG_FILE> <GPU_NUM> --cfg-options load_from=<PRETRAIN_MODEL> [model.backbone.use_checkpoint=True] [other optional arguments]

For example, to train a Swin-B model on the SSv2 dataset with 8 GPUs, run:

bash tools/dist_train.sh configs/recognition/swin/swin_base_patch244_window1677_sthv2.py 8 --cfg-options load_from=<PRETRAIN_MODEL>

Note: use_checkpoint is used to save GPU memory. Please refer to this page for more details.
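
The two training variants above differ only in which option carries the pre-trained weights: `model.backbone.pretrained` for image pre-training and `load_from` for video pre-training. The sketch below assembles the corresponding command line from Python. `build_train_command` is a hypothetical helper, not part of this repo, and the checkpoint path is a placeholder.

```python
import shlex


def build_train_command(config, pretrained, gpus=1,
                        video_pretrained=False, use_checkpoint=False):
    """Assemble the argv list for the training commands shown above.

    video_pretrained=False -> model.backbone.pretrained=<PRETRAIN_MODEL>
    video_pretrained=True  -> load_from=<PRETRAIN_MODEL>
    """
    if gpus > 1:
        argv = ["bash", "tools/dist_train.sh", config, str(gpus)]
    else:
        argv = ["python", "tools/train.py", config]

    key = "load_from" if video_pretrained else "model.backbone.pretrained"
    argv += ["--cfg-options", f"{key}={pretrained}"]

    if use_checkpoint:
        # Optional flag described in the note above (saves GPU memory).
        argv.append("model.backbone.use_checkpoint=True")
    return argv


cmd = build_train_command(
    "configs/recognition/swin/swin_base_patch244_window1677_sthv2.py",
    pretrained="path/to/pretrained_video_model.pth",  # placeholder path
    gpus=8,
    video_pretrained=True,
    use_checkpoint=True,
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root.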

Apex (optional):

We use apex for mixed precision training by default. To install apex, use our provided docker or run:

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

If you would like to disable apex, comment out the following code block in the configuration files:

# do not use mmcv version fp16
fp16 = None
optimizer_config = dict(
    type="DistOptimizerHook",
    update_interval=1,
    grad_clip=None,
    coalesce=True,
    bucket_size_mb=-1,
    use_fp16=True,
)

Citation

If you find our work useful in your research, please cite:

@article{liu2021video,
  title={Video Swin Transformer},
  author={Liu, Ze and Ning, Jia and Cao, Yue and Wei, Yixuan and Zhang, Zheng and Lin, Stephen and Hu, Han},
  journal={arXiv preprint arXiv:2106.13230},
  year={2021}
}

@article{liu2021Swin,
  title={Swin Transformer: Hierarchical Vision Transformer using Shifted Windows},
  author={Liu, Ze and Lin, Yutong and Cao, Yue and Hu, Han and Wei, Yixuan and Zhang, Zheng and Lin, Stephen and Guo, Baining},
  journal={arXiv preprint arXiv:2103.14030},
  year={2021}
}
