Motionformer

This is the official PyTorch implementation of the paper Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers. In this repository, we provide PyTorch code for training and testing our proposed Motionformer model. Motionformer uses the proposed trajectory attention to achieve state-of-the-art results on several video action recognition benchmarks, such as Kinetics-400 and Something-Something V2.

If you find Motionformer useful in your research, please use the following BibTeX entry for citation.

@misc{patrick2021keeping,
      title={Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers}, 
      author={Mandela Patrick and Dylan Campbell and Yuki M. Asano and Ishan Misra and Florian Metze and Christoph Feichtenhofer and Andrea Vedaldi and Jo\~ao F. Henriques},
      year={2021},
      eprint={2106.05392},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Model Zoo

We provide Motionformer models pretrained on Kinetics-400 (K400), Kinetics-600 (K600), Something-Something-V2 (SSv2), and Epic-Kitchens datasets.

name	dataset	# of frames	spatial crop	acc@1	acc@5	url
Joint	K400	16	224	79.2	94.2	model
Divided	K400	16	224	78.5	93.8	model
Motionformer	K400	16	224	79.7	94.2	model
Motionformer-HR	K400	16	336	81.1	95.2	model
Motionformer-L	K400	32	224	80.2	94.8	model

name	dataset	# of frames	spatial crop	acc@1	acc@5	url
Motionformer	K600	16	224	81.6	95.6	model
Motionformer-HR	K600	16	336	82.7	96.1	model
Motionformer-L	K600	32	224	82.2	96.0	model

name	dataset	# of frames	spatial crop	acc@1	acc@5	url
Joint	SSv2	16	224	64.0	88.4	model
Divided	SSv2	16	224	64.2	88.6	model
Motionformer	SSv2	16	224	66.5	90.1	model
Motionformer-HR	SSv2	16	336	67.1	90.6	model
Motionformer-L	SSv2	32	224	68.1	91.2	model

name	dataset	# of frames	spatial crop	action acc	noun acc	url
Motionformer	EK	16	224	43.1	56.5	model
Motionformer-HR	EK	16	336	44.5	58.5	model
Motionformer-L	EK	32	224	44.1	57.6	model

Installation

First, create a conda virtual environment and activate it:

conda create -n motionformer python=3.8.5 -y
source activate motionformer

Then, install the following packages:

torchvision: pip install torchvision or conda install torchvision -c pytorch
fvcore: pip install 'git+https://github.com/facebookresearch/fvcore'
simplejson: pip install simplejson
einops: pip install einops
timm: pip install timm
PyAV: conda install av -c conda-forge
psutil: pip install psutil
scikit-learn: pip install scikit-learn
OpenCV: pip install opencv-python
tensorboard: pip install tensorboard
matplotlib: pip install matplotlib
pandas: pip install pandas
ffmpeg: pip install ffmpeg-python

Alternatively, create a conda environment with all packages from the YAML file:

conda env create -f environment.yml
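After either installation route, it can help to confirm that every package is importable before building the codebase. The following is a minimal sketch and not part of the repository: the helper name missing_modules is hypothetical, and the import names (av for PyAV, sklearn for scikit-learn, cv2 for opencv-python, ffmpeg for ffmpeg-python) are assumed to be the usual top-level module names for those packages.

```python
from importlib.util import find_spec

# Import names for the packages listed above (assumed to be the usual
# top-level module names; torch is included because torchvision depends on it).
REQUIRED_MODULES = [
    "torch", "torchvision", "fvcore", "simplejson", "einops", "timm",
    "av", "psutil", "sklearn", "cv2", "tensorboard", "matplotlib",
    "pandas", "ffmpeg",
]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

Run the script inside the activated motionformer environment. Any name it reports can be installed with the matching command from the list above.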

Lastly, build the Motionformer codebase by running:

git clone https://github.com/facebookresearch/Motionformer
cd Motionformer
python setup.py build develop

Usage

Dataset Preparation

Please use the dataset preparation instructions provided in DATASET.md.

Training the Default Motionformer

To train the default Motionformer, which uses trajectory attention and operates on 16-frame clips cropped at 224x224 spatial resolution, run the following command:

python tools/run_net.py \
  --cfg configs/K400/motionformer_224_16x4.yaml \
  DATA.PATH_TO_DATA_DIR path_to_your_dataset \
  NUM_GPUS 8 \
  TRAIN.BATCH_SIZE 8

You may need to pass the location of your dataset on the command line by adding DATA.PATH_TO_DATA_DIR path_to_your_dataset. Alternatively, add the following to the YAML config file:

DATA:
  PATH_TO_DATA_DIR: path_to_your_dataset

Then you do not need to pass it on the command line every time.
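The KEY VALUE pairs that follow --cfg override entries loaded from the YAML file. The sketch below illustrates that idea with plain nested dictionaries. It is not the repository's config code, whose actual handling may differ, and apply_overrides is a hypothetical helper introduced only for illustration.

```python
import ast


def apply_overrides(cfg, opts):
    """Apply ["KEY.SUBKEY", "value", ...] pairs to a nested dict in place."""
    if len(opts) % 2 != 0:
        raise ValueError("overrides must come in KEY VALUE pairs")
    for key, raw in zip(opts[0::2], opts[1::2]):
        *parents, leaf = key.split(".")
        node = cfg
        for part in parents:
            node = node.setdefault(part, {})
        try:
            # Interpret numbers and booleans such as 8 or False.
            value = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            # Anything else (for example a path) stays a string.
            value = raw
        node[leaf] = value
    return cfg


# Values that would normally come from the YAML file (illustrative only).
cfg = {"DATA": {"PATH_TO_DATA_DIR": ""}, "NUM_GPUS": 1, "TRAIN": {"BATCH_SIZE": 1}}
apply_overrides(cfg, ["DATA.PATH_TO_DATA_DIR", "path_to_your_dataset",
                      "NUM_GPUS", "8", "TRAIN.BATCH_SIZE", "8"])
print(cfg)
```

Command-line values take precedence over the file in this sketch, which matches the way the training command passes DATA.PATH_TO_DATA_DIR, NUM_GPUS, and TRAIN.BATCH_SIZE after the config path.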

Using a Different Number of GPUs

If you want to use a smaller number of GPUs, you need to modify the .yaml configuration files in configs/. Specifically, you need to modify the NUM_GPUS, TRAIN.BATCH_SIZE, TEST.BATCH_SIZE, and DATA_LOADER.NUM_WORKERS entries in each configuration file. Each BATCH_SIZE entry should be equal to or greater than the NUM_GPUS entry.
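The batch-size constraint above can be checked before launching a job. This sketch assumes, without confirmation from this README, that the global batch is split evenly across GPUs. The helper per_gpu_batch_size is hypothetical and not part of the codebase.

```python
def per_gpu_batch_size(batch_size, num_gpus):
    """Return the number of clips each GPU processes per step.

    Assumes the global batch is divided evenly across GPUs.
    """
    if num_gpus < 1:
        raise ValueError("NUM_GPUS must be at least 1")
    if batch_size < num_gpus:
        raise ValueError("BATCH_SIZE must be equal to or greater than NUM_GPUS")
    if batch_size % num_gpus != 0:
        raise ValueError("BATCH_SIZE should be a multiple of NUM_GPUS")
    return batch_size // num_gpus


print(per_gpu_batch_size(8, 8))  # 1 clip per GPU, as in the default command
print(per_gpu_batch_size(8, 4))  # 2 clips per GPU on a 4-GPU machine
```

Keeping the global batch size fixed while reducing NUM_GPUS increases the per-GPU load, so memory use per GPU grows accordingly.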

Using Different Self-Attention Schemes

If you want to experiment with different space-time self-attention schemes, e.g., joint space-time attention or divided space-time attention, use the following commands:

python tools/run_net.py \
  --cfg configs/K400/joint_224_16x4.yaml \
  DATA.PATH_TO_DATA_DIR path_to_your_dataset \
  NUM_GPUS 8 \
  TRAIN.BATCH_SIZE 8

and

python tools/run_net.py \
  --cfg configs/K400/divided_224_16x4.yaml \
  DATA.PATH_TO_DATA_DIR path_to_your_dataset \
  NUM_GPUS 8 \
  TRAIN.BATCH_SIZE 8

Training Different Motionformer Variants

If you want to train more powerful Motionformer variants, e.g., Motionformer-HR (operating on 16-frame clips sampled at 336x336 spatial resolution), and Motionformer-L (operating on 32-frame clips sampled at 224x224 spatial resolution), use the following commands:

python tools/run_net.py \
  --cfg configs/K400/motionformer_336_16x8.yaml \
  DATA.PATH_TO_DATA_DIR path_to_your_dataset \
  NUM_GPUS 8 \
  TRAIN.BATCH_SIZE 8

and

python tools/run_net.py \
  --cfg configs/K400/motionformer_224_32x3.yaml \
  DATA.PATH_TO_DATA_DIR path_to_your_dataset \
  NUM_GPUS 8 \
  TRAIN.BATCH_SIZE 8

Note that for these models you will need a set of GPUs with ~32GB of memory.

Inference

Use TRAIN.ENABLE and TEST.ENABLE to control whether training or testing is required for a given run. When testing, you also have to provide the path to the checkpoint model via TEST.CHECKPOINT_FILE_PATH.

python tools/run_net.py \
  --cfg configs/K400/motionformer_224_16x4.yaml \
  DATA.PATH_TO_DATA_DIR path_to_your_dataset \
  TEST.CHECKPOINT_FILE_PATH path_to_your_checkpoint \
  TRAIN.ENABLE False

Alternatively, you can modify the provided Slurm script and run the following:

sbatch slurm_scripts/test.sh configs/K400/motionformer_224_16x4.yaml path_to_your_checkpoint

Single-Node Training via Slurm

To train Motionformer via Slurm, please check out our single-node Slurm training script, slurm_scripts/run_single_node_job.sh.

sbatch slurm_scripts/run_single_node_job.sh configs/K400/motionformer_224_16x4.yaml /your/job/dir/${JOB_NAME}/

Multi-Node Training via Submitit

Distributed training is available via Slurm and submitit:

pip install submitit

To train a Motionformer model on Kinetics using 8 nodes with 8 GPUs each, use the following command:

python run_with_submitit.py --cfg configs/K400/motionformer_224_16x4.yaml --job_dir /your/job/dir/${JOB_NAME}/ --partition $PARTITION --num_shards 8 --use_volta32

We provide a script for launching Slurm jobs in slurm_scripts/run_multi_node_job.sh.

sbatch slurm_scripts/run_multi_node_job.sh configs/K400/motionformer_224_16x4.yaml /your/job/dir/${JOB_NAME}/

Please note that the hyperparameters in the configs were used with 8 nodes of 8 GPUs each (32 GB per GPU). Scale the batch size and learning rate appropriately for your cluster configuration.
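One common heuristic for adjusting the learning rate to a different cluster is the linear scaling rule, which scales the learning rate in proportion to the global batch size. This README does not say which rule the authors recommend, so treat the following as a sketch under that assumption. The reference learning rate and batch size are placeholders, not values taken from the configs, and scale_learning_rate is a hypothetical helper.

```python
def scale_learning_rate(base_lr, base_batch_size, new_batch_size):
    """Linear scaling rule: keep lr / batch_size constant."""
    if base_batch_size <= 0 or new_batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    return base_lr * new_batch_size / base_batch_size


# Reference setup from the text: 8 nodes with 8 GPUs each.
reference_gpus = 8 * 8

# Placeholder numbers for illustration only; read the real ones from your config.
base_lr = 1e-4
base_batch_size = 64
new_batch_size = 8  # for example, a single 8-GPU machine

print(reference_gpus)  # 64
print(scale_learning_rate(base_lr, base_batch_size, new_batch_size))
```

The scaled value is only a starting point. Warmup length and other schedule settings may also need adjusting when the batch size changes substantially.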

Finetuning

To finetune from an existing PyTorch checkpoint, add the following options on the command line, or set them in the YAML config:

TRAIN.CHECKPOINT_EPOCH_RESET True
TRAIN.CHECKPOINT_FILE_PATH path_to_your_PyTorch_checkpoint

Environment

The code was developed using Python 3.8.5 on Ubuntu 20.04. For training, we used eight GPU compute nodes, each containing 8 Tesla V100 GPUs (64 GPUs in total). Other platforms or GPU cards have not been fully tested.

License

The majority of this work is licensed under the CC-NC 4.0 International license. However, portions of the project are available under separate license terms: SlowFast and pytorch-image-models are licensed under the Apache 2.0 license.

Contributing

We actively welcome your pull requests. Please see CONTRIBUTING.md and CODE_OF_CONDUCT.md for more info.

Acknowledgements

Motionformer is built on top of PySlowFast, TimeSformer, and pytorch-image-models by Ross Wightman. We thank the authors for releasing their code. If you use our model, please consider citing these works as well:

@misc{fan2020pyslowfast,
  author =       {Haoqi Fan and Yanghao Li and Bo Xiong and Wan-Yen Lo and
                  Christoph Feichtenhofer},
  title =        {PySlowFast},
  howpublished = {\url{https://github.com/facebookresearch/slowfast}},
  year =         {2020}
}

@inproceedings{gberta_2021_ICML,
    author  = {Gedas Bertasius and Heng Wang and Lorenzo Torresani},
    title = {Is Space-Time Attention All You Need for Video Understanding?},
    booktitle   = {Proceedings of the International Conference on Machine Learning (ICML)}, 
    month = {July},
    year = {2021}
}

@misc{rw2019timm,
  author = {Ross Wightman},
  title = {PyTorch Image Models},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  doi = {10.5281/zenodo.4414861},
  howpublished = {\url{https://github.com/rwightman/pytorch-image-models}}
}
