XViT - Space-time Mixing Attention for Video Transformer

Overview

XViT - Space-time Mixing Attention for Video Transformer

This is the official implementation of the XViT paper:

@inproceedings{bulat2021space,
  title={Space-time Mixing Attention for Video Transformer},
  author={Bulat, Adrian and Perez-Rua, Juan-Manuel and Sudhakaran, Swathikiran and Martinez, Brais and Tzimiropoulos, Georgios},
  booktitle={NeurIPS},
  year={2021}
}

In XViT, we introduce a novel Video Transformer model the complexity of which scales linearly with the number of frames in the video sequence and hence induces no overhead compared to an image-based Transformer model. To achieve this, our model makes two approximations to the full space-time attention used in Video Transformers: (a) It restricts time attention to a local temporal window and capitalizes on the Transformer's depth to obtain full temporal coverage of the video sequence. (b) It uses efficient space-time mixing to attend jointly spatial and temporal locations without inducing any additional cost on top of a spatial-only attention model. We also show how to integrate 2 very lightweight mechanisms for global temporal-only attention which provide additional accuracy improvements at minimal computational cost. Our model produces very high recognition accuracy on the most popular video recognition datasets while at the same time is significantly more efficient than other Video Transformer models.

Attention pattern

Model Zoo

We provide a series of models pre-trained on Kinetics-600 and Something-Something-v2.

Kinetics-600

Architecture frames views Top-1 Top-5 url
XViT-B16 16 3x1 84.51% 96.26% model
XViT-B16 16 3x2 84.71% 96.39% model

Something-Something-V2

Architecture frames views Top-1 Top-5 url
XViT-B16 16 32x2 67.19% 91.00% model

Installation

Please make sure your setup satisfies the following requirements:

Requirements

Largely follows the original SlowFast repo requirements:

  • Python >= 3.8
  • Numpy
  • PyTorch >= 1.3
  • hdf5
  • fvcore: pip install 'git+https://github.com/facebookresearch/fvcore'
  • torchvision that matches the PyTorch installation. You can install them together at pytorch.org to make sure of this.
  • simplejson: pip install simplejson
  • GCC >= 4.9
  • PyAV: conda install av -c conda-forge
  • ffmpeg (4.0 is prefereed, will be installed along with PyAV)
  • PyYaml: (will be installed along with fvcore)
  • tqdm: (will be installed along with fvcore)
  • iopath: pip install -U iopath or conda install -c iopath iopath
  • psutil: pip install psutil
  • OpenCV: pip install opencv-python
  • torchvision: pip install torchvision or conda install torchvision -c pytorch
  • tensorboard: pip install tensorboard
  • PyTorchVideo: pip install pytorchvideo
  • Detectron2:
    pip install -U torch torchvision cython
    pip install -U 'git+https://github.com/facebookresearch/fvcore.git' 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'
    git clone https://github.com/facebookresearch/detectron2 detectron2_repo
    pip install -e detectron2_repo
    # You can find more details at https://github.com/facebookresearch/detectron2/blob/master/INSTALL.md

Datasets

1. Kenetics

You can download Kinetics 400/600 datasets following the instructions provided by the cvdfundation repo: https://github.com/cvdfoundation/kinetics-dataset

Afterwars, resize the videos to the shorte edge size of 256 and prepare the csv files for training, validation in testting: train.csv, val.csv, test.csv. The formatof the csv file is:

path_to_video_1 label_1
path_to_video_2 label_2
...
path_to_video_N label_N

Depending on your system, we recommend decoding the videos to frames and then packing each set of frames into a h5 file with the same name as the original video.

2. Something-Something v2

You can download the datasets from the authors webpage: https://20bn.com/datasets/something-something

Perform the same packing procedure as for Kinetics.

Usage

Training

python tools/run_net.py \
  --cfg configs/Kinetics/xvit_B16_16x16_k600.yaml \
  DATA.PATH_TO_DATA_DIR path_to_your_dataset

Evaluation

python tools/run_net.py \
  --cfg configs/Kinetics/xvit_B16_16x16_k600.yaml \
  DATA.PATH_TO_DATA_DIR path_to_your_dataset \
  TEST.CHECKPOINT_FILE_PATH path_to_your_checkpoint \
  TRAIN.ENABLE False \

Acknowledgements

This repo is built using components from SlowFast and timm

License

XViT code is released under the Apache 2.0 license.

Owner
Adrian Bulat
AI Researcher at Samsung AI, member of the deeplearning cult.
Adrian Bulat
CS50x-AI - Artificial Intelligence with Python from Harvard University

CS50x-AI Artificial Intelligence with Python from Harvard University 📖 Table of

Hosein Damavandi 6 Aug 22, 2022
This repository contains the implementation of the following paper: Cross-Descriptor Visual Localization and Mapping

Cross-Descriptor Visual Localization and Mapping This repository contains the implementation of the following paper: "Cross-Descriptor Visual Localiza

Mihai Dusmanu 81 Oct 06, 2022
This repository contains the source codes for the paper AtlasNet V2 - Learning Elementary Structures.

AtlasNet V2 - Learning Elementary Structures This work was build upon Thibault Groueix's AtlasNet and 3D-CODED projects. (you might want to have a loo

Théo Deprelle 123 Nov 11, 2022
ANEA: Distant Supervision for Low-Resource Named Entity Recognition

ANEA: Distant Supervision for Low-Resource Named Entity Recognition ANEA is a tool to automatically annotate named entities in unlabeled text based on

Saarland University Spoken Language Systems Group 15 Mar 30, 2022
PyTorch implementation of CDistNet: Perceiving Multi-Domain Character Distance for Robust Text Recognition

PyTorch implementation of CDistNet: Perceiving Multi-Domain Character Distance for Robust Text Recognition The unofficial code of CDistNet. Now, we ha

25 Jul 20, 2022
Molecular AutoEncoder in PyTorch

MolEncoder Molecular AutoEncoder in PyTorch Install $ git clone https://github.com/cxhernandez/molencoder.git && cd molencoder $ python setup.py insta

Carlos Hernández 80 Dec 05, 2022
One Million Scenes for Autonomous Driving

ONCE Benchmark This is a reproduced benchmark for 3D object detection on the ONCE (One Million Scenes) dataset. The code is mainly based on OpenPCDet.

148 Dec 28, 2022
[CVPR 2022] Deep Equilibrium Optical Flow Estimation

Deep Equilibrium Optical Flow Estimation This is the official repo for the paper Deep Equilibrium Optical Flow Estimation (CVPR 2022), by Shaojie Bai*

CMU Locus Lab 136 Dec 18, 2022
This is an official implementation for "SimMIM: A Simple Framework for Masked Image Modeling".

Project This repo has been populated by an initial template to help get you started. Please make sure to update the content to build a great experienc

Microsoft 674 Dec 26, 2022
Official PyTorch implementation and pretrained models of the paper Self-Supervised Classification Network

Self-Classifier: Self-Supervised Classification Network Official PyTorch implementation and pretrained models of the paper Self-Supervised Classificat

Elad Amrani 24 Dec 21, 2022
We present a regularized self-labeling approach to improve the generalization and robustness properties of fine-tuning.

Overview This repository provides the implementation for the paper "Improved Regularization and Robustness for Fine-tuning in Neural Networks", which

NEU-StatsML-Research 21 Sep 08, 2022
Research into Forex price prediction from price history using Deep Sequence Modeling with Stacked LSTMs.

Forex Data Prediction via Recurrent Neural Network Deep Sequence Modeling Research Paper Our research paper can be viewed here Installation Clone the

Alex Taradachuk 2 Aug 07, 2022
Trajectory Extraction of road users via Traffic Camera

Traffic Monitoring Citation The associated paper for this project will be published here as soon as possible. When using this software, please cite th

Julian Strosahl 14 Dec 17, 2022
Implementation of "RaScaNet: Learning Tiny Models by Raster-Scanning Image" from CVPR 2021.

RaScaNet: Learning Tiny Models by Raster-Scanning Images Deploying deep convolutional neural networks on ultra-low power systems is challenging, becau

SAIT (Samsung Advanced Institute of Technology) 5 Dec 26, 2022
A pytorch implementation of the ACL2019 paper "Simple and Effective Text Matching with Richer Alignment Features".

RE2 This is a pytorch implementation of the ACL 2019 paper "Simple and Effective Text Matching with Richer Alignment Features". The original Tensorflo

287 Dec 21, 2022
Official code of ICCV2021 paper "Residual Attention: A Simple but Effective Method for Multi-Label Recognition"

CSRA This is the official code of ICCV 2021 paper: Residual Attention: A Simple But Effective Method for Multi-Label Recoginition Demo, Train and Vali

163 Dec 22, 2022
Adversarial Autoencoders

Adversarial Autoencoders (with Pytorch) Dependencies argparse time torch torchvision numpy itertools matplotlib Create Datasets python create_datasets

Felipe Ducau 188 Jan 01, 2023
Bu repo SAHI uygulamasını mantığını öğreniyoruz.

SAHI-Learn: SAHI'den Beraber Kodlamak İster Misiniz Herkese merhabalar ben Kadir Nar. SAHI kütüphanesine gönüllü geliştiriciyim. Bu repo SAHI kütüphan

Kadir Nar 11 Aug 22, 2022
Code basis for the paper "Camera Condition Monitoring and Readjustment by means of Noise and Blur" (2021)

Camera Condition Monitoring and Readjustment by means of Noise and Blur This repository contains the source code of the paper: Wischow, M., Gallego, G

7 Dec 22, 2022
UpChecker is a simple opensource project to host it fast on your server and check is server up, view statistic, get messages if it is down. UpChecker - just run file and use project easy

UpChecker UpChecker is a simple opensource project to host it fast on your server and check is server up, view statistic, get messages if it is down.

Yan 4 Apr 07, 2022