Video Swin Transformer - PyTorch

Last update: Dec 20, 2022

Overview

Video-Swin-Transformer-Pytorch

This repo provides a simple usage example of the official "Video Swin Transformer" implementation.

Introduction

Video Swin Transformer was first described in "Video Swin Transformer", which advocates an inductive bias of locality in video Transformers. This leads to a better speed-accuracy trade-off than previous approaches, which compute self-attention globally even with spatial-temporal factorization. The locality of the proposed video architecture is realized by adapting the Swin Transformer designed for the image domain, while continuing to leverage the power of pre-trained image models. The approach achieves state-of-the-art accuracy on a broad range of video recognition benchmarks, including action recognition (84.9 top-1 accuracy on Kinetics-400 and 86.1 top-1 accuracy on Kinetics-600 with ~20x less pre-training data and ~3x smaller model size) and temporal modeling (69.6 top-1 accuracy on Something-Something v2).

Usage

Installation

pip install -r requirements.txt

If this does not work, please refer to the official install.md for installation.

Prepare

git clone https://github.com/haofanwang/video-swin-transformer-pytorch.git

cd video-swin-transformer-pytorch
mkdir checkpoints && cd checkpoints
wget https://github.com/SwinTransformer/storage/releases/download/v1.0.4/swin_base_patch244_window1677_sthv2.pth
cd ..

To try a different model, refer to the official Video-Swin-Transformer repository, download the corresponding pretrained weights, and then update the config and pretrained weights accordingly.
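Loading a downloaded checkpoint into the standalone backbone usually requires remapping its parameter names. The sketch below assumes (this is not confirmed by the text above) that the checkpoint file stores its weights under a "state_dict" entry and that backbone parameters carry a "backbone." prefix, as is common for checkpoints saved from a recognizer that wraps a backbone and a classification head. The helper name extract_backbone_weights and the toy keys are hypothetical.

```python
from collections import OrderedDict


def extract_backbone_weights(state_dict, prefix="backbone."):
    """Keep only entries whose key starts with `prefix`, with the prefix removed.

    Assumption: backbone parameters are stored as "<prefix><param name>".
    Entries without the prefix (e.g. a classification head) are dropped.
    """
    stripped = OrderedDict()
    for key, value in state_dict.items():
        if key.startswith(prefix):
            stripped[key[len(prefix):]] = value
    return stripped


# Toy state dict with made-up keys and values, just to show the remapping.
toy = {"backbone.patch_embed.proj.weight": 1, "cls_head.fc_cls.weight": 2}
print(extract_backbone_weights(toy))

# Hypothetical usage with PyTorch (not executed here):
#   checkpoint = torch.load(
#       "checkpoints/swin_base_patch244_window1677_sthv2.pth", map_location="cpu")
#   model.load_state_dict(extract_backbone_weights(checkpoint["state_dict"]))
```

The model must be constructed with hyperparameters matching the checkpoint (for example, the file name above suggests a patch size of (2, 4, 4) and a window size of (16, 7, 7)); otherwise load_state_dict will report shape mismatches.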

Inference

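import torch
from video_swin_transformer import SwinTransformer3D

# Build the backbone with its default hyperparameters.
model = SwinTransformer3D()
model.eval()  # inference mode: disables dropout-style layers
print(model)

# Dummy clip with layout (batch, channels, frames, height, width).
dummy_x = torch.rand(1, 3, 32, 224, 224)
with torch.no_grad():  # no gradients are needed for a forward pass
    logits = model(dummy_x)
print(logits.shape)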

To run the example script:

python example.py
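As a sanity check on the printed shape, the following sketch estimates the size of the feature map a Swin-style 3D backbone would produce for a given input clip. It rests on assumptions not stated above: the patch embedding divides each axis by the patch size (padding up when needed), each later stage halves height and width only while doubling the channel count, and the model returns a feature map rather than class scores. The function name swin3d_output_shape and the default values are illustrative; read the real values from the config you use.

```python
def swin3d_output_shape(input_shape, patch_size=(2, 4, 4), embed_dim=128, num_stages=4):
    """Estimate the (B, C, D, H, W) output shape of a Swin-style 3D backbone.

    Assumptions: patch embedding pads each axis up to a multiple of the
    patch size; every stage after the first halves H and W (rounding up)
    and doubles the channels; the temporal axis is not downsampled again.
    """
    batch, _, frames, height, width = input_shape
    pd, ph, pw = patch_size

    # Ceiling division models padding up to a multiple of the patch size.
    d = -(-frames // pd)
    h = -(-height // ph)
    w = -(-width // pw)
    c = embed_dim

    # One spatial downsampling step between consecutive stages.
    for _ in range(num_stages - 1):
        h = -(-h // 2)
        w = -(-w // 2)
        c *= 2
    return (batch, c, d, h, w)


print(swin3d_output_shape((1, 3, 32, 224, 224)))
# → (1, 1024, 16, 7, 7)
print(swin3d_output_shape((1, 3, 32, 224, 224), patch_size=(4, 4, 4), embed_dim=96))
# → (1, 768, 8, 7, 7)
```

If the shape printed by the inference snippet differs from this estimate, the model was most likely built with a different patch size, embedding dimension, or number of stages than assumed here.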

Acknowledgement

The code is adapted from the official Video-Swin-Transformer repository. This project is inspired by swin-transformer-pytorch, which provides the simplest code to get started.

Citation

If you find our work useful in your research, please cite:

@article{liu2021video,
  title={Video Swin Transformer},
  author={Liu, Ze and Ning, Jia and Cao, Yue and Wei, Yixuan and Zhang, Zheng and Lin, Stephen and Hu, Han},
  journal={arXiv preprint arXiv:2106.13230},
  year={2021}
}

@article{liu2021Swin,
  title={Swin Transformer: Hierarchical Vision Transformer using Shifted Windows},
  author={Liu, Ze and Lin, Yutong and Cao, Yue and Hu, Han and Wei, Yixuan and Zhang, Zheng and Lin, Stephen and Guo, Baining},
  journal={arXiv preprint arXiv:2103.14030},
  year={2021}
}

Owner

Haofan Wang
