DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification

Last update: Jan 01, 2023

Overview

Created by Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, Cho-Jui Hsieh

This repository contains the PyTorch implementation of DynamicViT.

We introduce a dynamic token sparsification framework that prunes redundant tokens in vision transformers progressively and dynamically, based on the input.
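
As a rough illustration of the idea, the sketch below keeps the highest-scoring tokens at a single pruning stage. It is a framework-free simplification and not the repository's implementation: the function name `prune_tokens`, the `scores` input (standing in for the output of a learned prediction module), and the choice to always keep the first (class) token are assumptions made for illustration.

```python
from typing import List, Sequence


def prune_tokens(tokens: Sequence, scores: Sequence[float], keep_ratio: float) -> List:
    """Keep the class token plus the top-scoring fraction of patch tokens.

    tokens[0] is assumed to be the class token and is always kept (assumption).
    scores[i] is a hypothetical importance score for patch token tokens[i + 1].
    """
    patch_tokens = tokens[1:]
    if len(scores) != len(patch_tokens):
        raise ValueError("expected one score per patch token")
    num_keep = int(len(patch_tokens) * keep_ratio)
    # Rank patch indices by score, highest first, and keep the best num_keep.
    ranked = sorted(range(len(patch_tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:num_keep])  # restore the original token order
    return [tokens[0]] + [patch_tokens[i] for i in kept]


tokens = ["cls", "sky", "dog", "grass", "ear"]
scores = [0.1, 0.9, 0.3, 0.8]
print(prune_tokens(tokens, scores, keep_ratio=0.5))
```

Because the scores depend on the input image, a different image would generally keep a different subset of tokens, which is what makes the pruning dynamic.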

Our code is based on pytorch-image-models, DeiT, and LV-ViT.

[Project Page] [arXiv]

Model Zoo

We provide our DynamicViT models pretrained on ImageNet:

name	arch	rho	Acc@1	Acc@5	FLOPs	url
DynamicViT-256/0.7	`deit_256`	0.7	76.532	93.118	1.3G	Google Drive / Tsinghua Cloud
DynamicViT-384/0.7	`deit_small`	0.7	79.316	94.676	2.9G	Google Drive / Tsinghua Cloud
DynamicViT-LV-S/0.5	`lvvit_s`	0.5	81.970	95.756	3.7G	Google Drive / Tsinghua Cloud
DynamicViT-LV-S/0.7	`lvvit_s`	0.7	83.076	96.252	4.6G	Google Drive / Tsinghua Cloud
DynamicViT-LV-M/0.7	`lvvit_m`	0.7	83.816	96.584	8.5G	Google Drive / Tsinghua Cloud

Usage

Requirements

torch>=1.7.0
torchvision>=0.8.1
timm==0.4.5

Data preparation: download and extract the ImageNet images from http://image-net.org/. The directory structure should be:

│ILSVRC2012/
├──train/
│  ├── n01440764
│  │   ├── n01440764_10026.JPEG
│  │   ├── n01440764_10027.JPEG
│  │   ├── ......
│  ├── ......
├──val/
│  ├── n01440764
│  │   ├── ILSVRC2012_val_00000293.JPEG
│  │   ├── ILSVRC2012_val_00002138.JPEG
│  │   ├── ......
│  ├── ......
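
Before launching a long run, it can help to confirm that the extracted dataset matches this layout. The following is a minimal sketch using only the Python standard library; `check_imagenet_layout` is a hypothetical helper, not part of this repository, and it only checks that each split contains class folders with at least one `.JPEG` file.

```python
from pathlib import Path
from typing import List


def check_imagenet_layout(root: str) -> List[str]:
    """Return a list of problems found in an ILSVRC2012-style directory."""
    problems = []
    for split in ("train", "val"):
        split_dir = Path(root) / split
        if not split_dir.is_dir():
            problems.append(f"missing directory: {split_dir}")
            continue
        class_dirs = [d for d in split_dir.iterdir() if d.is_dir()]
        if not class_dirs:
            problems.append(f"no class folders in: {split_dir}")
        for class_dir in class_dirs:
            # Each class folder (e.g. n01440764) should hold .JPEG images.
            if not any(class_dir.glob("*.JPEG")):
                problems.append(f"no .JPEG files in: {class_dir}")
    return problems


if __name__ == "__main__":
    for problem in check_imagenet_layout("/path/to/ILSVRC2012/"):
        print(problem)
```

An empty result means the basic structure looks right; it does not verify image counts or file integrity.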

Model preparation: download pre-trained DeiT and LV-ViT models for training DynamicViT:

sh download_pretrain.sh

Demo

We provide a Jupyter notebook where you can run the visualization of DynamicViT.

To run the demo, you need to install matplotlib.

Evaluation

To evaluate a pre-trained DynamicViT model on ImageNet val with a single GPU, run:

python infer.py --data-path /path/to/ILSVRC2012/ --arch arch_name --model-path /path/to/model --base_rate 0.7

Training

To train DynamicViT models on ImageNet, run:

DeiT-small

python -m torch.distributed.launch --nproc_per_node=8 --use_env main_dynamic_vit.py  --output_dir logs/dynamic-vit_deit-small --arch deit_small --input-size 224 --batch-size 96 --data-path /path/to/ILSVRC2012/ --epochs 30 --dist-eval --distill --base_rate 0.7

LV-ViT-S

python -m torch.distributed.launch --nproc_per_node=8 --use_env main_dynamic_vit.py  --output_dir logs/dynamic-vit_lvvit-s --arch lvvit_s --input-size 224 --batch-size 64 --data-path /path/to/ILSVRC2012/ --epochs 30 --dist-eval --distill --base_rate 0.7

LV-ViT-M

python -m torch.distributed.launch --nproc_per_node=8 --use_env main_dynamic_vit.py  --output_dir logs/dynamic-vit_lvvit-m --arch lvvit_m --input-size 224 --batch-size 48 --data-path /path/to/ILSVRC2012/ --epochs 30 --dist-eval --distill --base_rate 0.7

You can train models with different keeping ratios by adjusting base_rate. DynamicViT can also achieve comparable performance with only 15 epochs of training (around 0.1% lower accuracy).
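
To get a feel for what a given `base_rate` means, the sketch below computes how many patch tokens would survive each pruning stage. It assumes, following the DynamicViT paper rather than anything stated in this README, that there are three pruning stages whose target keeping ratios are `base_rate`, `base_rate**2`, and `base_rate**3`, and that a 224x224 input with 16x16 patches yields 196 patch tokens. The function name `tokens_per_stage` is hypothetical.

```python
from typing import List


def tokens_per_stage(num_patches: int, base_rate: float, num_stages: int = 3) -> List[int]:
    """Number of patch tokens kept after each pruning stage.

    Assumes stage s (1-based) keeps a fraction base_rate ** s of the
    original patch tokens, rounded down.
    """
    return [int(num_patches * base_rate ** s) for s in range(1, num_stages + 1)]


num_patches = (224 // 16) ** 2  # 14 x 14 = 196 patches (assumed patch size 16)
for rate in (0.5, 0.7):
    print(rate, tokens_per_stage(num_patches, rate))
```

Under these assumptions, a lower `base_rate` discards tokens more aggressively at every stage, trading accuracy for fewer FLOPs, which matches the trend in the Model Zoo table.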

License

MIT License

Citation

If you find our work useful in your research, please consider citing:

@article{rao2021dynamicvit,
  title={DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification},
  author={Rao, Yongming and Zhao, Wenliang and Liu, Benlin and Lu, Jiwen and Zhou, Jie and Hsieh, Cho-Jui},
  journal={arXiv preprint arXiv:2106.02034},
  year={2021}
}
