As-ViT: Auto-scaling Vision Transformers without Training

Last update: Sep 05, 2022


Wuyang Chen, Wei Huang, Xianzhi Du, Xiaodan Song, Zhangyang Wang, Denny Zhou

In ICLR 2022.

Note: This code base implements topology search (Sec. 3.3) and scaling (Sec. 3.4) in PyTorch. Our training code, built on TensorFlow and Keras for TPU, will be released soon.

Overview

We present As-ViT, a framework that unifies automatic architecture design and scaling for ViTs (vision transformers) in a training-free manner.

Highlights:

Training-free ViT Architecture Design: we design a "seed" ViT topology with a training-free search process. The search is extremely fast because it relies on our comprehensive study of ViT network complexity (length distortion), a measure that shows a strong Kendall-tau correlation with ground-truth accuracies.
Training-free ViT Architecture Scaling: starting from the "seed" topology, we automate the scaling rule for ViTs by growing the widths and depths of different ViT layers. A single run generates a series of architectures with different numbers of parameters.
Efficient ViT Training via Progressive Tokenization: we observe that ViTs can tolerate coarse tokenization in early training stages, and we further propose a progressive tokenization strategy that trains ViTs faster and more cheaply.

Left: Length Distortion shows a strong correlation with ViT's accuracy. Middle: Auto scaling rule of As-ViT. Right: Progressive re-tokenization for efficient ViT training.
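
To make the Kendall-tau claim concrete, here is a minimal sketch of how a rank correlation between a training-free score and final accuracy can be measured. The function `kendall_tau` and both lists are illustrative assumptions: the numbers are made up and are not results from the paper, and this is not the evaluation code of this repository.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall tau-a: (concordant pairs - discordant pairs) / all pairs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        # Positive product: both sequences order the pair the same way.
        product = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Made-up values for illustration only; NOT measurements from the paper.
proxy_scores = [0.9, 1.4, 1.1, 2.0, 1.7]       # hypothetical training-free scores
accuracies = [61.0, 64.0, 65.0, 71.2, 68.3]    # hypothetical trained accuracies

tau = kendall_tau(proxy_scores, accuracies)
print(f"Kendall tau = {tau:.2f}")
```

A value close to 1 means the training-free score ranks architectures almost the same way as their trained accuracy does, which is what makes it usable as a search signal.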

Prerequisites

Ubuntu 18.04
Python 3.6.9
CUDA 11.0 (lower versions may work but were not tested)
NVIDIA GPU + cuDNN v7.6

This repository has been tested on a V100 GPU. Configurations may need to be changed on other platforms.

Installation

Clone this repo:

git clone https://github.com/VITA-Group/AsViT.git
cd AsViT

Install dependencies:

pip install -r requirements.txt

1. Seed As-ViT Topology Search

CUDA_VISIBLE_DEVICES=0 python ./search/reinforce.py --save_dir ./output/REINFORCE-imagenet --data_path /path/to/imagenet

This job returns a seed topology. For example, our searched seed topology is 8,2,3|4,1,2|4,1,4|4,1,6|32, which reads as follows:

Stage	Kernel K	Split S	Expansion E
Stage 1	8	2	3
Stage 2	4	1	2
Stage 3	4	1	4
Stage 4	4	1	6

Head: 32
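
The string format above can be decoded with a few lines of Python. This is a minimal sketch based only on the layout shown in the table; `SeedStage` and `parse_seed_topology` are hypothetical names for illustration and are not functions from this repository.

```python
from typing import List, NamedTuple, Tuple


class SeedStage(NamedTuple):
    kernel: int
    split: int
    expansion: int


def parse_seed_topology(arch: str) -> Tuple[List[SeedStage], int]:
    """Split 'K,S,E|K,S,E|...|head' into per-stage fields and the head value."""
    *stage_fields, head_field = arch.strip().split("|")
    stages = []
    for field in stage_fields:
        values = [int(v) for v in field.split(",")]
        if len(values) != 3:
            raise ValueError(f"expected 'kernel,split,expansion', got {field!r}")
        stages.append(SeedStage(*values))
    return stages, int(head_field)


# The seed topology quoted in this README.
stages, head = parse_seed_topology("8,2,3|4,1,2|4,1,4|4,1,6|32")
for index, stage in enumerate(stages, start=1):
    print(f"Stage {index}: kernel={stage.kernel} "
          f"split={stage.split} expansion={stage.expansion}")
print(f"Head: {head}")
```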

2. Scaling

CUDA_VISIBLE_DEVICES=0 python ./search/grow.py --save_dir ./output/GROW-imagenet \
--arch "[arch]" --data_path /path/to/imagenet

Here [arch] is the seed topology (the output of step 1 above). This job returns a series of topologies. For example, our largest topology (As-ViT Large) is 8,2,3,5|4,1,2,2|4,1,4,5|4,1,6,2|32,180, which reads as follows:

Stage	Kernel K	Split S	Expansion E	Layers L
Stage 1	8	2	3	5
Stage 2	4	1	2	2
Stage 3	4	1	4	5
Stage 4	4	1	6	2

Head: 32
Initial Hidden Size: 180
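
Compared with the seed string, a scaled topology adds a layer count to each stage and an initial hidden size to the final field. The sketch below decodes that layout and shows that dropping the added fields recovers the seed string. `ScaledStage`, `ScaledTopology`, `parse_scaled_topology` and `seed_string` are hypothetical helpers written for this illustration, not part of the repository.

```python
from typing import List, NamedTuple


class ScaledStage(NamedTuple):
    kernel: int
    split: int
    expansion: int
    layers: int


class ScaledTopology(NamedTuple):
    stages: List[ScaledStage]
    head: int
    hidden_size: int


def parse_scaled_topology(arch: str) -> ScaledTopology:
    """Split 'K,S,E,L|...|head,hidden' into stages, head, and hidden size."""
    *stage_fields, tail = arch.strip().split("|")
    stages = []
    for field in stage_fields:
        values = [int(v) for v in field.split(",")]
        if len(values) != 4:
            raise ValueError(f"expected 'kernel,split,expansion,layers', got {field!r}")
        stages.append(ScaledStage(*values))
    head, hidden_size = (int(v) for v in tail.split(","))
    return ScaledTopology(stages, head, hidden_size)


def seed_string(topology: ScaledTopology) -> str:
    """Drop the layer counts and hidden size to get the seed-style string."""
    parts = [f"{s.kernel},{s.split},{s.expansion}" for s in topology.stages]
    return "|".join(parts + [str(topology.head)])


# The As-ViT Large topology quoted in this README.
large = parse_scaled_topology("8,2,3,5|4,1,2,2|4,1,4,5|4,1,6,2|32,180")
total_layers = sum(stage.layers for stage in large.stages)
print(f"total layers: {total_layers}, hidden size: {large.hidden_size}")
print(f"seed: {seed_string(large)}")
```

For the example above, the recovered seed string matches the topology from step 1, which is a quick sanity check that a scaled result was grown from the intended seed.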

3. Evaluation

The TensorFlow and Keras code for training on TPU will be released soon.

Citation

@inproceedings{chen2021asvit,
  title={Auto-scaling Vision Transformers without Training},
  author={Chen, Wuyang and Huang, Wei and Du, Xianzhi and Song, Xiaodan and Wang, Zhangyang and Zhou, Denny},
  booktitle={International Conference on Learning Representations},
  year={2022}
}
