DeepViT

This repo is the official implementation of "DeepViT: Towards Deeper Vision Transformer". It is based on the timm library (https://github.com/rwightman/pytorch-image-models) by Ross Wightman.

Introduction

DeepViT was first described in an arXiv paper (arXiv:2103.11886), which observes the attention collapse phenomenon when training deep vision transformers:

In this paper, we show that, unlike convolutional neural networks (CNNs), which can be improved by stacking more convolutional layers, the performance of ViTs saturates quickly when they are scaled to be deeper. More specifically, we empirically observe that this scaling difficulty is caused by the attention collapse issue: as the transformer goes deeper, the attention maps gradually become similar and even nearly identical after certain layers. In other words, the feature maps tend to be identical in the top layers of deep ViT models. This shows that in the deeper layers of ViTs, the self-attention mechanism fails to learn effective concepts for representation learning and hinders the model from getting the expected performance gain. Based on the above observation, we propose a simple yet effective method, named Re-attention, to re-generate the attention maps to increase their diversity at different layers with negligible computation and memory cost. The proposed method makes it feasible to train deeper ViT models with consistent performance improvements through minor modifications to existing ViT models. Notably, when training a deep ViT model with 32 transformer blocks, the Top-1 classification accuracy can be improved by 1.6% on ImageNet.
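
To make the idea of re-generating attention maps more concrete, here is a minimal NumPy sketch. It assumes that Re-attention mixes the per-head attention maps with a learnable head-by-head matrix before applying them to the values. The names `re_attention` and `theta` are illustrative and not taken from this repo, and the row renormalization is only a simple stand-in for whatever normalization layer the actual model uses.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def re_attention(q, k, v, theta):
    """Sketch of attention with cross-head mixing of the attention maps.

    q, k, v: arrays of shape (heads, tokens, dim)
    theta:   (heads, heads) mixing matrix (assumed learnable in a real model)
    Returns the output of shape (heads, tokens, dim) and the mixed maps.
    """
    dim = q.shape[-1]
    # Standard scaled dot-product attention maps, one per head: (H, N, N).
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dim), axis=-1)
    # Re-generate each head's map as a weighted combination of all heads' maps.
    mixed = np.einsum("hg,hnm->gnm", theta, attn)
    # Assumption: renormalize rows so each mixed map still sums to 1 per query.
    # This is only meaningful when the column sums of theta are positive.
    mixed = mixed / mixed.sum(axis=-1, keepdims=True)
    return mixed @ v, mixed

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(4, 5, 8)) for _ in range(3))
theta = rng.uniform(0.1, 1.0, size=(4, 4))  # hypothetical non-negative weights
out, maps = re_attention(q, k, v, theta)
print(out.shape, maps.shape)
```

With `theta` set to the identity matrix, the sketch reduces to ordinary multi-head self-attention, which is one way to see why the change to an existing ViT block is small.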

DeepViT Models

| Model | Re-attention | Top-1 Acc (%) | #Params | #Similar Blocks | Checkpoint |
| --- | --- | --- | --- | --- | --- |
| ViT-16 | NA | 78.88 | 24.5M | 5 | coming soon |
| DeepViT-16 | FC | 79.10 | 24.5M | 0 | coming soon |
| ViT-24 | NA | 79.35 | 36.3M | 11 | coming soon |
| DeepViT-24 | FC | 79.99 | 36.3M | 0 | coming soon |
| ViT-32 | NA | 79.27 | 48.1M | 15 | coming soon |
| DeepViT_t-32 | FC | 80.90 | 48.1M | 0 | coming soon |
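
The "#Similar Blocks" column relates to how alike the attention maps of different blocks are. The sketch below shows one possible way to count such blocks. It is an assumption for illustration, not the metric used to produce the table: it compares each block with the previous one using the mean cosine similarity of matching attention rows, and the `threshold=0.8` default and both function names are hypothetical.

```python
import numpy as np

def attention_similarity(a, b):
    """Mean cosine similarity between matching rows of two attention maps.

    a, b: arrays of shape (heads, tokens, tokens)
    """
    num = (a * b).sum(axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    return float((num / den).mean())

def count_similar_blocks(maps, threshold=0.8):
    """Count blocks whose attention map is close to the previous block's map.

    maps: list of per-block attention maps, each of shape (heads, tokens, tokens)
    threshold: hypothetical cutoff above which two blocks count as similar
    """
    return sum(
        attention_similarity(prev, cur) > threshold
        for prev, cur in zip(maps, maps[1:])
    )

# Toy example: blocks 1 and 2 share a map, block 3 attends differently.
block1 = np.tile(np.eye(3), (2, 1, 1))
block2 = block1.copy()
block3 = np.tile(np.eye(3)[::-1], (2, 1, 1))
print(count_similar_blocks([block1, block2, block3]))
```

In a real model the maps would be collected from each transformer block during a forward pass, for example with forward hooks.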

Citing DeepViT

@article{zhou2021deepvit,
  title={DeepViT: Towards Deeper Vision Transformer},
  author={Zhou, Daquan and Kang, Bingyi and Jin, Xiaojie and Yang, Linjie and Lian, Xiaochen and Hou, Qibin and Feng, Jiashi},
  journal={arXiv preprint arXiv:2103.11886},
  year={2021}
}
