Official PyTorch implementation of Less is More: Pay Less Attention in Vision Transformers.

Last update: Jan 01, 2023

Overview

Less is More: Pay Less Attention in Vision Transformers

Official PyTorch implementation of Less is More: Pay Less Attention in Vision Transformers.

By Zizheng Pan, Bohan Zhuang, Haoyu He, Jing Liu and Jianfei Cai.

In our paper, we present a novel Less attention vIsion Transformer (LIT), building upon the fact that convolutions, fully-connected (FC) layers, and self-attentions have almost equivalent mathematical expressions for processing image patch sequences. LIT uses pure multi-layer perceptrons (MLPs) to encode rich local patterns in the early stages while applying self-attention modules to capture longer dependencies in deeper layers. Moreover, we further propose a learned deformable token merging module to adaptively fuse informative patches in a non-uniform manner.

If you use this code for a paper please cite:

@article{pan2021less,
  title={Less is More: Pay Less Attention in Vision Transformers},
  author={Pan, Zizheng and Zhuang, Bohan and He, Haoyu and Liu, Jing and Cai, Jianfei},
  journal={arXiv preprint arXiv:2105.14217},
  year={2021}
}

Usage

First, clone this repository.

git clone https://github.com/MonashAI/LIT

Next, create a conda virtual environment.

# Make sure you have a NVIDIA GPU.
cd LIT/
bash setup_env.sh [conda_install_path] [env_name]

# For example
bash setup_env.sh /home/anaconda3 lit

Note: We use PyTorch 1.7.1 with CUDA 10.1 for all experiments. The setup_env.sh has illustrated all dependencies we used in our experiments. You may want to edit this file to install a different version of PyTorch or any other packages.

Data Preparation

Download the ImageNet 2012 dataset from here, and prepare the dataset based on this script. The file structure should look like:

imagenet
├── train
│   ├── class1
│   │   ├── img1.jpeg
│   │   ├── img2.jpeg
│   │   └── ...
│   ├── class2
│   │   ├── img3.jpeg
│   │   └── ...
│   └── ...
└── val
    ├── class1
    │   ├── img4.jpeg
    │   ├── img5.jpeg
    │   └── ...
    ├── class2
    │   ├── img6.jpeg
    │   └── ...
    └── ...

Model Zoo

We provide baseline LIT models pretrained on ImageNet 2012.

Name	Params (M)	FLOPs (G)	Top-1 Acc. (%)	Model	Log
LIT-Ti	19	3.6	81.1	google drive/github	log
LIT-S	27	4.1	81.5	google drive/github	log
LIT-M	48	8.6	83.0	google drive/github	log
LIT-B	86	15.0	83.4	google drive/github	log

Training and Evaluation

In our implementation, we have different training strategies for LIT-Ti and other LIT models. Therefore, we provide two codebases.

For LIT-Ti, please refer to code_for_lit_ti.

For LIT-S, LIT-M, LIT-B, please refer to code_for_lit_s_m_b.

License

This repository is released under the Apache 2.0 license as found in the LICENSE file.

Acknowledgement

This repository has adopted codes from DeiT, PVT and Swin, we thank the authors for their open-sourced code.

This is an official implementation of CvT: Introducing Convolutions to Vision Transformers.

Introduction This is an official implementation of CvT: Introducing Convolutions to Vision Transformers. We present a new architecture, named Convolut

175 Jan 8, 2023

Official implementation of cosformer-attention in cosFormer: Rethinking Softmax in Attention

cosFormer Official implementation of cosformer-attention in cosFormer: Rethinking Softmax in Attention Update log 2022/2/28 Add core code License This

120 Dec 15, 2022

Official implementation of CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification

CrossViT This repository is the official implementation of CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification. ArXiv If

168 Dec 29, 2022

The official implementation of ELSA: Enhanced Local Self-Attention for Vision Transformer

ELSA: Enhanced Local Self-Attention for Vision Transformer By Jingkai Zhou, Pich

87 Dec 19, 2022

Implementation of SE3-Transformers for Equivariant Self-Attention, in Pytorch.

SE3 Transformer - Pytorch Implementation of SE3-Transformers for Equivariant Self-Attention, in Pytorch. May be needed for replicating Alphafold2 resu

207 Dec 23, 2022

Pytorch implementation for our ICCV 2021 paper "TRAR: Routing the Attention Spans in Transformers for Visual Question Answering".

TRAnsformer Routing Networks (TRAR) This is an official implementation for ICCV 2021 paper "TRAR: Routing the Attention Spans in Transformers for Visu

49 Nov 10, 2022

PyTorch Implementation of CvT: Introducing Convolutions to Vision Transformers

CvT: Introducing Convolutions to Vision Transformers Pytorch implementation of CvT: Introducing Convolutions to Vision Transformers Usage: img = torch

193 Jan 3, 2023

A PyTorch implementation of ViTGAN based on paper ViTGAN: Training GANs with Vision Transformers.

ViTGAN: Training GANs with Vision Transformers A PyTorch implementation of ViTGAN based on paper ViTGAN: Training GANs with Vision Transformers. Refer

127 Dec 23, 2022

Implementation of Transformer in Transformer, pixel level attention paired with patch level attention for image classification, in Pytorch

Transformer in Transformer Implementation of Transformer in Transformer, pixel level attention paired with patch level attention for image c

272 Dec 23, 2022

Comments

Problem about DCN

I have some problems about compiling DCN, thus it is hard for me to use the DCN.deform_conv2d_forward and DCN.deform_conv2d_backward functions. Can 'deform_conv2d_naive' be used instead of this part? Or is there other methods for me to accomplish this DCN.deform_conv2d part?

opened by jarygrace 3
How to use LITNet as a beckbone of object detection

Could you please release the code of using LITNet as a beckbone of RetinaNet as mentioned in your paper? I wonder how to use it as a beckbone for object detection...

opened by sunhuisunhui 2

Official PyTorch implementation of Less is More: Pay Less Attention in Vision Transformers.

Related tags

Overview

Less is More: Pay Less Attention in Vision Transformers

Usage

Data Preparation

Model Zoo

Training and Evaluation

License

Acknowledgement

You might also like...

This is an official implementation of CvT: Introducing Convolutions to Vision Transformers.

Official implementation of cosformer-attention in cosFormer: Rethinking Softmax in Attention

Official implementation of CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification

The official implementation of ELSA: Enhanced Local Self-Attention for Vision Transformer

Implementation of SE3-Transformers for Equivariant Self-Attention, in Pytorch.

Pytorch implementation for our ICCV 2021 paper "TRAR: Routing the Attention Spans in Transformers for Visual Question Answering".

PyTorch Implementation of CvT: Introducing Convolutions to Vision Transformers

A PyTorch implementation of ViTGAN based on paper ViTGAN: Training GANs with Vision Transformers.

Implementation of Transformer in Transformer, pixel level attention paired with patch level attention for image classification, in Pytorch

Comments

Problem about DCN

How to use LITNet as a beckbone of object detection

Releases(v2.1)

v2.1(Mar 10, 2022)

v2.0(Jun 26, 2021)

v1.0(Jun 8, 2021)

Owner

(JMLR'19) A Python Toolbox for Scalable Outlier Detection (Anomaly Detection)

PyTorch implementation of DirectCLR from paper Understanding Dimensional Collapse in Contrastive Self-supervised Learning

[CVPR2021] De-rendering the World's Revolutionary Artefacts

UMT is a unified and flexible framework which can handle different input modality combinations, and output video moment retrieval and/or highlight detection results.

SegNet-like Autoencoders in TensorFlow

A library for building and serving multi-node distributed faiss indices.

Project ArXiv Citation Network

《A-CNN: Annularly Convolutional Neural Networks on Point Clouds》(2019)

Image Segmentation Evaluation

😇A pyTorch implementation of the DeepMoji model: state-of-the-art deep learning model for analyzing sentiment, emotion, sarcasm etc

A face dataset generator with out-of-focus blur detection and dynamic interval adjustment.

Minimal But Practical Image Classifier Pipline Using Pytorch, Finetune on ResNet18, Got 99% Accuracy on Own Small Datasets.

Efficient Sharpness-aware Minimization for Improved Training of Neural Networks

Playable Video Generation

Label-Free Model Evaluation with Semi-Structured Dataset Representations

利用yolov5和TensorRT从0到1实现目标检测的模型训练到模型部署全过程

WarpDrive: Extremely Fast End-to-End Deep Multi-Agent Reinforcement Learning on a GPU

Automatically creates genre collections for your Plex media

Source code for CVPR 2020 paper "Learning to Forget for Meta-Learning"

U-2-Net: U Square Net - Modified for paired image training of style transfer