Vision Transformer with Deformable Attention

This repository contains the code for the paper Vision Transformer with Deformable Attention [arXiv].

Introduction

Deformable attention is proposed to model the relations among tokens effectively under the guidance of the important regions in the feature maps. This flexible scheme enables the self-attention module to focus on relevant regions and capture more informative features. On this basis, we present Deformable Attention Transformer (DAT), a general backbone model with deformable attention for both image classification and dense prediction tasks.
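
To make the idea of focusing on relevant regions more concrete, the following is a minimal pure-Python sketch of the sampling step only: reading a feature map at reference points that have been shifted by offsets, using bilinear interpolation. It is not the repository's implementation. The function names `bilinear_sample` and `sample_deformed_points`, the list-of-lists feature map, the hand-written offsets, and the clamping of out-of-range locations are all assumptions made for illustration; in the actual model the offsets would be predicted by a learned network.

```python
import math


def bilinear_sample(feature, y, x):
    """Bilinearly interpolate a 2D feature map (list of rows) at a fractional (y, x)."""
    h, w = len(feature), len(feature[0])
    # Assumption: locations outside the map are clamped to its border.
    y = min(max(float(y), 0.0), h - 1.0)
    x = min(max(float(x), 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = feature[y0][x0] * (1 - wx) + feature[y0][x1] * wx
    bottom = feature[y1][x0] * (1 - wx) + feature[y1][x1] * wx
    return top * (1 - wy) + bottom * wy


def sample_deformed_points(feature, reference_points, offsets):
    """Sample the feature map at each reference point shifted by its offset.

    Both arguments are lists of (y, x) pairs. Here the offsets are given by
    hand; they stand in for values a model would learn.
    """
    return [
        bilinear_sample(feature, ry + dy, rx + dx)
        for (ry, rx), (dy, dx) in zip(reference_points, offsets)
    ]


if __name__ == "__main__":
    feature_map = [[0.0, 1.0], [2.0, 3.0]]
    points = [(0.0, 0.0), (0.0, 0.0)]
    shifts = [(0.5, 0.5), (1.0, 1.0)]
    print(sample_deformed_points(feature_map, points, shifts))
```

In this sketch, the sampled values would then play the role of the features from which keys and values are computed, so that attention is directed at the shifted locations rather than at a fixed grid.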

Dependencies

NVIDIA GPU + CUDA 11.1
Python 3.8 (Anaconda recommended)
PyTorch == 1.8.0
timm
einops
yacs
termcolor
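
Before running the code, it can help to confirm that the Python packages listed above are importable. The snippet below is a small sketch using only the standard library; the helper name `missing_packages` is hypothetical, and it assumes each package is imported under the name shown (with PyTorch imported as `torch`). It does not check versions or CUDA availability.

```python
from importlib.util import find_spec

# Import names assumed for the packages listed in this section.
REQUIRED = ["torch", "timm", "einops", "yacs", "termcolor"]


def missing_packages(names):
    """Return the subset of top-level package names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All listed packages are importable.")
```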

TODO

Classification pretrained models.
Object Detection codebase & models.
Semantic Segmentation codebase & models.
CUDA operators to accelerate sampling operations.

Acknowledgement

This code is developed on top of Swin Transformer. We thank the authors for their efficient and neat codebase.

Citation

If you find our work useful in your research, please consider citing:

@misc{xia2022vision,
      title={Vision Transformer with Deformable Attention}, 
      author={Zhuofan Xia and Xuran Pan and Shiji Song and Li Erran Li and Gao Huang},
      year={2022},
      eprint={2201.00520},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Contact

[email protected]
