Vision transformers (ViTs) have found only limited practical use in processing images

Last update: Sep 10, 2022

Overview

CXV

Convolutional Xformers for Vision

Vision transformers (ViTs) have found only limited practical use in processing images, in spite of their state-of-the-art accuracy on certain benchmarks. The reasons for their limited use include their need for larger training datasets and more computational resources compared to convolutional neural networks (CNNs), owing to the quadratic complexity of their self-attention mechanism. We propose a linear attention-convolution hybrid architecture, Convolutional X-formers for Vision (CXV), to overcome these limitations. We replace the quadratic attention with linear attention mechanisms, such as Performer, Nyströmformer, and Linear Transformer, to reduce GPU usage. An inductive prior for image data is provided by convolutional sub-layers, thereby eliminating the need for the class token and positional embeddings used by ViTs. CXV outperforms other architectures, including token mixers (e.g., ConvMixer, FNet, and MLP-Mixer), transformer models (e.g., ViT, CCT, CvT, and hybrid Xformers), and ResNets, for image classification in scenarios with limited data and GPU resources.
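
To illustrate why replacing quadratic attention with linear attention reduces resource usage, here is a minimal NumPy sketch of the kernel trick used by Linear Transformer-style attention. It is not taken from the CXV code: the function names, the `elu(x) + 1` feature map, and the single-head, unbatched shapes are assumptions chosen for illustration. The point is that computing `K^T V` first avoids ever forming the `n x n` attention matrix.

```python
import numpy as np


def feature_map(x):
    # Assumed feature map: elu(x) + 1, which keeps every feature strictly positive.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def linear_attention(q, k, v):
    # q, k: (n, d); v: (n, d_v). Cost grows linearly with the number of tokens n.
    q, k = feature_map(q), feature_map(k)
    kv = k.T @ v                    # (d, d_v) summary of keys and values
    z = q @ k.sum(axis=0)           # (n,) per-query normaliser
    return (q @ kv) / z[:, None]


def quadratic_reference(q, k, v):
    # Same result, but materialises the full (n, n) weight matrix.
    q, k = feature_map(q), feature_map(k)
    scores = q @ k.T
    weights = scores / scores.sum(axis=1, keepdims=True)
    return weights @ v


rng = np.random.default_rng(0)
q = rng.standard_normal((16, 8))
k = rng.standard_normal((16, 8))
v = rng.standard_normal((16, 4))
out = linear_attention(q, k, v)
```

Both functions return the same values up to floating-point error; only the order of the matrix products differs. Performer and Nyströmformer approximate softmax attention by other means, so this sketch corresponds only to the Linear Transformer variant.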

Models:

CNV - Convolutional Nyströmformer for Vision
CPV - Convolutional Performer for Vision
CLTV - Convolutional Linear Transformer for Vision

Owner

Cloudwalker