Sequencer: Deep LSTM for Image Classification

Last update: Dec 16, 2022

Related tags

Audio sequencer

Overview

Sequencer: Deep LSTM for Image Classification

Created by

This repository contains implementation for Sequencer.

Abstract

In recent computer vision research, the advent of the Vision Transformer (ViT) has rapidly revolutionized various architectural design efforts: ViT achieved state-of-the-art image classification performance using self-attention found in natural language processing, and MLP-Mixer achieved competitive performance using simple multi-layer perceptrons. In contrast, several studies have also suggested that carefully redesigned convolutional neural networks (CNNs) can achieve advanced performance comparable to ViT without resorting to these new ideas. Against this background, there is growing interest in what inductive bias is suitable for computer vision. Here we propose Sequencer, a novel and competitive architecture alternative to ViT that provides a new perspective on these issues. Unlike ViTs, Sequencer models long-range dependencies using LSTMs rather than self-attention layers. We also propose a two-dimensional version of Sequencer module, where an LSTM is decomposed into vertical and horizontal LSTMs to enhance performance. Despite its simplicity, several experiments demonstrate that Sequencer performs impressively well: Sequencer2D-L, with 54M parameters, realizes 84.6% top-1 accuracy on only ImageNet-1K. Not only that, we show that it has good transferability and the robust resolution adaptability on double resolution-band.

Schematic diagrams

The overall architecture of Sequencer2D is similar to the typical hierarchical ViT and Visual MLP. It uses Sequencer2D blocks instead of Transformer blocks:

Sequencer2D block replaces the Transformer's self-attention layer with an LSTM-based layer like BiLSTM2D layer:

BiLSTM2D includes a vertical LSTM and a horizontal LSTM:

Model Zoo

We provide our Sequencer models pretrained on ImageNet-1K:

name	arch	Params	FLOPs	[email protected]	download
Sequencer2D-S	`sequencer2d_s`	28M	8.4G	82.3	here
Sequencer2D-M	`sequencer2d_m`	38M	11.1G	82.8	here
Sequencer2D-L	`sequencer2d_l`	54M	16.6G	83.4	here

Usage

Requirements

torch>=1.10.0
torchvision
timm==0.5.4
Pillow
matplotlib
scipy
etc., see requirements.txt

Data preparation

Download and extract ImageNet images. The directory structure should be as follows.

│imagenet/
├──train/
│  ├── n01440764
│  │   ├── n01440764_10026.JPEG
│  │   ├── n01440764_10027.JPEG
│  │   ├── ......
│  ├── ......
├──val/
│  ├── n01440764
│  │   ├── ILSVRC2012_val_00000293.JPEG
│  │   ├── ILSVRC2012_val_00002138.JPEG
│  │   ├── ......
│  ├── ......

Traning

Command line for training Sequencer models on ImageNet from scratch.

./distributed_train.sh 8 /path/to/imagenet --model sequencer2d_s -b 256 -j 8 --opt adamw --epochs 300 --sched cosine --native-amp --img-size 224 --drop-path 0.1 --lr 2e-3 --weight-decay 0.05 --remode pixel --reprob 0.25 --aa rand-m9-mstd0.5-inc1 --smoothing 0.1 --mixup 0.8 --cutmix 1.0 --warmup-lr 1e-6 --warmup-epochs 20

Command line for fine-tuning a pre-trained model at higher resolution.

./distributed_train.sh 8 /path/to/imagenet --model sequencer2d_l --pretrained -b 64 -j 8 --opt adamw --epochs 30 --sched cosine --native-amp --input-size 3 392 392 --img-size 392 --crop-pct 1.0 --drop-path 0.4 --lr 5e-5 --weight-decay 1e-8 --remode pixel --reprob 0.25 --aa rand-m9-mstd0.5-inc1 --smoothing 0.1 --mixup 0.8 --cutmix 1.0 --warmup-epochs 0 --cooldown-epochs 0

Command line for fine-tuning a pre-trained model on a transfer learning dataset.

./distributed_train.sh 4 /path/to/cifar10 --model sequencer2d_s -b 128 -j 4 --num-classes 10 --dataset torch/cifar10 --pretrained --opt adamw --epochs 200 --sched cosine --native-amp --img-size 224 --clip-grad 1 --drop-path 0.1 --lr 0.0001 --weight-decay 1e-4 --remode pixel --aa rand-m9-mstd0.5-inc1 --smoothing 0.1 --mixup 0.8 --cutmix 1.0 --warmup-lr 1e-6 --warmup-epochs 5

Validation

To evaluate our Sequencer models, run:

python validate.py /path/to/imagenet --model sequencer2d_s -b 16 --input-size 3 224 224 --amp

Reference

You may want to cite:

@article{tatsunami2022sequencer,
  title={Sequencer: Deep LSTM for Image Classification},
  author={Tatsunami, Yuki and Taki, Masato},
  journal={arXiv preprint arXiv:2205.01972},
  year={2022}
}

Acknowledgment

This implementation is based on pytorch-image-models by Ross Wightman. We thank for his brilliant work.


We thank Graduate School of Artificial Intelligence and Science, Rikkyo University (Rikkyo AI) which supports us with computational resources, facilities, and others.
AnyTech Co. Ltd. provided valuable comments on the early versions and encouragement. We thank them for their cooperation. In particular, We thank Atsushi Fukuda for organizing discussion opportunities.

You might also like...

Simple-Image-Classification - Simple Image Classification Code (PyTorch)

Simple-Image-Classification Simple Image Classification Code (PyTorch) Yechan Kim This repository contains: Python3 / Pytorch code for multi-class ima

8 Oct 29, 2022

Image Classification - A research on image classification and auto insurance claim prediction, a systematic experiments on modeling techniques and approaches

A research on image classification and auto insurance claim prediction, a systematic experiments on modeling techniques and approaches

0 Jan 23, 2022

A resource for learning about deep learning techniques from regression to LSTM and Reinforcement Learning using financial data and the fitness functions of algorithmic trading

A tour through tensorflow with financial data I present several models ranging in complexity from simple regression to LSTM and policy networks. The s

195 Dec 7, 2022

Deep learning based hand gesture recognition using LSTM and MediaPipie.

Hand Gesture Recognition Deep learning based hand gesture recognition using LSTM and MediaPipie. Demo video using PingPong Robot Files Pretrained mode

24 Nov 11, 2022

Image Captioning using CNN ,LSTM and Attention

Image Captioning using CNN ,LSTM and Attention This is a deeplearning model which tries to summarize an image into a text . Installation Install this

1 Dec 16, 2021

End-to-end image captioning with EfficientNet-b3 + LSTM with Attention

Image captioning End-to-end image captioning with EfficientNet-b3 + LSTM with Attention Model is seq2seq model. In the encoder pretrained EfficientNet

2 Feb 10, 2022

Deep Image Search is an AI-based image search engine that includes deep transfor learning features Extraction and tree-based vectorized search.

Deep Image Search - AI-Based Image Search Engine Deep Image Search is an AI-based image search engine that includes deep transfer learning features Ex

139 Jan 1, 2023

The official implementation of the IEEE S&P`22 paper "SoK: How Robust is Deep Neural Network Image Classification Watermarking".

Watermark-Robustness-Toolbox - Official PyTorch Implementation This repository contains the official PyTorch implementation of the following paper to

49 Dec 19, 2022

Comments

Did you also perform tests on other RNN models?

Hello, thank you for the interesting paper. I am wondering whether you also tried GRUs or regular RNNs? Were LSTMs always better than the other RNN models?
question

opened by lars-nieradzik 2

Sequencer: Deep LSTM for Image Classification

Related tags

Overview

Sequencer: Deep LSTM for Image Classification

Abstract

Schematic diagrams

Model Zoo

Usage

Requirements

Data preparation

Traning

Validation

Reference

Acknowledgment

You might also like...

Simple-Image-Classification - Simple Image Classification Code (PyTorch)

Image Classification - A research on image classification and auto insurance claim prediction, a systematic experiments on modeling techniques and approaches

A resource for learning about deep learning techniques from regression to LSTM and Reinforcement Learning using financial data and the fitness functions of algorithmic trading

Deep learning based hand gesture recognition using LSTM and MediaPipie.

Image Captioning using CNN ,LSTM and Attention

End-to-end image captioning with EfficientNet-b3 + LSTM with Attention

Deep Image Search is an AI-based image search engine that includes deep transfor learning features Extraction and tree-based vectorized search.

The official implementation of the IEEE S&P`22 paper "SoK: How Robust is Deep Neural Network Image Classification Watermarking".

Automatic deep learning for image classification.

paper: Hyperspectral Remote Sensing Image Classification Using Deep Convolutional Capsule Network

Using LSTM write Tang poetry

Bidirectional LSTM-CRF and ELMo for Named-Entity Recognition, Part-of-Speech Tagging and so on.

Bidirectional LSTM-CRF and ELMo for Named-Entity Recognition, Part-of-Speech Tagging and so on.

Tensorflow-based CNN+LSTM trained with CTC-loss for OCR

CNN+LSTM+CTC based OCR implemented using tensorflow.

A small C++ implementation of LSTM networks, focused on OCR.

OHLC Average Prediction of Apple Inc. Using LSTM Recurrent Neural Network

Using multidimensional LSTM neural networks to create a forecast for Bitcoin price

Multi-layer convolutional LSTM with Pytorch

Comments

Did you also perform tests on other RNN models?

Releases(weights)

weights(Apr 28, 2022)

Owner

Yuki Tatsunami

PianoPlayer - Automatic fingering generator for piano scores

[Singing Log] Let your program learn to sing!

Simple, hackable offline speech to text - using the VOSK-API.

This is a realtime voice translator program which gets input from user at any language and converts it to the desired language that the user asks

The official repository for Audio ALBERT

A simple python script to play bell sound in your system infinitely, just for fun and experimental purposes

BART aids transcribe tasks by taking a source audio file and creating automatic repeated loops, allowing transcribers to listen to fragments multiple times

Audio pitch-shifting & re-sampling utility, based on the EMU SP-1200

Enhanced Audio Player for Discord

The project aims to develop a personal-assistant for Windows & Linux-based systems

DCL - An easy to use diacritic library used for diacritic and accent manipulation.

A simple music player, powered by Python, utilising various libraries such as Tkinter and Pygame

🎵 A music bot for discord servers!

Port Hitsuboku Kumi Chinese CVVC voicebank to deepvocal. / 筆墨クミDeepvocal中文音源

Mopidy is an extensible music server written in Python

Sync Toolbox - Python package with reference implementations for efficient, robust, and accurate music synchronization based on dynamic time warping (DTW)

Python wrapper around sox.

A Quick Music Player Made Fully in Python

Code for paper 'Audio-Driven Emotional Video Portraits'.

L-SpEx: Localized Target Speaker Extraction