You Only 👀 One Sequence

TL;DR: We study the transferability of the vanilla ViT pre-trained on mid-sized ImageNet-1k to the more challenging COCO object detection benchmark.
This project is under active development.

You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection

by Yuxin Fang¹ *, Bencheng Liao¹ *, Xinggang Wang¹ ✉️, Jiemin Fang²,¹, Jiyang Qi¹, Rui Wu³, Jianwei Niu³, Wenyu Liu¹.

¹ School of EIC, HUST, ² Institute of AI, HUST, ³ Horizon Robotics.

(*) equal contribution, (✉️) corresponding author.

arXiv technical report (arXiv 2106.00666)

You Only Look at One Sequence (YOLOS)

The Illustration of YOLOS

Highlights

Directly inherited from ViT (DeiT), YOLOS is not designed to be yet another high-performance object detector, but to unveil the versatility and transferability of Transformer from image recognition to object detection. Concretely, our main contributions are summarized as follows:

We use the mid-sized ImageNet-1k as the sole pre-training dataset, and show that a vanilla ViT (DeiT) can be successfully transferred to perform the challenging object detection task and produce competitive COCO results with the fewest possible modifications, i.e., by only looking at one sequence (YOLOS).
We demonstrate that 2D object detection can be accomplished in a pure sequence-to-sequence manner by taking a sequence of fixed-sized non-overlapping image patches as input. Among existing object detectors, YOLOS utilizes minimal 2D inductive biases. Moreover, it is feasible for YOLOS to perform object detection in a space of any dimension without knowing the exact spatial structure or geometry.
For ViT (DeiT), we find the object detection results are quite sensitive to the pre-train scheme and the detection performance is far from saturating. Therefore the proposed YOLOS can be used as a challenging benchmark task to evaluate different pre-training strategies for ViT (DeiT).
We also discuss the impacts as well as the limitations of prevalent pre-train schemes and model scaling strategies for Transformer in vision through transferring to object detection.
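
To make the "one sequence" input concrete, here is a minimal sketch of how an image can be cut into fixed-size non-overlapping patches and flattened into a sequence. It is not the code of this repository: the function names `image_to_patch_sequence` and `sequence_length` are hypothetical, the image is a plain nested list rather than a tensor, and dropping incomplete border patches is an assumption of the sketch. The patch size of 16 is the one used by DeiT, and the 100 detection tokens are the Det-Toks mentioned in the Visualization section.

```python
def image_to_patch_sequence(image, patch_size):
    """Split an H x W image (nested lists) into non-overlapping patches.

    Returns a list of flattened patches in row-major order. Rows and columns
    that do not fill a whole patch are dropped (an assumption of this sketch).
    """
    height = len(image)
    width = len(image[0]) if height else 0
    rows = height // patch_size
    cols = width // patch_size
    sequence = []
    for r in range(rows):
        for c in range(cols):
            # Flatten one patch_size x patch_size block into a single token.
            patch = [
                image[r * patch_size + i][c * patch_size + j]
                for i in range(patch_size)
                for j in range(patch_size)
            ]
            sequence.append(patch)
    return sequence


def sequence_length(height, width, patch_size=16, num_det_tokens=100):
    """Number of input tokens: patch tokens plus detection tokens."""
    return (height // patch_size) * (width // patch_size) + num_det_tokens


# A 4 x 6 toy image with pixel values 0..23, split into 2 x 2 patches.
toy = [[row * 6 + col for col in range(6)] for row in range(4)]
patches = image_to_patch_sequence(toy, patch_size=2)
print(len(patches))               # 6 patches: 2 rows x 3 columns
print(patches[0])                 # [0, 1, 6, 7]
print(sequence_length(512, 864))  # 32 * 54 patch tokens + 100 = 1828
```

The model itself only ever sees the flat list of tokens, which is why the 2D layout of the image enters mainly through the position embeddings.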

Results

Model	Pre-train Epochs	ViT (DeiT) Weight / Log	Fine-tune Epochs	Eval Size	YOLOS Checkpoint / Log	AP @ COCO val
`YOLOS-Ti`	300	FB	300	512	Baidu Drive, Google Drive / Log	28.7
`YOLOS-S`	200	Baidu Drive, Google Drive / Log	150	800	Baidu Drive, Google Drive / Log	36.1
`YOLOS-S`	300	FB	150	800	Baidu Drive, Google Drive / Log	36.1
`YOLOS-S (dWr)`	300	Baidu Drive, Google Drive / Log	150	800	Baidu Drive, Google Drive / Log	37.6
`YOLOS-B`	1000	FB	150	800	Baidu Drive, Google Drive / Log	42.0

Notes:

The access code for Baidu Drive is yolo.
FB stands for the model weights provided by DeiT (paper, code). Thanks for their wonderful work.
We will add other models in the future. Please stay tuned :)

Requirement

This codebase has been developed with Python 3.6, PyTorch 1.5+ and torchvision 0.6+:

conda install -c pytorch pytorch torchvision

Install pycocotools (for evaluation on COCO) and scipy (for training):

conda install cython scipy
pip install -U 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'
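
A quick way to confirm an environment matches the versions above is a small check like the following. This is only a sketch: the helper names (`version_tuple`, `meets_minimum`, `check_environment`) are hypothetical, and treating Python 3.6 as a lower bound is an assumption, since the text only says the code was developed with that version.

```python
import importlib
import sys
from itertools import takewhile

# Minimum versions taken from the requirement line above.
MINIMUMS = {"torch": "1.5", "torchvision": "0.6"}


def version_tuple(version):
    """Turn a version string such as '1.5.0+cu101' into (1, 5, 0)."""
    core = version.split("+")[0]  # drop local build tags like '+cu101'
    numbers = []
    for piece in core.split("."):
        digits = "".join(takewhile(str.isdigit, piece))
        if not digits:
            break
        numbers.append(int(digits))
    return tuple(numbers)


def meets_minimum(installed, minimum):
    """True if the installed version is at least the minimum version."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_environment():
    """Return a list of human-readable problems; empty means all good."""
    problems = []
    if sys.version_info < (3, 6):
        problems.append("Python 3.6 or newer is assumed")
    for name, minimum in MINIMUMS.items():
        try:
            module = importlib.import_module(name)
        except ImportError:
            problems.append("{} is not installed".format(name))
            continue
        if not meets_minimum(module.__version__, minimum):
            problems.append(
                "{} {} is older than {}".format(name, module.__version__, minimum)
            )
    return problems
```

Calling `check_environment()` returns an empty list when Python, PyTorch and torchvision all satisfy the versions listed above.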

Data preparation

Download and extract COCO 2017 train and val images with annotations from http://cocodataset.org. We expect the directory structure to be the following:

path/to/coco/
  annotations/  # annotation json files
  train2017/    # train images
  val2017/      # val images
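
Before launching a long training run, it can help to verify that the dataset root has the layout shown above. The following is a minimal sketch, not part of the repository; the function name `missing_coco_subdirs` is hypothetical and only the three directory names listed above are checked.

```python
from pathlib import Path

# Subdirectories expected under the COCO root, as listed above.
EXPECTED_SUBDIRS = ("annotations", "train2017", "val2017")


def missing_coco_subdirs(coco_root):
    """Return the expected subdirectories that are absent under coco_root."""
    root = Path(coco_root)
    return [name for name in EXPECTED_SUBDIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    import sys

    # Use the first command-line argument as the dataset root if given.
    root = sys.argv[1] if len(sys.argv) > 1 else "path/to/coco"
    missing = missing_coco_subdirs(root)
    if missing:
        print("Missing under {}: {}".format(root, ", ".join(missing)))
    else:
        print("COCO layout looks complete under {}".format(root))
```

The same path is what the commands below pass as `--coco_path`.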

Training

Before fine-tuning on COCO, you need to download the ImageNet pre-trained model to the /path/to/YOLOS/ directory.

To train the YOLOS-Ti model in the paper, run this command:


python -m torch.distributed.launch \
    --nproc_per_node=8 \
    --use_env main.py \
    --coco_path /path/to/coco \
    --batch_size 2 \
    --lr 5e-5 \
    --epochs 300 \
    --backbone_name tiny \
    --pre_trained /path/to/deit-tiny.pth \
    --eval_size 512 \
    --init_pe_size 800 1333 \
    --output_dir /output/path/box_model

To train the YOLOS-S model in the paper from the DeiT-S pre-trained for 200 epochs, run this command:

python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --coco_path /path/to/coco --batch_size 1 --lr 2.5e-5 --epochs 150 --backbone_name small --pre_trained /path/to/deit-small-200epoch.pth --eval_size 800 --init_pe_size 512 864 --mid_pe_size 512 864 --output_dir /output/path/box_model

To train the YOLOS-S model in the paper from the DeiT-S pre-trained for 300 epochs, run this command:

python -m torch.distributed.launch \
    --nproc_per_node=8 \
    --use_env main.py \
    --coco_path /path/to/coco \
    --batch_size 1 \
    --lr 2.5e-5 \
    --epochs 150 \
    --backbone_name small \
    --pre_trained /path/to/deit-small-300epoch.pth \
    --eval_size 800 \
    --init_pe_size 512 864 \
    --mid_pe_size 512 864 \
    --output_dir /output/path/box_model

To train the YOLOS-S (dWr) model in the paper, run this command:


python -m torch.distributed.launch \
    --nproc_per_node=8 \
    --use_env main.py \
    --coco_path /path/to/coco \
    --batch_size 1 \
    --lr 2.5e-5 \
    --epochs 150 \
    --backbone_name small_dWr \
    --pre_trained /path/to/deit-small-dWr-scale.pth \
    --eval_size 800 \
    --init_pe_size 512 864 \
    --mid_pe_size 512 864 \
    --output_dir /output/path/box_model

To train the YOLOS-B model in the paper, run this command:


python -m torch.distributed.launch \
    --nproc_per_node=8 \
    --use_env main.py \
    --coco_path /path/to/coco \
    --batch_size 1 \
    --lr 2.5e-5 \
    --epochs 150 \
    --backbone_name base \
    --pre_trained /path/to/deit-base.pth \
    --eval_size 800 \
    --init_pe_size 800 1344 \
    --mid_pe_size 800 1344 \
    --output_dir /output/path/box_model
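
The training commands above differ only in a handful of settings, so they can be generated from one table. The sketch below is a convenience wrapper, not part of the repository: `TRAIN_SETTINGS`, `build_train_command` and `global_batch_size` are hypothetical names, and every flag and value is copied from the commands above. The global batch size calculation assumes `--batch_size` is counted per process, as in DETR.

```python
import shlex

# Per-model settings copied from the training commands above.
TRAIN_SETTINGS = {
    "YOLOS-Ti": {"batch_size": 2, "lr": "5e-5", "epochs": 300,
                 "backbone_name": "tiny", "eval_size": 512,
                 "init_pe_size": (800, 1333), "mid_pe_size": None},
    "YOLOS-S": {"batch_size": 1, "lr": "2.5e-5", "epochs": 150,
                "backbone_name": "small", "eval_size": 800,
                "init_pe_size": (512, 864), "mid_pe_size": (512, 864)},
    "YOLOS-S (dWr)": {"batch_size": 1, "lr": "2.5e-5", "epochs": 150,
                      "backbone_name": "small_dWr", "eval_size": 800,
                      "init_pe_size": (512, 864), "mid_pe_size": (512, 864)},
    "YOLOS-B": {"batch_size": 1, "lr": "2.5e-5", "epochs": 150,
                "backbone_name": "base", "eval_size": 800,
                "init_pe_size": (800, 1344), "mid_pe_size": (800, 1344)},
}


def build_train_command(model, coco_path, pre_trained, output_dir,
                        nproc_per_node=8):
    """Return the training command for one model as a list of arguments."""
    s = TRAIN_SETTINGS[model]
    cmd = [
        "python", "-m", "torch.distributed.launch",
        "--nproc_per_node={}".format(nproc_per_node),
        "--use_env", "main.py",
        "--coco_path", coco_path,
        "--batch_size", str(s["batch_size"]),
        "--lr", s["lr"],
        "--epochs", str(s["epochs"]),
        "--backbone_name", s["backbone_name"],
        "--pre_trained", pre_trained,
        "--eval_size", str(s["eval_size"]),
        "--init_pe_size",
    ] + [str(v) for v in s["init_pe_size"]]
    # The YOLOS-Ti command above does not pass --mid_pe_size.
    if s["mid_pe_size"] is not None:
        cmd += ["--mid_pe_size"] + [str(v) for v in s["mid_pe_size"]]
    cmd += ["--output_dir", output_dir]
    return cmd


def global_batch_size(model, nproc_per_node=8):
    """Total images per step, assuming --batch_size is per process."""
    return TRAIN_SETTINGS[model]["batch_size"] * nproc_per_node


cmd = build_train_command("YOLOS-B", "/path/to/coco",
                          "/path/to/deit-base.pth", "/output/path/box_model")
print(" ".join(shlex.quote(arg) for arg in cmd))
print(global_batch_size("YOLOS-Ti"), global_batch_size("YOLOS-B"))  # 16 8
```

Under the per-process assumption, YOLOS-Ti trains with a global batch of 16 at a learning rate of 5e-5 and the other models with a global batch of 8 at 2.5e-5, so the learning rate per image is the same in every command. The returned list can be passed directly to `subprocess.run`.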

Evaluation

To evaluate the YOLOS-Ti model on COCO, run:

python main.py --coco_path /path/to/coco --batch_size 2 --backbone_name tiny --eval --eval_size 512 --init_pe_size 800 1333 --resume /path/to/YOLOS-Ti

To evaluate the YOLOS-S model on COCO, run:

python main.py --coco_path /path/to/coco --batch_size 1 --backbone_name small --eval --eval_size 800 --init_pe_size 512 864 --mid_pe_size 512 864 --resume /path/to/YOLOS-S

To evaluate the YOLOS-S (dWr) model on COCO, run:

python main.py --coco_path /path/to/coco --batch_size 1 --backbone_name small_dWr --eval --eval_size 800 --init_pe_size 512 864 --mid_pe_size 512 864 --resume "/path/to/YOLOS-S(dWr)"

To evaluate the YOLOS-B model on COCO, run:

python main.py --coco_path /path/to/coco --batch_size 1 --backbone_name base --eval --eval_size 800 --init_pe_size 800 1344 --mid_pe_size 800 1344 --resume /path/to/YOLOS-B

Visualization

We have observed some intriguing properties of YOLOS and are working on a notebook to better demonstrate them. Please stay tuned :)

Visualize box prediction and object categories distribution:

To reproduce the visualizations in the paper, you need a YOLOS model fine-tuned on COCO. Run the following command to get the predictions of the 100 Det-Toks on the COCO val split; it generates /path/to/YOLOS/visualization/modelname-eval-800-eval-pred.json:

python cocoval_predjson_generation.py --coco_path /path/to/coco --batch_size 1 --backbone_name small --eval --eval_size 800 --init_pe_size 512 864 --mid_pe_size 512 864 --resume /path/to/yolos-s-model.pth --output_dir ./visualization

To get the ground truth object categories of all images in the COCO val split, run the following command, which generates /path/to/YOLOS/visualization/coco-valsplit-cls-dist.json:

python cocoval_gtclsjson_generation.py --coco_path /path/to/coco --batch_size 1 --output_dir ./visualization

To visualize the distribution of the Det-Toks' bounding boxes and categories, run the following command, which generates .png files in /path/to/YOLOS/visualization/:

python visualize_dettoken_dist.py --visjson /path/to/YOLOS/visualization/modelname-eval-800-eval-pred.json --cococlsjson /path/to/YOLOS/visualization/coco-valsplit-cls-dist.json

Visualize self-attention of the [DetTok] token on the different heads of the last layer:

We are working on a notebook to better demonstrate this. Please stay tuned :)

Acknowledgement ❤️

This project is based on DETR (paper, code), DeiT (paper, code) and timm. Thanks for their wonderful work.

Citation

If you find our paper and code useful in your research, please consider giving a star ⭐ and citation 📝 :

@article{YOLOS,
  title={You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection},
  author={Fang, Yuxin and Liao, Bencheng and Wang, Xinggang and Fang, Jiemin and Qi, Jiyang and Wu, Rui and Niu, Jianwei and Liu, Wenyu},
  journal={arXiv preprint arXiv:2106.00666},
  year={2021}
}


Owner

Hust Visual Learning Team
