SeqFormer: a Frustratingly Simple Model for Video Instance Segmentation

Overview

SeqFormer: a Frustratingly Simple Model for Video Instance Segmentation

SeqFormer

PWC

SeqFormer: a Frustratingly Simple Model for Video Instance Segmentation

Junfeng Wu, Yi Jiang, Wenqing Zhang, Xiang Bai, Song Bai

arXiv 2112.08275

Abstract

In this work, we present SeqFormer, a frustratingly simple model for video instance segmentation. SeqFormer follows the principle of vision transformer that models instance relationships among video frames. Nevertheless, we observe that a stand-alone instance query suffices for capturing a time sequence of instances in a video, but attention mechanisms should be done with each frame independently. To achieve this, SeqFormer locates an instance in each frame and aggregates temporal information to learn a powerful representation of a video-level instance, which is used to predict the mask sequences on each frame dynamically. Instance tracking is achieved naturally without tracking branches or post-processing. On the YouTube-VIS dataset, SeqFormer achieves 47.4 AP with a ResNet-50 backbone and 49.0 AP with a ResNet-101 backbone without bells and whistles. Such achievement significantly exceeds the previous state-of-the-art performance by 4.6 and 4.4, respectively. In addition, integrated with the recently-proposed Swin transformer, SeqFormer achieves a much higher AP of 59.3. We hope SeqFormer could be a strong baseline that fosters future research in video instance segmentation, and in the meantime, advances this field with a more robust, accurate, neat model.

Visualization results on YouTube-VIS 2019 valid set

Installation

First, clone the repository locally:

git clone https://github.com/wjf5203/SeqFormer.git

Then, install PyTorch 1.7 and torchvision 0.8.

conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 -c pytorch

Install dependencies and pycocotools for VIS:

pip install -r requirements.txt
pip install git+https://github.com/youtubevos/cocoapi.git#"egg=pycocotools&subdirectory=PythonAPI"

Compiling CUDA operators:

cd ./models/ops
sh ./make.sh
# unit test (should see all checking is True)
python test.py

Data Preparation

Download and extract 2019 version of YoutubeVIS train and val images with annotations from CodeLab or YouTubeVIS, and download COCO 2017 datasets. We expect the directory structure to be the following:

SeqFormer
├── datasets
│   ├── coco_keepfor_ytvis19.json
...
ytvis
├── train
├── val
├── annotations
│   ├── instances_train_sub.json
│   ├── instances_val_sub.json
coco
├── train2017
├── val2017
├── annotations
│   ├── instances_train2017.json
│   ├── instances_val2017.json

The modified coco annotations 'coco_keepfor_ytvis19.json' for joint training can be downloaded from [google].

Model zoo

Ablation model

Train on YouTube-VIS 2019, evaluate on YouTube-VIS 2019.

Model AP AP50 AP75 AR1 AR10
SeqFormer_ablation [google] 45.1 66.9 50.5 45.6 54.6

YouTube-VIS model

Train on YouTube-VIS 2019 and COCO, evaluate on YouTube-VIS 2019 val set.

Model AP AP50 AP75 AR1 AR10 Pretrain
SeqFormer_r50 [google] 47.4 69.8 51.8 45.5 54.8 weight
SeqFormer_r101 [google] 49.0 71.1 55.7 46.8 56.9 weight
SeqFormer_x101 [google] 51.2 75.3 58.0 46.5 57.3 weight
SeqFormer_swin_L [google] 59.3 82.1 66.4 51.7 64.4 weight

Training

We performed the experiment on NVIDIA Tesla V100 GPU. All models of SeqFormer are trained with total batch size of 32.

To train SeqFormer on YouTube-VIS 2019 with 8 GPUs , run:

GPUS_PER_NODE=8 ./tools/run_dist_launch.sh 8 ./configs/r50_seqformer_ablation.sh

To train SeqFormer on YouTube-VIS 2019 and COCO 2017 jointly, run:

GPUS_PER_NODE=8 ./tools/run_dist_launch.sh 8 ./configs/r50_seqformer.sh

To train SeqFormer_swin_L on multiple nodes, run:

On node 1:

MASTER_ADDR=
   
     NODE_RANK=0 GPUS_PER_NODE=8 ./tools/run_dist_launch.sh 16 ./configs/swin_seqformer.sh

   

On node 2:

MASTER_ADDR=
   
     NODE_RANK=1 GPUS_PER_NODE=8 ./tools/run_dist_launch.sh 16 ./configs/swin_seqformer.sh

   

Inference & Evaluation

Evaluating on YouTube-VIS 2019:

python3 inference.py  --masks --backbone [backbone] --model_path /path/to/model_weights --save_path results.json 

To get quantitative results, please zip the json file and upload to the codalab server.

Citation

@article{wu2021seqformer,
      title={SeqFormer: a Frustratingly Simple Model for Video Instance Segmentation}, 
      author={Junfeng Wu and Yi Jiang and Wenqing Zhang and Xiang Bai and Song Bai},
      journal={arXiv preprint arXiv:2112.08275},
      year={2021},
}

Acknowledgement

This repo is based on Deformable DETR and VisTR. Thanks for their wonderful works.

Owner
Junfeng Wu
PhD student, Huazhong University of Science and Technology, Computer Vision
Junfeng Wu
A fast Protein Chain / Ligand Extractor and organizer.

Are you tired of using visualization software, or full blown suites just to separate protein chains / ligands ? Are you tired of organizing the mess o

Amine Abdz 9 Nov 06, 2022
🎃 Core identification module of AI powerful point reading system platform.

ppReader-Kernel Intro Core identification module of AI powerful point reading system platform. Usage 硬件: Windows10、GPU:nvdia GTX 1060 、普通RBG相机 软件: con

CrashKing 1 Jan 11, 2022
Mouse Brain in the Model Zoo

Deep Neural Mouse Brain Modeling This is the repository for the ongoing deep neural mouse modeling project, an attempt to characterize the representat

Colin Conwell 15 Aug 22, 2022
A map update dataset and benchmark

MUNO21 MUNO21 is a dataset and benchmark for machine learning methods that automatically update and maintain digital street map datasets. Previous dat

16 Nov 30, 2022
Reference implementation for Structured Prediction with Deep Value Networks

Deep Value Network (DVN) This code is a python reference implementation of DVNs introduced in Deep Value Networks Learn to Evaluate and Iteratively Re

Michael Gygli 55 Feb 02, 2022
Transformer part of 12th place solution in Riiid! Answer Correctness Prediction

kaggle_riiid Transformer part of 12th place solution in Riiid! Answer Correctness Prediction. Please see here for more information. Execution You need

Sakami Kosuke 2 Apr 23, 2022
LAnguage Model Analysis

LAMA: LAnguage Model Analysis LAMA is a probe for analyzing the factual and commonsense knowledge contained in pretrained language models. The dataset

Meta Research 960 Jan 08, 2023
Tensorflow implementation of our method: "Triangle Graph Interest Network for Click-through Rate Prediction".

TGIN Tensorflow implementation of our method: "Triangle Graph Interest Network for Click-through Rate Prediction". Files in the folder dataset/ electr

Alibaba 21 Dec 21, 2022
SurvITE: Learning Heterogeneous Treatment Effects from Time-to-Event Data

SurvITE: Learning Heterogeneous Treatment Effects from Time-to-Event Data SurvITE: Learning Heterogeneous Treatment Effects from Time-to-Event Data Au

14 Nov 28, 2022
Learning from Synthetic Humans, CVPR 2017

Learning from Synthetic Humans (SURREAL) Gül Varol, Javier Romero, Xavier Martin, Naureen Mahmood, Michael J. Black, Ivan Laptev and Cordelia Schmid,

Gul Varol 538 Dec 18, 2022
Code for "Solving Graph-based Public Good Games with Tree Search and Imitation Learning"

Code for "Solving Graph-based Public Good Games with Tree Search and Imitation Learning" This is the code for the paper Solving Graph-based Public Goo

Victor-Alexandru Darvariu 3 Dec 05, 2022
A two-stage U-Net for high-fidelity denoising of historical recordings

A two-stage U-Net for high-fidelity denoising of historical recordings Official repository of the paper (not submitted yet): E. Moliner and V. Välimäk

Eloi Moliner Juanpere 57 Jan 05, 2023
AniGAN: Style-Guided Generative Adversarial Networks for Unsupervised Anime Face Generation

AniGAN: Style-Guided Generative Adversarial Networks for Unsupervised Anime Face Generation AniGAN: Style-Guided Generative Adversarial Networks for U

Bing Li 81 Dec 14, 2022
Code for the paper "There is no Double-Descent in Random Forests"

Code for the paper "There is no Double-Descent in Random Forests" This repository contains the code to run the experiments for our paper called "There

2 Jan 14, 2022
A library for researching neural networks compression and acceleration methods.

A library for researching neural networks compression and acceleration methods.

Intel Labs 100 Dec 29, 2022
Prevent `CUDA error: out of memory` in just 1 line of code.

🐨 Koila Koila solves CUDA error: out of memory error painlessly. Fix it with just one line of code, and forget it. 🚀 Features 🙅 Prevents CUDA error

RenChu Wang 1.7k Jan 02, 2023
Entity-Based Knowledge Conflicts in Question Answering.

Entity-Based Knowledge Conflicts in Question Answering Run Instructions | Paper | Citation | License This repository provides the Substitution Framewo

Apple 35 Oct 19, 2022
A self-supervised 3D representation learning framework named viewpoint bottleneck.

Pointly-supervised 3D Scene Parsing with Viewpoint Bottleneck Paper Created by Liyi Luo, Beiwen Tian, Hao Zhao and Guyue Zhou from Institute for AI In

63 Aug 11, 2022
RuDOLPH: One Hyper-Modal Transformer can be creative as DALL-E and smart as CLIP

[Paper] [Хабр] [Model Card] [Colab] [Kaggle] RuDOLPH 🦌 🎄 ☃️ One Hyper-Modal Transformer can be creative as DALL-E and smart as CLIP Russian Diffusio

AI Forever 232 Jan 04, 2023
Data-depth-inference - Data depth inference with python

Welcome! This readme will guide you through the use of the code in this reposito

Marco 3 Feb 08, 2022