[2021][ICCV][FSNet] Full-Duplex Strategy for Video Object Segmentation

Last update: Dec 22, 2022

Overview

Full-Duplex Strategy for Video Object Segmentation (ICCV, 2021)

Authors: Ge-Peng Ji, Keren Fu, Zhe Wu, Deng-Ping Fan*, Jianbing Shen, & Ling Shao

This repository provides code for the paper "Full-Duplex Strategy for Video Object Segmentation", accepted at ICCV 2021 (arXiv version / Chinese translation).
This project is under construction. If you have any questions about our paper or find bugs in this repository, feel free to contact me.
If you use FSNet in your research, please cite this paper (BibTeX).

1. News

[2021/08/24] Upload the training script for video object segmentation.
[2021/08/22] Upload the pre-trained snapshot and the pre-computed results on U-VOS and V-SOD tasks.
[2021/08/20] Release inference code, evaluation code (VSOD).
[2021/07/20] Create GitHub page.

2. Introduction

Why?

Appearance and motion are two important sources of information in video object segmentation (VOS). Previous methods mainly focus on using simplex solutions, lowering the upper bound of feature collaboration among and across these two cues.

Figure 1: Visual comparison between the simplex (i.e., (a) appearance-refined motion and (b) motion-refined appearance) and our full-duplex strategy. In contrast, our FSNet offers a collaborative way to leverage the appearance and motion cues under the mutual restraint of the full-duplex strategy, thus providing more accurate structure details and alleviating the short-term feature drifting issue.

What?

In this paper, we study a novel framework, termed the FSNet (Full-duplex Strategy Network), which designs a relational cross-attention module (RCAM) to achieve bidirectional message propagation across embedding subspaces. Furthermore, the bidirectional purification module (BPM) is introduced to update the inconsistent features between the spatial-temporal embeddings, effectively improving the model's robustness.

Figure 2: The pipeline of our FSNet. The Relational Cross-Attention Module (RCAM) abstracts more discriminative representations between the motion and appearance cues using the full-duplex strategy. Then four Bidirectional Purification Modules (BPM) are stacked to further re-calibrate inconsistencies between the motion and appearance features. Finally, we utilize the decoder to generate our prediction.
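
To make the idea of bidirectional message propagation concrete, here is a minimal NumPy sketch of generic scaled dot-product cross-attention applied in both directions at once. It is not the RCAM implementation from this repository: the function names (cross_attend, full_duplex_step), the residual update, and the feature shapes are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(query_feats, context_feats):
    """Refine query_feats (N, C) with information from context_feats (M, C)."""
    scale = np.sqrt(query_feats.shape[-1])
    weights = softmax(query_feats @ context_feats.T / scale, axis=-1)  # (N, M)
    return query_feats + weights @ context_feats  # residual update (assumption)


def full_duplex_step(appearance, motion):
    """Update both streams together; each attends to the other's input features."""
    new_appearance = cross_attend(appearance, motion)
    new_motion = cross_attend(motion, appearance)
    return new_appearance, new_motion


rng = np.random.default_rng(0)
app = rng.standard_normal((6, 8))  # 6 spatial positions, 8 channels (toy sizes)
mot = rng.standard_normal((6, 8))
app2, mot2 = full_duplex_step(app, mot)
print(app2.shape, mot2.shape)
```

In a simplex design only one of the two cross_attend calls would be made; the sketch shows the difference when both directions are computed from the same inputs.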

How?

By considering the mutual restraint within the full-duplex strategy, our FSNet performs cross-modal feature-passing (i.e., transmission and receiving) simultaneously before the fusion and decoding stage, making it robust to various challenging scenarios (e.g., motion blur, occlusion) in VOS. Extensive experiments on five popular benchmarks (i.e., DAVIS16, FBMS, MCL, SegTrack-V2, and DAVSOD19) show that our FSNet outperforms other state-of-the-art methods on both the VOS and video salient object detection tasks.

Figure 3: Qualitative results on five datasets, including DAVIS16, MCL, FBMS, SegTrack-V2, and DAVSOD19.

3. Usage

How to run inference?

Download the test dataset from Baidu Drive (password: aaw8) or Google Drive and save it at ./dataset/*.
Install the necessary libraries: PyTorch 1.1+, scipy 1.2.2, PIL.
Download the pre-trained weights from Baidu Drive (password: 36lm) or Google Drive. Save the pre-trained weights at ./snapshot/FSNet/2021-ICCV-FSNet-20epoch-new.pth
Run python inference.py to generate the segmentation results.
The DenseCRF post-processing technique we used in the original paper is available here: DSS-CRF.
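
Before launching inference, it can help to confirm that the dataset folder and the weights file are where the steps above expect them. The following is a small sketch, not part of this repository; the helper name missing_inputs is hypothetical, and it only checks that ./dataset exists as a directory, not that its contents are complete.

```python
from pathlib import Path

# Paths quoted from the steps above, relative to the repository root.
DATASET_DIR = Path("dataset")
WEIGHTS = Path("snapshot/FSNet/2021-ICCV-FSNet-20epoch-new.pth")


def missing_inputs(repo_root="."):
    """Return the expected inputs that are absent under repo_root."""
    root = Path(repo_root)
    missing = []
    if not (root / DATASET_DIR).is_dir():
        missing.append(DATASET_DIR.as_posix())
    if not (root / WEIGHTS).is_file():
        missing.append(WEIGHTS.as_posix())
    return missing


if __name__ == "__main__":
    problems = missing_inputs()
    if problems:
        print("Missing before inference:", ", ".join(problems))
    else:
        print("Inputs found; run: python inference.py")
```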

How to train our model from scratch?

Download the training dataset from Baidu Drive (password: u01t) or Google Drive Set1/Google Drive Set2 and save it at ./dataset/*. Our training pipeline consists of three steps:

First, train the model on the combination of a static SOD dataset (i.e., DUTS) with 12,926 samples and the U-VOS datasets (i.e., DAVIS16 & FBMS) with 2,373 samples.
- Set --train_type='pretrain_rgb' and run python train.py in a terminal
Second, train the model on the optical-flow maps of the U-VOS datasets (i.e., DAVIS16 & FBMS).
- Set --train_type='pretrain_flow' and run python train.py in a terminal
Third, train the model on pairs of frames and optical-flow maps from the U-VOS datasets (i.e., DAVIS16 & FBMS).
- Set --train_type='finetune' and run python train.py in a terminal
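
The three stages above can be scripted in order. The sketch below only builds and prints the commands; it assumes that train.py accepts --train_type as a command-line argument (the flag syntax suggests this, but it is not confirmed here; if it is instead a default inside the script, edit it there). The helper name stage_commands is hypothetical.

```python
import sys

# Stage names are the --train_type values listed above, in the stated order.
STAGES = ["pretrain_rgb", "pretrain_flow", "finetune"]


def stage_commands(python=sys.executable, script="train.py"):
    """Build one command (argv list) per training stage."""
    return [[python, script, f"--train_type={stage}"] for stage in STAGES]


if __name__ == "__main__":
    for cmd in stage_commands():
        print(" ".join(cmd))
        # To actually launch a stage, uncomment the next two lines:
        # import subprocess
        # subprocess.run(cmd, check=True)
```

Running the stages one after another with check=True would stop the sequence as soon as one stage fails, so a later stage never starts from an incomplete earlier one.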

4. Benchmark

Unsupervised/Zero-shot Video Object Segmentation (U/Z-VOS) task

NOTE: In the U-VOS, all the prediction results are strictly binary. We only adopt the naive binarization algorithm (i.e., threshold=0.5) in our experiments.
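
As an illustration of the naive binarization mentioned in the note, here is a minimal NumPy sketch. The function name binarize is hypothetical, and treating values strictly greater than the threshold as foreground is an assumption; the repository's own code may handle the boundary differently.

```python
import numpy as np


def binarize(prob_map, threshold=0.5):
    """Turn a [0, 1] probability map into a {0, 1} mask."""
    prob_map = np.asarray(prob_map, dtype=np.float32)
    # Strict '>' is an assumption; values equal to the threshold become background.
    return (prob_map > threshold).astype(np.uint8)


pred = np.array([[0.10, 0.49], [0.51, 0.90]])
mask = binarize(pred)
print(mask.tolist())  # [[0, 0], [1, 1]]
```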

Quantitative results (NOTE: The following results are slightly better than those reported in our conference paper):

Method            mean-J  recall-J  decay-J  mean-F  recall-F  decay-F  T
FSNet (w/ CRF)    0.834   0.945     0.032    0.831   0.902     0.026    0.213
FSNet (w/o CRF)   0.823   0.943     0.033    0.833   0.919     0.028    0.213
Pre-Computed Results: Download the prediction results of FSNet from Baidu Drive (password: ojsl) or Google Drive.
Evaluation Toolbox: We use the standard evaluation toolbox from DAVIS16. (Note that all the pre-computed segmentations are downloaded from this link).
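
For intuition about the J columns: in the DAVIS benchmark, J is the region similarity (Jaccard index, i.e., intersection over union) between a predicted mask and the ground truth, with the mean and recall taken over per-frame scores. The sketch below is not the DAVIS16 toolbox and should not be used to reproduce the reported numbers; the function names are hypothetical, and both the empty-mask convention and the 0.5 recall threshold are stated as assumptions.

```python
import numpy as np


def jaccard(pred_mask, gt_mask):
    """Intersection over union of two binary masks."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treated as a perfect match (assumption)
    return float(np.logical_and(pred, gt).sum() / union)


def mean_and_recall(scores, recall_threshold=0.5):
    """Mean score, and fraction of scores above recall_threshold (assumed 0.5)."""
    scores = np.asarray(scores, dtype=np.float64)
    return float(scores.mean()), float((scores > recall_threshold).mean())


j = jaccard([[1, 1], [0, 0]], [[1, 0], [0, 0]])
print(j)  # 0.5
```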

Video Salient Object Detection (V-SOD) task

NOTE: In the V-SOD, all the prediction results are non-binary.

Pre-Computed Results: Download the prediction results of FSNet from Baidu Drive (password: rgk1) or Google Drive.
Evaluation Toolbox: We use the standard evaluation toolbox from the DAVSOD benchmark.

5. Citation

@inproceedings{ji2021FSNet,
  title={Full-Duplex Strategy for Video Object Segmentation},
  author={Ji, Ge-Peng and Fu, Keren and Wu, Zhe and Fan, Deng-Ping and Shen, Jianbing and Shao, Ling},
  booktitle={IEEE ICCV},
  year={2021}
}

6. Acknowledgements

Many thanks to my collaborator Dr. Zhe Wu, who provided the excellent work SCRN and design inspiration.
