PyTorch Implementation of STANet+ and STANet

Last update: Nov 29, 2022

Overview

V2: Weakly Supervised Visual-Auditory Saliency Detection with Multigranularity Perception (arXiv)

V1: From Semantic Categories to Fixations: A Novel Weakly-Supervised Visual-Auditory Saliency Detection Approach (CVPR 2021)

Introduction

This repository contains the source code, results, and evaluation toolbox of STANet+ (V2), which is the journal extension of our paper STANet (V1) published at CVPR 2021.
Compared with our conference version STANet (V1), STANet+ has been extended in two distinct aspects.
First, building on the multisource and multiscale perspectives adopted by the CVPR version (V1), we provide a deeper insight into the relationship between multigranularity perception (Fig. 2) and real human attention in visual-auditory environments.
Second, without using any complex networks, we provide an elegant framework that complementarily integrates multisource, multiscale, and multigranular information (Fig. 1) to formulate pseudofixations that are highly consistent with the real ones. Apart from achieving a significant performance gain, this work also provides a comprehensive solution for mimicking multimodal attention.

Figure 1: STANet+ mainly focuses on devising a weakly supervised approach for the spatial-temporal-audio (STA) fixation prediction task, where the key innovation is that, as one of the first attempts, we automatically convert semantic category tags to pseudofixations via the newly proposed selective class activation mapping (SCAM) and the upgraded version SCAM+ that has been additionally equipped with the multigranularity perception ability. The obtained pseudofixations can be used as the learning objective to guide knowledge distillation to teach two individual fixation prediction networks (i.e., STA and STA+), which jointly enable generic video fixation prediction without requiring any video tags.

Figure 2: Some representative "fixation shifting" cases, where additional multigranularity information (i.e., long-/cross-term information) was shown before collecting fixations in A_SRC. Comparing A_FIX0, A_FIX1, and A_FIX2, we can easily notice that the multigranularity information draws human attention to the most meaningful objects and makes the fixations more focused.

Dependencies

Windows 10
NVIDIA GeForce RTX 2070 SUPER and NVIDIA GeForce GTX 1080 Ti
Python 3.6.4
MATLAB R2016b
PyTorch 1.8.0
soundmodel

Preparation

Download the official pretrained visual and audio models:

Visual: resnext101_32x8d, vgg16
Audio: vggsound, loaded with net = torch.load('vggsound_netvlad').

Download the training and testing datasets:

Training dataset: AVE (Audio-Visual Event Localization).
Testing datasets: AVAD, DIEM, SumMe, ETMD, Coutrot.

Training

Note
We use the Fourier transform to turn the audio into features for the audio stream input. Therefore, you first need to run audiostft.py to convert the audio files (.wav) into audio features (.h5).
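To give a rough idea of what such a conversion involves, the following is a minimal sketch of a short-time Fourier transform. It is not the repository's audiostft.py: the function name `stft_log_magnitude`, the window length, the hop size, and the log-magnitude scaling are all assumptions made for illustration, and writing the result to an .h5 file is left out.

```python
import numpy as np

def stft_log_magnitude(signal, win_length=512, hop_length=128):
    """Return a (freq_bins, frames) log-magnitude spectrogram of a 1-D signal.

    All parameter values here are illustrative assumptions.
    """
    signal = np.asarray(signal, dtype=np.float64)
    if signal.size < win_length:
        # Zero-pad short clips so that at least one frame exists.
        signal = np.pad(signal, (0, win_length - signal.size))
    window = np.hanning(win_length)
    n_frames = 1 + (signal.size - win_length) // hop_length
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack([
        signal[i * hop_length: i * hop_length + win_length] * window
        for i in range(n_frames)
    ])
    spectrum = np.fft.rfft(frames, axis=1)   # (frames, freq_bins)
    return np.log1p(np.abs(spectrum)).T      # (freq_bins, frames)

# Example: one second of a 440 Hz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
features = stft_log_magnitude(np.sin(2 * np.pi * 440 * t))
print(features.shape)
```

Each column of `features` describes the frequency content of one short time window, which is the kind of 2-D array an audio network can consume.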

Step 1. SCAM training

Coarse: Separately train the S_coarse, SA_coarse, and ST_coarse branches. The coarse stage performs coarse localization, so the size is set to 256 to ensure object-wise localization accuracy.
Fine: Separately re-train the S_fine, SA_fine, and ST_fine branches. The fine stage performs fine localization, so the size is set to 356 to ensure region-wise localization exactness.

Step 2. SCAM+ training

S+: Separately train the S+_short, S+_long, and S+_cross branches. Because this is a frame-wise relational reasoning network, the branches share the same network, so only the source of the input data needs to change.
SA+: Separately train the SA+_long and SA+_cross branches.
ST+: Separately train the ST+_short, ST+_long, and ST+_cross branches.
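Since Steps 1 and 2 involve many separately trained branches, it can help to enumerate them before launching any runs. The sketch below only restates the branch names and sizes listed above; the dictionary layout and the helper `training_runs` are hypothetical and do not correspond to any script in this repository.

```python
# Branch names and sizes come from the steps above;
# the data layout and helper name are hypothetical.
SCAM_STAGES = {
    "coarse": {"size": 256, "branches": ["S_coarse", "SA_coarse", "ST_coarse"]},
    "fine": {"size": 356, "branches": ["S_fine", "SA_fine", "ST_fine"]},
}
SCAM_PLUS_BRANCHES = {
    "S+": ["short", "long", "cross"],
    "SA+": ["long", "cross"],
    "ST+": ["short", "long", "cross"],
}

def training_runs():
    """List every separately trained branch as (step, branch_name, size)."""
    runs = []
    for stage in SCAM_STAGES.values():
        for branch in stage["branches"]:
            runs.append(("SCAM", branch, stage["size"]))
    for prefix, terms in SCAM_PLUS_BRANCHES.items():
        for term in terms:
            # The size used for SCAM+ is not stated above, so it is left as None.
            runs.append(("SCAM+", f"{prefix}_{term}", None))
    return runs

for step, branch, size in training_runs():
    print(step, branch, size)
```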

Step 3. Pseudo-GT generation

For convenient matrix processing and visualization, we use MATLAB R2016b for the inter-frame smoothing of the coarse localization results and for the post-processing of the pseudo-GT data.
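For readers without MATLAB, the idea of inter-frame smoothing can be illustrated in Python. This is only a sketch under stated assumptions: the repository does this step in MATLAB, the exact smoothing rule is not described here, and the moving average, the `radius` parameter, and the function name `smooth_over_time` are all hypothetical choices.

```python
import numpy as np

def smooth_over_time(maps, radius=1):
    """Average each frame's map with its neighbours within `radius` frames.

    A plain temporal moving average, assumed here for illustration only.
    """
    maps = np.asarray(maps, dtype=np.float64)   # (frames, height, width)
    smoothed = np.empty_like(maps)
    for t in range(maps.shape[0]):
        lo = max(0, t - radius)
        hi = min(maps.shape[0], t + radius + 1)
        smoothed[t] = maps[lo:hi].mean(axis=0)
    return smoothed

# Example: a response present in a single frame is spread over its neighbours.
maps = np.zeros((5, 2, 2))
maps[2] = 1.0
smoothed = smooth_over_time(maps)
print(smoothed[:, 0, 0])
```

Smoothing of this kind suppresses single-frame flicker in the localization maps before they are turned into pseudo-GT.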

Step 4. STA and STA+ training

Train the STA and STA+ models on the AVE video frames with the generated pseudo-GT.

Testing

Step 1. Use audiostft.py to convert the audio files (.wav) into audio features (.h5).

Step 2. Test the STA and STA+ networks, then fuse their test results to generate the final saliency results (STANet+).
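The fusion rule itself is not spelled out in this README. As a purely illustrative sketch, assuming a weighted average of two min-max normalized maps (the `weight` parameter and the function names `minmax` and `fuse` are hypothetical), fusing the outputs of two networks could look like this:

```python
import numpy as np

def minmax(x):
    """Rescale an array to [0, 1]; a constant array maps to all zeros."""
    x = np.asarray(x, dtype=np.float64)
    span = x.max() - x.min()
    return np.zeros_like(x) if span == 0 else (x - x.min()) / span

def fuse(sta_map, sta_plus_map, weight=0.5):
    """Blend two saliency maps and rescale the result to [0, 1].

    Weighted averaging is an assumed fusion rule, not the authors' method.
    """
    fused = weight * minmax(sta_map) + (1 - weight) * minmax(sta_plus_map)
    return minmax(fused)

# Example with two tiny 2x2 maps.
a = np.array([[0.0, 2.0], [4.0, 8.0]])
b = np.array([[1.0, 1.0], [1.0, 3.0]])
fused = fuse(a, b)
print(fused)
```

Normalizing each map first keeps one network from dominating the result simply because its outputs have a larger numeric range.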

Model weight files for STANet+, STANet, and AudioSwitch:
(Baidu Netdisk, code: 6afo).

Evaluation

We use the evaluation code released with the STAViS paper [1] for fair comparisons.

You may need to revise the algorithms, data_root, and maps_root variables defined in main.m.
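The reported numbers come from the MATLAB evaluation code mentioned above. For orientation only, the sketch below shows what two metrics commonly used in fixation prediction compute: the linear correlation coefficient (CC) and the normalized scanpath saliency (NSS). The function names and the toy inputs are assumptions, and this is not a substitute for the MATLAB toolbox.

```python
import numpy as np

def cc(saliency, density):
    """Pearson correlation between a predicted map and a fixation density map.

    Undefined (division by zero) if either map is constant.
    """
    s = np.asarray(saliency, dtype=np.float64)
    g = np.asarray(density, dtype=np.float64)
    s = s - s.mean()
    g = g - g.mean()
    return float((s * g).sum() / np.sqrt((s ** 2).sum() * (g ** 2).sum()))

def nss(saliency, fixations):
    """Mean z-scored saliency at fixated pixels (binary `fixations` mask)."""
    s = np.asarray(saliency, dtype=np.float64)
    z = (s - s.mean()) / s.std()
    return float(z[np.asarray(fixations, dtype=bool)].mean())

# Toy example: the prediction peaks exactly where the fixation lies.
pred = np.array([[0.0, 1.0], [2.0, 3.0]])
fix = np.array([[0, 0], [0, 1]])
print(cc(pred, pred), nss(pred, fix))
```

For both metrics, a higher value means the predicted map agrees better with the recorded fixations.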

We provide the saliency maps of our models and of the compared state-of-the-art methods:

(STANet+, STANet, ITTI, GBVS, SCLI, AWS-D, SBF, CAM, GradCAM, GradCAMpp, SGradCAMpp, xGradCAM, SSCAM, ScoCAM, LCAM, ISCAM, ACAM, EGradCAM, ECAM, SPG, VUNP, WSS, MWS, WSSA).
(Baidu Netdisk, code: 6afo).

Quantitative comparisons:

Quantitative comparisons between our method and other fully-/weakly-/un-supervised methods on 6 datasets. Bold marks the best result; ↑ denotes that the higher the score, the better the performance.

Qualitative comparisons:

Qualitative results of our method and eight representative saliency models: ITTI, GBVS, SCLI, SBF, AWS-D, WSS, MWS, WSSA. Our method handles various challenging scenes well and produces more accurate results than its competitors.

References

[1] Tsiami, A., Koutras, P., Maragos, P. STAViS: Spatio-Temporal AudioVisual Saliency Network. CVPR 2020. https://openaccess.thecvf.com/content_CVPR_2020/papers/Tsiami_STAViS_Spatio-Temporal_AudioVisual_Saliency_Network_CVPR_2020_paper.pdf
[2] Tian, Y., Shi, J., Li, B., Duan, Z., Xu, C. Audio-Visual Event Localization in Unconstrained Videos. ECCV 2018. https://openaccess.thecvf.com/content_ECCV_2018/papers/Yapeng_Tian_Audio-Visual_Event_Localization_ECCV_2018_paper.pdf
[3] Chen, H., Xie, W., Vedaldi, A., Zisserman, A. VGGSound: A Large-Scale Audio-Visual Dataset. ICASSP 2020. https://www.robots.ox.ac.uk/~vgg/publications/2020/Chen20/chen20.pdf

Citation

If you find this work useful for your research, please consider citing the following paper:

@InProceedings{Wang_2021_CVPR,  
    author    = {Wang, Guotao and Chen, Chenglizhao and Fan, Deng-Ping and Hao, Aimin and Qin, Hong},
    title     = {From Semantic Categories to Fixations: A Novel Weakly-Supervised Visual-Auditory Saliency Detection Approach},  
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},  
    month     = {June},  
    year      = {2021},  
    pages     = {15119-15128}  
}  


@misc{wang2021weakly,
    title={Weakly Supervised Visual-Auditory Saliency Detection with Multigranularity Perception}, 
    author={Guotao Wang and Chenglizhao Chen and Dengping Fan and Aimin Hao and Hong Qin},
    year={2021},
    eprint={2112.13697},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}


Owner

GuotaoWang
