Pytorch Implementation for (STANet+ and STANet)

Related tags

Deep LearningSTANet
Overview

Pytorch Implementation for (STANet+ and STANet)

V2-Weakly Supervised Visual-Auditory Saliency Detection with Multigranularity Perception (arxiv), pdf:V2

V1-From Semantic Categories to Fixations: A Novel Weakly-supervised Visual-auditory Saliency Detection Approach (CVPR2021), pdf:V1


Introduction

  • This repository contains the source code, results, and evaluation toolbox of STANet+ (V2), which are the journal extension version of our paper STANet (V1) published at CVPR-2021.
  • Compared our conference version STANet (V2), which has been extended in two distinct aspects.
    First on the basis of multisource and multiscale perspectives which have been adopted by the CVPR version (V1), we have provided a deep insight into the relationship between multigranularity perception (Fig.2) and real human attention behaved in visual-auditory environment.
    Second without using any complex networks, we have provided an elegant framework to complementary integrate multisource, multiscale, and multigranular information (Fig.1) to formulate pseudofixations which are very consistent with the real ones. Apart from achieving significant performance gain, this work also provides a comprehensive solution for mimicking multimodality attention.

Figure 1: STANet+ mainly focuses on devising a weakly supervised approach for the spatial-temporal-audio (STA) fixation prediction task, where the key innovation is that, as one of the first attempts, we automatically convert semantic category tags to pseudofixations via the newly proposed selective class activation mapping (SCAM) and the upgraded version SCAM+ that has been additionally equipped with the multigranularity perception ability. The obtained pseudofixations can be used as the learning objective to guide knowledge distillation to teach two individual fixation prediction networks (i.e., STA and STA+), which jointly enable generic video fixation prediction without requiring any video tags.

Figure 2: Some representative ’fixation shifting’ cases, additional multigranularity information (i.e., long/crossterm information) has been shown before collecting fixations in A_SRC. Clearly, by comparing A_FIX0, A_FIX1, and A _FIX2, we can easily notice that the multigranularity information could draw human attention to the most meaningful objects and make the fixations to be more focused.

Dependencies

  • Windows10
  • NVIDIA GeForce RTX 2070 SUPER & NVIDIA GeForce RTX 1080Ti
  • python 3.6.4
  • Matlab R2016b
  • pytorch 1.8.0
  • soundmodel

Preparation

Downloading the official pretrained visual and audio model

Visual:resnext101_32x8d, vgg16
Audio: vggsound, net = torch.load('vggsound_netvlad').

Downloading the training dataset and testing dataset:

Training dataset: AVE(Audio Visual Event Location).
Testing dataset: AVAD, DIEM, SumMe, ETMD, Coutrot.

Training

Note
We use Fourier-transform to transform audio features as audio stream input, therefore, you firstly need to use the function audiostft.py to convert the audio files (.wav) to get the audio features(.h5).

Step 1. SCAM training

Coarse: Separately training branches of Scoarse, SAcoarse, STcoarse ,it should be noted that the coarse stage is coarse location, so the size is set to 256 to ensure object-wise location accuracy.
Fine: Separately re-training branches of Sfine, SAfine, STfine,it should be noted that the fine stage is a fine location, so the size is set to 356 to ensure regional location exactness.

Step2. SCAM+ training

S+: Separately training branches of S+short, S+long, S+cross, because it is frame-wise relational reasoning network, the network is the same, so we only need to change the source of the input data.
SA+: Separately training branches of SA+long, SA+cross.
ST+: Separately training branches of ST+short, ST+long, ST+cross.

Step 3. pseudoGT generation

In order to facilitate the display of matrix data processing, Matlab2016b was performed in coarse location of inter-frame smoothing and pseudo GT data post-processing.

Step 4. STA and STA+ training

Training the model of STA and STA+ using the AVE video frames with the generated pseudoGT.

Testing

Step 1. Using the function audiostft.py to convert the audio files (.wav) to get the audio features (.h5).
Step 2. Testing STA, STA+ network, fusing the test results to generate final saliency results.(STANet+)

The model weight file STANet+, STANet, AudioSwitch:
(Baidu Netdisk, code:6afo).

Evaluation

We use the evaluation code in the paper of STAVIS for fair comparisons.
You may need to revise the algorithms, data_root, and maps_root defined in the main.m.
We provide the saliency maps of the SOTA:

(STANet+, STANet, ITTI, GBVS, SCLI, AWS-D, SBF, CAM, GradCAM, GradCAMpp, SGradCAMpp, xGradCAM, SSCAM, ScoCAM, LCAM, ISCAM, ACAM, EGradCAM, ECAM, SPG, VUNP, WSS, MWS, WSSA).
(Baidu Netdisk, code:6afo).

Quantitative comparisons:

Qualitative results of our method and eight representative saliency models: ITTI, GBVS, SCLI, SBF, AWS-D, WSS, MWS, WSSA. It can be observed that our method is able to handle various challenging scenes well and produces more accurate results than other competitors.

Qualitative comparisons:

Quantitative comparisons between our method with other fully-/weakly-/un-supervised methods on 6 datasets. Bold means the best result, " denotes the higher the score, the better the performance.

References

[1][Tsiami, A., Koutras, P., Maragos, P.STAViS: Spatio-Temporal AudioVisual Saliency Network. (CVPR 2020).] (https://openaccess.thecvf.com/content_CVPR_2020/papers/Tsiami_STAViS_Spatio-Temporal_AudioVisual_Saliency_Network_CVPR_2020_paper.pdf)
[2][Tian, Y., Shi, J., Li, B., Duan, Z., Xu, C. Audio-Visual Event Localization in Unconstrained Videos. (ECCV 2018)] (https://openaccess.thecvf.com/content_ECCV_2018/papers/Yapeng_Tian_Audio-Visual_Event_Localization_ECCV_2018_paper.pdf)
[3][Chen, H., Xie, W., Vedaldi, A., & Zisserman, A. Vggsound: A Large-Scale Audio-Visual Dataset. (ICASSP 2020)] (https://www.robots.ox.ac.uk/~vgg/publications/2020/Chen20/chen20.pdf)

Citation

If you find this work useful for your research, please consider citing the following paper:

@InProceedings{Wang_2021_CVPR,  
    author    = {Wang, Guotao and Chen, Chenglizhao and Fan, Deng-Ping and Hao, Aimin and Qin, Hong},
    title     = {From Semantic Categories to Fixations: A Novel Weakly-Supervised Visual-Auditory Saliency Detection Approach},  
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},  
    month     = {June},  
    year      = {2021},  
    pages     = {15119-15128}  
}  


@misc{wang2021weakly,
    title={Weakly Supervised Visual-Auditory Saliency Detection with Multigranularity Perception}, 
    author={Guotao Wang and Chenglizhao Chen and Dengping Fan and Aimin Hao and Hong Qin},
    year={2021},
    eprint={2112.13697},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}
Owner
GuotaoWang
GuotaoWang
TensorFlow 2 AI/ML library wrapper for openFrameworks

ofxTensorFlow2 This is an openFrameworks addon for the TensorFlow 2 ML (Machine Learning) library

Center for Art and Media Karlsruhe 96 Dec 31, 2022
A package to predict protein inter-residue geometries from sequence data

trRosetta This package is a part of trRosetta protein structure prediction protocol developed in: Improved protein structure prediction using predicte

Ivan Anishchenko 185 Jan 07, 2023
Repository to run object detection on a model trained on an autonomous driving dataset.

Autonomous Driving Object Detection on the Raspberry Pi 4 Description of Repository This repository contains code and instructions to configure the ne

Ethan 51 Nov 17, 2022
A way to store images in YAML.

YAMLImg A way to store images in YAML. I made this after seeing Roadcrosser's JSON-G because it was too inspiring to ignore this opportunity. Installa

5 Mar 14, 2022
Code repository for Self-supervised Structure-sensitive Learning, CVPR'17

Self-supervised Structure-sensitive Learning (SSL) Ke Gong, Xiaodan Liang, Xiaohui Shen, Liang Lin, "Look into Person: Self-supervised Structure-sensi

Clay Gong 219 Dec 29, 2022
Code for 2021 NeurIPS --- Towards Multi-Grained Explainability for Graph Neural Networks

ReFine: Multi-Grained Explainability for GNNs This is the official code for Towards Multi-Grained Explainability for Graph Neural Networks (NeurIPS 20

Shirley (Ying-Xin) Wu 47 Dec 16, 2022
Minimal PyTorch implementation of YOLOv3

A minimal PyTorch implementation of YOLOv3, with support for training, inference and evaluation.

Erik Linder-Norén 6.9k Dec 29, 2022
DziriBERT: a Pre-trained Language Model for the Algerian Dialect

DziriBERT DziriBERT is the first Transformer-based Language Model that has been pre-trained specifically for the Algerian Dialect. It handles Algerian

117 Jan 07, 2023
The source code of CVPR17 'Generative Face Completion'.

GenerativeFaceCompletion Matcaffe implementation of our CVPR17 paper on face completion. In each panel from left to right: original face, masked input

Yijun Li 313 Oct 18, 2022
The Empirical Investigation of Representation Learning for Imitation (EIRLI)

The Empirical Investigation of Representation Learning for Imitation (EIRLI)

Center for Human-Compatible AI 31 Nov 06, 2022
Tensorflow implementation of the paper "HumanGPS: Geodesic PreServing Feature for Dense Human Correspondences", CVPR 2021.

HumanGPS: Geodesic PreServing Feature for Dense Human Correspondences Tensorflow implementation of the paper "HumanGPS: Geodesic PreServing Feature fo

Google Interns 50 Dec 21, 2022
Simple-Image-Classification - Simple Image Classification Code (PyTorch)

Simple-Image-Classification Simple Image Classification Code (PyTorch) Yechan Kim This repository contains: Python3 / Pytorch code for multi-class ima

Yechan Kim 8 Oct 29, 2022
Keras implementation of PersonLab for Multi-Person Pose Estimation and Instance Segmentation.

PersonLab This is a Keras implementation of PersonLab for Multi-Person Pose Estimation and Instance Segmentation. The model predicts heatmaps and vari

OCTI 160 Dec 21, 2022
An implementation of paper `Real-time Convolutional Neural Networks for Emotion and Gender Classification` with PaddlePaddle.

简介 通过PaddlePaddle框架复现了论文 Real-time Convolutional Neural Networks for Emotion and Gender Classification 中提出的两个模型,分别是SimpleCNN和MiniXception。利用 imdb_crop

8 Mar 11, 2022
Pointer-generator - Code for the ACL 2017 paper Get To The Point: Summarization with Pointer-Generator Networks

Note: this code is no longer actively maintained. However, feel free to use the Issues section to discuss the code with other users. Some users have u

Abi See 2.1k Jan 04, 2023
Generative Flow Networks

Flow Network based Generative Models for Non-Iterative Diverse Candidate Generation Implementation for our paper, submitted to NeurIPS 2021 (also chec

Emmanuel Bengio 381 Jan 04, 2023
The Ludii general game system, developed as part of the ERC-funded Digital Ludeme Project.

The Ludii General Game System Ludii is a general game system being developed as part of the ERC-funded Digital Ludeme Project (DLP). This repository h

Digital Ludeme Project 50 Jan 04, 2023
This is a Machine Learning Based Hand Detector Project, It Uses Machine Learning Models and Modules Like Mediapipe, Developed By Google!

Machine Learning Hand Detector This is a Machine Learning Based Hand Detector Project, It Uses Machine Learning Models and Modules Like Mediapipe, Dev

Popstar Idhant 3 Feb 25, 2022
Synthesize photos from PhotoDNA using machine learning 🌱

Ribosome Synthesize photos from PhotoDNA. See the blog post for more information. Installation Dependencies You can install Python dependencies using

Anish Athalye 112 Nov 23, 2022
Full Resolution Residual Networks for Semantic Image Segmentation

Full-Resolution Residual Networks (FRRN) This repository contains code to train and qualitatively evaluate Full-Resolution Residual Networks (FRRNs) a

Toby Pohlen 274 Oct 27, 2022