Localizing-Visual-Sounds-the-Hard-Way

Code and Dataset for "Localizing Visual Sounds the Hard Way".

This repo contains the code and our pretrained model.

Environment

Python 3.6.8
PyTorch 1.3.0

Flickr-SoundNet

We provide the pretrained model here.

To test the model, first download the test data and ground truth from "Learning to Localize Sound Source".

Then run

python test.py --data_path "path to downloaded data with structure below/" --summaries_dir "path to pretrained models" --gt_path "path to ground truth" --testset "flickr"

VGG-Sound Source

We provide the pretrained model here.

To test the model, run

python test.py --data_path "path to downloaded data with structure below/" --summaries_dir "path to pretrained models" --testset "vggss"

(Note: some ground-truth bounding boxes were updated recently, so all results on VGG-SS differ by 2~3% in IoU.)
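For readers unfamiliar with the IoU quantity mentioned in the note, the following is a generic sketch of intersection-over-union between two axis-aligned boxes. It is only an illustration of the quantity: the function name box_iou and the (x_min, y_min, x_max, y_max) box format are assumptions made here, and this is not necessarily how test.py computes its metric.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max).

    Hypothetical helper for illustration; not taken from this repo.
    """
    # Width and height of the overlap region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter

    # Guard against degenerate (zero-area) boxes.
    return inter / union if union > 0 else 0.0
```

Because the union is in the denominator, even a small change to a ground-truth box can shift the score, which is consistent with the difference described in the note.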

The test data for both datasets should be placed in the following structure.

data path
│
└───frames
│   │   image001.jpg
│   │   image002.jpg
│   │
└───audio
    │   audio011.wav
    │   audio012.wav
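
Before running test.py, it can help to confirm that the folder passed as --data_path matches the structure above. The sketch below is a minimal check, assuming only what the tree shows: a frames folder with .jpg files and an audio folder with .wav files. The helper name check_layout is hypothetical and not part of this repo.

```python
from pathlib import Path


def check_layout(data_path):
    """Return a list of problems found under data_path (empty if none).

    Hypothetical helper; it only checks the folder names and file
    extensions shown in the tree above.
    """
    root = Path(data_path)
    problems = []
    for sub, ext in (("frames", ".jpg"), ("audio", ".wav")):
        folder = root / sub
        if not folder.is_dir():
            problems.append(f"missing folder: {sub}")
        elif not any(p.suffix.lower() == ext for p in folder.iterdir()):
            problems.append(f"no {ext} files in {sub}")
    return problems
```

Calling check_layout("path to downloaded data") returns an empty list when both folders exist and contain files of the expected type; otherwise each entry describes what is missing.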

Citation

@InProceedings{Chen21,
              title        = "Localizing Visual Sounds the Hard Way",
              author       = "Honglie Chen and Weidi Xie and Triantafyllos Afouras and Arsha Nagrani and Andrea Vedaldi and Andrew Zisserman",
              booktitle    = "CVPR",
              year         = "2021"}
