Code for the paper: Audio-Visual Scene Analysis with Self-Supervised Multisensory Features

Last update: Dec 13, 2022

Overview

[Paper] [Project page]

This repository contains code for the paper:

Andrew Owens, Alexei A. Efros. Audio-Visual Scene Analysis with Self-Supervised Multisensory Features. arXiv, 2018

This release includes code and models for:

On/off-screen source separation: separating the speech of an on-screen speaker from background sounds.
Blind source separation: audio-only source separation using a u-net trained with permutation invariant training (PIT).
Sound source localization: visualizing the parts of a video that correspond to sound-making actions.
Self-supervised audio-visual features: a pretrained 3D CNN that can be used for downstream tasks (e.g. action recognition, source separation).

Setup

Install Python 2.7
Install ffmpeg
Install TensorFlow, e.g. through pip:

pip install tensorflow     # for CPU evaluation only
pip install tensorflow-gpu # for GPU support

We used TensorFlow version 1.8, which can be installed with:

pip install tensorflow-gpu==1.8

Install the other Python dependencies:

pip install numpy matplotlib pillow scipy

Download the pretrained models and sample data

./download_models.sh
./download_sample_data.sh
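
Before running the examples, it can help to confirm that the requirements listed above are visible from Python. The following is a minimal sketch, not part of this repository: the helper names find_executable and check_setup are hypothetical, and it assumes a Unix-like system where the ffmpeg binary is named exactly "ffmpeg". It is written to run under both Python 2.7 and Python 3.

```python
from __future__ import print_function

import importlib
import os
import sys


def find_executable(name):
    """Return the full path of `name` if it is on PATH, else None."""
    for folder in os.environ.get("PATH", "").split(os.pathsep):
        candidate = os.path.join(folder, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def check_setup(modules=("tensorflow", "numpy", "matplotlib", "PIL", "scipy")):
    """Return a dict mapping each requirement to True (found) or False.

    Note that the pillow package is imported under the name PIL.
    """
    status = {"ffmpeg": find_executable("ffmpeg") is not None}
    for module in modules:
        try:
            importlib.import_module(module)
            status[module] = True
        except Exception:
            # Any failure to import is reported as a missing requirement.
            status[module] = False
    return status


if __name__ == "__main__":
    print("Python %d.%d" % sys.version_info[:2])
    for name, ok in sorted(check_setup().items()):
        print("%-12s %s" % (name, "ok" if ok else "MISSING"))
```

This only checks that each package can be imported; it does not verify the TensorFlow version.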

Pretrained audio-visual features

We have provided the features for our fused audio-visual network. These features were learned through self-supervised learning. Please see shift_example.py for a simple example that uses these pretrained features.
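
For intuition about the self-supervised signal, the paper describes training the network to predict whether a video's audio and visual streams are temporally aligned. The sketch below shows one way to cut an aligned audio clip and a shifted (misaligned) clip for the same video window. It is an illustration only and is not taken from shift_example.py; the function name, the toy sample rate, and the clip and shift lengths are all assumptions.

```python
import numpy as np


def aligned_and_shifted_clips(audio, sample_rate, start_sec, clip_sec, shift_sec):
    """Cut two equally long audio clips for one video window.

    Returns (aligned, shifted): `aligned` starts at `start_sec`, while
    `shifted` starts `shift_sec` seconds later, so it no longer matches
    the video frames taken from the same window.
    """
    n = int(round(clip_sec * sample_rate))
    a0 = int(round(start_sec * sample_rate))
    s0 = int(round((start_sec + shift_sec) * sample_rate))
    if a0 < 0 or s0 < 0 or max(a0, s0) + n > len(audio):
        raise ValueError("clip falls outside the audio track")
    return audio[a0:a0 + n], audio[s0:s0 + n]


# Toy 10-second track at 100 Hz (hypothetical); each sample stores its index.
sample_rate = 100
audio = np.arange(10 * sample_rate, dtype=np.float32)
aligned, shifted = aligned_and_shifted_clips(audio, sample_rate, 2.0, 4.0, 3.0)
print(aligned[0], shifted[0])  # 200.0 500.0
```

A classifier trained on such pairs would see the aligned clip as a positive example and the shifted clip as a negative one.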

Audio-visual source separation

To try the on/off-screen source separation model, run:

python sep_video.py ../data/translator.mp4 --model full --duration_mult 4 --out ../results/

This separates the on-screen speaker's voice from that of an off-screen speaker. It writes the separated video files to ../results/ and also displays them in a local webpage for easier viewing. It produces three videos: the input, the on-screen audio, and the off-screen audio.
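
To run the script over several videos, the command line can be assembled from Python. This is a small sketch, assuming only the flags that appear in this README (--model, --duration_mult, --mask, --cam, --out); the helper name sep_video_command is hypothetical, and whether every combination of these flags is supported by sep_video.py has not been checked.

```python
def sep_video_command(video, model="full", out="../results/",
                      duration_mult=None, mask=None, cam=False):
    """Build the argument list for one sep_video.py invocation."""
    cmd = ["python", "sep_video.py", video, "--model", model]
    if duration_mult is not None:
        cmd += ["--duration_mult", str(duration_mult)]
    if mask is not None:
        cmd += ["--mask", mask]
    if cam:
        cmd.append("--cam")
    cmd += ["--out", out]
    return cmd


cmd = sep_video_command("../data/translator.mp4", duration_mult=4)
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.call(cmd) from the directory that contains sep_video.py.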

We can visually mask out one of the two on-screen speakers, thereby removing their voice:

python sep_video.py ../data/crossfire.mp4 --model full --mask l --out ../results/
python sep_video.py ../data/crossfire.mp4 --model full --mask r --out ../results/

This produces, alongside the source video, one output video per command (labeled Left and Right).
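
Conceptually, visual masking hides one side of the frame so that the model no longer sees that speaker. The sketch below zeroes out the left or right half of every frame. It is an assumption about what --mask l and --mask r do, not the repository's implementation; the function name mask_half and the (time, height, width, channels) layout are hypothetical.

```python
import numpy as np


def mask_half(frames, side):
    """Zero out the left ("l") or right ("r") half of every frame.

    frames: array of shape (time, height, width, channels).
    Returns a masked copy; the input is left unchanged.
    """
    if side not in ("l", "r"):
        raise ValueError("side must be 'l' or 'r'")
    masked = frames.copy()
    mid = frames.shape[2] // 2
    if side == "l":
        masked[:, :, :mid] = 0
    else:
        masked[:, :, mid:] = 0
    return masked


frames = np.ones((2, 4, 6, 3), dtype=np.uint8)
left_masked = mask_half(frames, "l")
print(left_masked[0, 0, :, 0])  # [0 0 0 1 1 1]
```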

Blind (audio-only) source separation

This baseline trains a u-net model to minimize a permutation invariant loss.

python sep_video.py ../data/translator.mp4 --model unet_pit --duration_mult 4 --out ../results/

The model will write the two separated streams in an arbitrary order.
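
The arbitrary ordering follows from the loss itself: a permutation invariant loss scores the predictions against every ordering of the ground-truth sources and keeps the best one, so the model is never told which output slot belongs to which source. The following is a minimal NumPy sketch of that idea. The L1 distance and the function name pit_loss are assumptions made for illustration; the loss used in this repository may differ.

```python
import itertools

import numpy as np


def pit_loss(predictions, targets):
    """Permutation invariant L1 loss.

    predictions, targets: arrays of shape (num_sources, ...).
    Returns (loss, best_perm), where best_perm[i] is the index of the
    target matched to prediction i.
    """
    num_sources = predictions.shape[0]
    best_loss, best_perm = None, None
    for perm in itertools.permutations(range(num_sources)):
        loss = np.mean(np.abs(predictions - targets[list(perm)]))
        if best_loss is None or loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


targets = np.array([[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]])
swapped = targets[::-1]  # perfect predictions, but in the opposite order
loss, perm = pit_loss(swapped, targets)
print(loss, perm)  # 0.0 (1, 0)
```

Because the swapped prediction scores as well as the unswapped one, either ordering is an equally valid output.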

Visualizing the locations of sound sources

To view the self-supervised network's class activation map (CAM), use the --cam flag:

python sep_video.py ../data/translator.mp4 --model full --cam --out ../results/

This produces a video in which the CAM is overlaid as a heat map.
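
As background, a class activation map is commonly computed as a weighted sum of the last convolutional layer's feature maps, using the weights of the linear classifier that follows global average pooling. The sketch below shows that computation for a single 2D feature map. It is a generic illustration, not this repository's code: the function name and shapes are hypothetical, and the network here is a 3D CNN, so its map would also have a time dimension.

```python
import numpy as np


def class_activation_map(features, class_weights):
    """Compute a CAM scaled to [0, 1].

    features: (height, width, channels) activations of the last conv layer.
    class_weights: (channels,) classifier weights for the class of interest.
    """
    # Weighted sum over the channel axis gives one score per location.
    cam = np.tensordot(features, class_weights, axes=([2], [0]))
    cam = cam - cam.min()
    peak = cam.max()
    return cam / peak if peak > 0 else cam


features = np.zeros((2, 2, 3))
features[0, 1] = [1.0, 2.0, 0.0]  # only one location is active
weights = np.array([0.5, 1.0, -1.0])
cam = class_activation_map(features, weights)
print(cam)
```

For display, such a map is typically resized to the frame resolution and blended over the frame as a heat map.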

Action recognition and fine-tuning

We have provided example code for training an action recognition model (e.g. on the UCF-101 dataset) in videocls.py. This involves fine-tuning our pretrained audio-visual network. It is also possible to train this network with only visual data (no audio).

Citation

If you use this code in your research, please consider citing our paper:

@article{multisensory2018,
  title={Audio-Visual Scene Analysis with Self-Supervised Multisensory Features},
  author={Owens, Andrew and Efros, Alexei A},
  journal={arXiv preprint arXiv:1804.03641},
  year={2018}
}

Updates

11/08/18: Fixed a bug in the class activation map example code. Added TensorFlow 1.9 compatibility.

Acknowledgements

Our u-net code draws from this implementation of pix2pix.
