MUSIC-AVQA, CVPR2022 (ORAL)

Related tags

AudioMUSIC-AVQA
Overview

Audio-Visual Question Answering (AVQA)

PyTorch code accompanies our CVPR 2022 paper:

Learning to Answer Questions in Dynamic Audio-Visual Scenarios (Oral Presentation)

Guangyao Li, Yake Wei, Yapeng Tian, Chenliang Xu, Ji-Rong Wen and Di Hu

Resources: [Paper], [Supplementary], [Poster], [Video]

Project Homepage: https://gewu-lab.github.io/MUSIC-AVQA/


What's Audio-Visual Question Answering Task?

We focus on audio-visual question answering (AVQA) task, which aims to answer questions regarding different visual objects, sounds, and their associations in videos. The problem requires comprehensive multimodal understanding and spatio-temporal reasoning over audio-visual scenes.

MUSIC-AVQA Dataset

The large-scale MUSIC-AVQA dataset of musical performance, which contains 45,867 question-answer pairs, distributed in 9,288 videos for over 150 hours. All QA pairs types are divided into 3 modal scenarios, which contain 9 question types and 33 question templates. Finally, as an open-ended problem of our AVQA tasks, all 42 kinds of answers constitute a set for selection.

  • QA examples

Model Overview

To solve the AVQA problem, we propose a spatio-temporal grounding model to achieve scene understanding and reasoning over audio and visual modalities. An overview of the proposed framework is illustrated in below figure.

Requirements

python3.6 +
pytorch1.6.0
tensorboardX
ffmpeg
numpy

Usage

  1. Clone this repo

    https://github.com/GeWu-Lab/MUSIC-AVQA_CVPR2022.git
  2. Download data

    Annotations (QA pairs, etc.)

    • Available for download at here
    • The annotation files are stored in JSON format. Each annotation file contains seven different keyword. And more detail see in Project Homepage

    Features

    • We use VGGish, ResNet18, and ResNet (2+1)D to extract audio, 2D frame-level, and 3D snippet-level features, respectively.

    • The audio and visual features of videos in the MUSIC-AVQA dataset can be download from Baidu Drive (password: cvpr):

      • VGGish feature shape: [T, 128]  Download (112.7M)
      • ResNet18 feature shape: [T, 512]  Download (972.6M)
      • R(2+1)D feature shape: [T, 512]  Download (973.9M)
    • The features are in the ./data/feats folder.

    • 14x14 features, too large to share ... but we can extract from raw video frames.

    Download videos frames

    • Raw videos: Availabel at Baidu Drive (password: cvpr):.

      Note: Please move all downloaded videos to a folder, for example, create a new folder named MUSIC-AVQA-Videos, which contains 9,288 real videos and synthetic videos.

    • Raw video frames (1fps): Available at Baidu Drive (14.84GB) (password: cvpr).

    • Download raw videos in the MUSIC-AVQA dataset. The downloaded videos will be in the /data/video folder.

    • Pandas and ffmpeg libraries are required.

  3. Data pre-processing

    Extract audio waveforms from videos. The extracted audios will be in the ./data/audio folder. moviepy library is used to read videos and extract audios.

    python feat_script/extract_audio_cues/extract_audio.py	

    Extract video frames from videos. The extracted frames will be in the data/frames folder.

    python feat_script/extract_visual_frames/extract_frames_adaptive_script.py
  4. Feature extraction

    Audio feature. TensorFlow1.4 and VGGish pretrained on AudioSet is required. Feature file also can be found from here (password: cvpr).

    python feat_script/extract_audio_feat/audio_feature_extractor.py

    2D visual feature. Pretrained models library is required.

    python feat_script/eatract_visual_feat/extract_rgb_feat.py

    3D visual feature.

    python feat_script/eatract_visual_feat/extract_3d_feat.py

    14x14 visual feature.

    python feat_script/extract_visual_feat_14x14/extract_14x14_feat.py
  5. Baseline Model

    Training

    python net_grd_baseline/main_qa_grd_baseline.py --mode train

    Testing

    python net_grd_baseline/main_qa_grd_baseline.py --mode test
  6. Our Audio-Visual Spatial-Temporal Model

    We provide trained models and you can quickly test the results. Test results may vary slightly on different machines.

    python net_grd_avst/main_avst.py --mode train \
    	--audio_dir = "path to your audio features"
    	--video_res14x14_dir = "path to your visual res14x14 features"

    Audio-Visual grounding generation

    python grounding_gen/main_grd_gen.py

    Training

    python net_grd_avst/main_avst.py --mode train \
    	--audio_dir = "path to your audio features"
    	--video_res14x14_dir = "path to your visual res14x14 features"

    Testing

    python net_grd_avst/main_avst.py --mode test \
    	--audio_dir = "path to your audio features"
    	--video_res14x14_dir = "path to your visual res14x14 features"

Results

  1. Audio-visual video question answering results of different methods on the test set of MUSIC-AVQA. The top-2 results are highlighted. Please see the citations in the [Paper] for comparison methods.

  2. Visualized spatio-temporal grounding results

    We provide several visualized spatial grounding results. The heatmap indicates the location of sounding source. Through the spatial grounding results, the sounding objects are visually captured, which can facilitate the spatial reasoning.

    Firstly, ./grounding_gen/models_grd_vis/ should be created.

    python grounding_gen/main_grd_gen_vis.py

Citation

If you find this work useful, please consider citing it.


@ARTICLE{Li2022Learning,
  title	= {Learning to Answer Questions in Dynamic Audio-Visual Scenarios},
  author	= {Guangyao li, Yake Wei, Yapeng Tian, Chenliang Xu, Ji-Rong Wen, Di Hu},
  journal	= {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year	= {2022},
}

Acknowledgement

This research was supported by Public Computing Cloud, Renmin University of China.

License

This project is released under the GNU General Public License v3.0.

Gradient - A Python program designed to create a reactive and ambient music listening experience

Gradient is a Python program designed to create a reactive and ambient music listening experience.

Alexander Vega 2 Jan 24, 2022
Delta TTA(Text To Audio) SoftWare

Text-To-Audio-Windows Delta TTA(Text To Audio) SoftWare Info You Can Use It For Convert Your Text To Audio File You Just Write Your Text And Your End

Delta Inc. 2 Dec 14, 2021
Python CD-DA ripper preferring accuracy over speed

Whipper Whipper is a Python 3 (3.6+) CD-DA ripper based on the morituri project (CDDA ripper for *nix systems aiming for accuracy over speed). It star

671 Jan 04, 2023
This library provides common speech features for ASR including MFCCs and filterbank energies.

python_speech_features This library provides common speech features for ASR including MFCCs and filterbank energies. If you are not sure what MFCCs ar

James Lyons 2.2k Jan 04, 2023
無料で使える中品質なテキスト読み上げソフトウェア、VOICEVOXのコア

無料で使える中品質なテキスト読み上げソフトウェア、VOICEVOXのコア

Hiroshiba 0 Aug 29, 2022
Audio features extraction

Yaafe Yet Another Audio Feature Extractor Build status Branch master : Branch dev : Anaconda : Install Conda Yaafe can be easily install with conda. T

Yaafe 231 Dec 26, 2022
Suyash More 111 Jan 07, 2023
Audio fingerprinting and recognition in Python

dejavu Audio fingerprinting and recognition algorithm implemented in Python, see the explanation here: How it works Dejavu can memorize audio by liste

Will Drevo 6k Jan 06, 2023
A python wrapper for REAPER

pyreaper A python wrapper for REAPER (Robust Epoch And Pitch EstimatoR) Installation pip install pyreaper Demonstration notebnook http://nbviewer.jupy

Ryuichi Yamamoto 56 Dec 27, 2022
Convert complex chord names to midi notes

ezchord Simple python script that can convert complex chord names to midi notes Prerequisites pip install midiutil Usage ./ezchord.py Dmin7 G7 C timi

Alex Zhang 2 Dec 20, 2022
A2DP agent for promiscuous/permissive audio sinc.

Promiscuous Bluetooth audio sinc A2DP agent for promiscuous/permissive audio sinc for Linux. Once installed, a Bluetooth client, such as a smart phone

Jasper Aorangi 4 May 27, 2022
Gammatone-based spectrograms, using gammatone filterbanks or Fourier transform weightings.

Gammatone Filterbank Toolkit Utilities for analysing sound using perceptual models of human hearing. Jason Heeris, 2013 Summary This is a port of Malc

Jason Heeris 188 Dec 14, 2022
A voice assistant which can be used to interact with your computer and controls your pc operations

Introduction 👨‍💻 It is a voice assistant which can be used to interact with your computer and also you have been seeing it in Iron man movies, but t

Sujith 84 Dec 22, 2022
controls volume using hand gestures

controls volume using hand gestures

1 Oct 11, 2021
SU Music Player — The first open-source PyTgCalls based Pyrogram bot to play music in voice chats

SU Music Player — The first open-source PyTgCalls based Pyrogram bot to play music in voice chats Note Neither this, or PyTgCalls are fully

SU Projects 58 Jan 02, 2023
Algorithmic Multi-Instrumental MIDI Continuation Implementation

Matchmaker Algorithmic Multi-Instrumental MIDI Continuation Implementation Taming large-scale MIDI datasets with algorithms This is a WIP so please ch

Alex 2 Mar 11, 2022
PatrikZero's CS:GO Hearing protection

Program that lowers volume when you die and get flashed in CS:GO. It aims to lower the chance of hearing damage by reducing overall sound exposure. Uses game state integration. Anti-cheat safe.

Patrik Žúdel 224 Dec 04, 2022
Hide Your Secret Message in any Wave Audio File.

HiddenWave Embedding secret messages in wave audio file What is HiddenWave Hiddenwave is a python based program for simple audio steganography. You ca

TechChip 99 Dec 28, 2022
Audio pitch-shifting & re-sampling utility, based on the EMU SP-1200

Pitcher.py Free & OS emulation of the SP-12 & SP-1200 signal chain (now with GUI) Pitch shift / bitcrush / resample audio files Written and tested in

morgan 13 Oct 03, 2022
We built this fully functioning Music player in Python. The music player allows you to play/pause and switch to different songs easily.

We built this fully functioning Music player in Python. The music player allows you to play/pause and switch to different songs easily.

1 Nov 19, 2021