Who calls the shots? Rethinking Few-Shot Learning for Audio (WASPAA 2021)

Overview

This repo contains the source code for the paper "Who calls the shots? Rethinking Few-Shot Learning for Audio" (WASPAA 2021).

Table of Contents

  • Dataset
  • Experiment
  • Reference
  • Citation

Dataset

Models in this work are trained on FSD-MIX-CLIPS, an open dataset of programmatically mixed audio clips with a controlled level of polyphony and signal-to-noise ratio. We use single-labeled clips from FSD50K as the source material for the foreground sound events and Brownian noise as the background to generate 281,039 10-second strongly-labeled soundscapes with Scaper. We refer to this intermediate dataset of 10-second soundscapes as FSD-MIX-SED. Each soundscape contains n events from n different sound classes, where n ranges from 1 to 5. We then extract 614,533 1-second clips centered on each sound event in the FSD-MIX-SED soundscapes to produce FSD-MIX-CLIPS.

Due to the large size of the dataset, instead of releasing the raw audio files, we release the source material (a subset of FSD50K) and soundscape annotations in JAMS format, which can be used to reproduce FSD-MIX-SED with Scaper. All clips in FSD-MIX-CLIPS are extracted from FSD-MIX-SED. Therefore, for FSD-MIX-CLIPS, instead of releasing duplicated audio content, we provide annotations that specify the filename in FSD-MIX-SED and the start time (in seconds) of each 1-second clip.
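For intuition, recovering a 1-second FSD-MIX-CLIPS clip from a FSD-MIX-SED soundscape is a simple slice. A minimal sketch, assuming an annotation row provides the soundscape filename and the clip's start time in seconds (the exact annotation format is defined by the released files):

import soundfile as sf

def extract_clip(soundscape_path, start_time, duration=1.0):
    # Read exactly one clip's worth of audio starting at the annotated time.
    sr = sf.info(soundscape_path).samplerate
    start = int(start_time * sr)
    clip, _ = sf.read(soundscape_path, start=start, stop=start + int(duration * sr))
    return clip, sr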

To reproduce FSD-MIX-SED:

  1. Download all files from Zenodo.
  2. Extract .tar.gz files. You will get
  • FSD_MIX_SED.annotations: 281,039 annotation files, 35GB
  • FSD_MIX_SED.source: 10,296 single-labeled audio clips, 1.9GB
  • FSD_MIX_CLIPS.annotations: 5 annotation files, one per class/data split
  • vocab.json: 89 classes; each class is labeled by its index in this list in the following experiments. 0-58: base, 59-73: novel-val, 74-88: novel-test.

We will use FSD_MIX_SED.annotations and FSD_MIX_SED.source to reproduce the audio data in FSD_MIX_SED, and use that audio together with FSD_MIX_CLIPS.annotations for the training and evaluation that follow.

  3. Install Scaper
  4. Generate soundscapes from the JAMS files by running the command below. Set annpath and audiopath to the extracted folders, and savepath to the desired location for the output audio files.
python ./data/generate_soundscapes.py \
--annpath PATH-TO-FSD_MIX_SED.annotations \
--audiopath PATH-TO-FSD_MIX_SED.source \
--savepath PATH-TO-SAVE-OUTPUT

Note that this will generate 281,039 audio files (~450GB total) in the folder FSD_MIX_SED.audio under the specified savepath.
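For intuition, generate_soundscapes.py essentially iterates over the JAMS files and re-synthesizes each soundscape with Scaper's generate_from_jams. A minimal sketch (folder names and layout here are assumptions; see the script for the actual logic):

import glob
import os
import scaper

ann_dir = "FSD_MIX_SED.annotations"  # extracted JAMS annotations
src_dir = "FSD_MIX_SED.source"       # extracted source material (assumed layout)
out_dir = "FSD_MIX_SED.audio"
os.makedirs(out_dir, exist_ok=True)

for jams_path in glob.glob(os.path.join(ann_dir, "**", "*.jams"), recursive=True):
    wav_path = os.path.join(out_dir, os.path.basename(jams_path).replace(".jams", ".wav"))
    # Re-synthesize the soundscape exactly as annotated in the JAMS file.
    scaper.generate_from_jams(jams_path, wav_path, fg_path=src_dir, bg_path=src_dir)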

If you want to generate the foreground material (FSD_MIX_SED.source) directly from FSD50K instead of downloading it, run

python ./data/preprocess_foreground_sounds.py \
--fsdpath PATH-TO-FSD50K \
--outpath PATH_TO_SAVE_OUTPUT

Experiment

We provide source code to train the best-performing embedding model (pretrained OpenL3 + FC) and three different few-shot methods to predict both base and novel class data.

Preprocessing

Once the audio files are reproduced, we pre-compute the OpenL3 embedding of each clip in FSD-MIX-CLIPS and save it.

  1. Install OpenL3
  2. Set the paths of the downloaded FSD_MIX_CLIPS.annotations and the generated FSD_MIX_SED.audio, and run
python get_openl3emb_and_filelist.py \
--annpath PATH-TO-FSD_MIX_CLIPS.annotations \
--audiopath PATH-TO-FSD_MIX_SED.audio \
--savepath PATH_TO_SAVE_OUTPUT

This generates 614,533 .pkl files, where each file contains one embedding. A set of filelists will also be saved under the current folder.
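For reference, the core of the embedding step looks roughly like the following. The content_type, input_repr, embedding_size, and frame averaging shown here are assumptions; check get_openl3emb_and_filelist.py for the exact settings:

import pickle
import openl3
import soundfile as sf

audio, sr = sf.read("example_1s_clip.wav")
# One embedding per clip; frames within the clip are averaged here (an assumption).
emb, _ = openl3.get_audio_embedding(audio, sr, content_type="env",
                                    input_repr="mel256", embedding_size=512)
with open("example_1s_clip.pkl", "wb") as f:
    pickle.dump(emb.mean(axis=0), f)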

Environment

Create a conda environment from the environment.yml file and activate it.

Note that you only need this environment if you want to train or evaluate the models. To reproduce the dataset, see Dataset.

conda env create -f environment.yml
conda activate dfsl

Training

  • Training configurations can be specified via the config files in ./config
  • Model checkpoints will be saved in ./experiments, and TensorBoard logs will be saved in ./run

1. Base classifier

First, to train the base classifier on base classes, run

python train.py --config openl3CosineClassifier --openl3
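The base classifier is a cosine-similarity classifier head on top of the OpenL3 embeddings, following the FewShotWithoutForgetting recipe referenced below. A minimal PyTorch sketch of such a head (not the repo's exact module; the scale initialization is an assumption):

import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineClassifier(nn.Module):
    # Score an embedding by scaled cosine similarity to per-class weight vectors.
    def __init__(self, emb_dim, n_classes, scale=10.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_classes, emb_dim))
        self.scale = nn.Parameter(torch.tensor(scale))  # learnable temperature

    def forward(self, x):
        x = F.normalize(x, dim=-1)
        w = F.normalize(self.weight, dim=-1)
        return self.scale * x @ w.t()  # (batch, n_classes) logits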

2. Few-shot weight generator for DFSL

Once the base model is trained, we can train the few-shot weight generator for DFSL (sketched after the commands below) by running

python train.py --config openl3CosineClassifierGenWeightAttN5 --openl3

By default, DFSL is trained with 5 support examples (n=5). To train DFSL with a different n, run

# n=10
python train.py --config openl3CosineClassifierGenWeightAttN10 --openl3

# n=20
python train.py --config openl3CosineClassifierGenWeightAttN20 --openl3

# n=30
python train.py --config openl3CosineClassifierGenWeightAttN30 --openl3
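For intuition, DFSL's weight generator produces a classification weight for a novel class from its n support embeddings, combining their average with an attention-weighted mix of the base-class weights. A simplified PyTorch sketch of that idea (not the repo's exact module; the attention keys and parameter names are assumptions):

import torch
import torch.nn as nn
import torch.nn.functional as F

class FewShotWeightGenerator(nn.Module):
    # Generate a novel-class weight from n support embeddings: a learned,
    # element-wise mix of their average and an attention-weighted sum of
    # the (frozen) base-class weights.
    def __init__(self, emb_dim, base_weights):
        super().__init__()
        self.register_buffer("base_w", F.normalize(base_weights, dim=-1))
        self.query = nn.Linear(emb_dim, emb_dim, bias=False)  # attention query
        self.phi_avg = nn.Parameter(torch.ones(emb_dim))
        self.phi_att = nn.Parameter(torch.ones(emb_dim))

    def forward(self, support):  # support: (n, emb_dim)
        z = F.normalize(support, dim=-1)
        avg = z.mean(dim=0)
        att = F.softmax(self.query(z) @ self.base_w.t(), dim=-1)  # (n, n_base)
        att_w = (att @ self.base_w).mean(dim=0)
        return self.phi_avg * avg + self.phi_att * att_w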

Evaluation

We evaluate the trained models on test data from both base and novel classes. For each novel class, we need to sample a support set. Run the command below to split the original filelist for the test classes into test_support_filelist.pkl and test_query_filelist.pkl.

python get_test_support_and_query.py
  • Here we consider monophonic support examples with mixed (random) SNR. Code to run the evaluation with polyphonic support examples at specific low/high SNRs will be released soon.

For evaluation, we compute features for both base and novel test data, then make predictions and compute metrics in a joint label space. The computed features, model predictions, and metrics will be saved in the folder ./experiments. We consider three few-shot methods to predict novel classes. To test different numbers of support examples, set a different n_pos in the following commands.

1. Prototype

# Extract embeddings of evaluation data and save them.
python save_features.py --config=openl3CosineClassifier --openl3

# Get and save model predictions; run this multiple times (niter) to account for the random selection of novel examples.
python pred.py --config=openl3CosineClassifier --openl3 --niter 100 --n_base 59 --n_novel 15 --n_pos 5

# Compute and save evaluation metrics based on the model predictions
python metrics.py --config=openl3CosineClassifier --openl3 --n_base 59 --n_novel 15 --n_pos 5
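For reference, the prototype method simply averages the n_pos support embeddings of each novel class and appends the result to the base cosine-classifier weights, so base and novel classes are scored in one joint label space. A minimal sketch (assumed shapes; not the repo's exact code):

import torch
import torch.nn.functional as F

def prototype_logits(query_emb, base_weights, support_sets, scale=10.0):
    # Each novel-class weight is the normalized mean of its support embeddings,
    # appended to the base classifier weights for a joint prediction.
    protos = torch.stack([F.normalize(s, dim=-1).mean(dim=0) for s in support_sets])
    weights = F.normalize(torch.cat([base_weights, protos], dim=0), dim=-1)
    return scale * F.normalize(query_emb, dim=-1) @ weights.t()  # (batch, n_base + n_novel)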

2. DFSL

# Extract embeddings of evaluation data and save them.
python save_features.py --config=openl3CosineClassifierGenWeightAttN5 --openl3

# Get and save model predictions; run this multiple times (niter) to account for the random selection of novel examples.
python pred.py --config=openl3CosineClassifierGenWeightAttN5 --openl3 --niter 100 --n_base 59 --n_novel 15 --n_pos 5

# Compute and save evaluation metrics based on the model predictions
python metrics.py --config=openl3CosineClassifierGenWeightAttN5 --openl3 --n_base 59 --n_novel 15 --n_pos 5

3. Logistic regression

Train a binary logistic regression model for each novel class (a sketch of this per-class setup follows the commands below). Note that we need to sample n_neg examples from the base training data as negative examples; the default n_neg is 100. We also performed a hyperparameter search on n_neg using the validation data as n_pos varies from 5 to 30:

  • n_pos=5, n_neg=100
  • n_pos=10, n_neg=500
  • n_pos=20, n_neg=1000
  • n_pos=30, n_neg=5000

# Extract embeddings of evaluation data and save them.
python save_features.py --config=openl3CosineClassifier --openl3

# Train binary logistic regression models, predict on the test data, and compute metrics
python logistic_regression.py --config=openl3CosineClassifier --openl3 --niter 10 --n_base 59 --n_novel 15 --n_pos 5 --n_neg 100
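For intuition, a minimal sketch of the per-class setup (a hypothetical helper; the actual sampling and hyperparameters live in logistic_regression.py):

import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_novel_class(pos_emb, base_emb, n_neg=100, seed=0):
    # Positives: the n_pos support embeddings of one novel class.
    # Negatives: n_neg embeddings sampled at random from the base training data.
    rng = np.random.default_rng(seed)
    neg = base_emb[rng.choice(len(base_emb), size=n_neg, replace=False)]
    X = np.concatenate([pos_emb, neg])
    y = np.concatenate([np.ones(len(pos_emb)), np.zeros(n_neg)])
    return LogisticRegression(max_iter=1000).fit(X, y)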

Reference

This code is built upon the implementation from FewShotWithoutForgetting.

Citation

Please cite our paper if you find the code or dataset useful for your research.

Y. Wang, N. J. Bryan, J. Salamon, M. Cartwright, and J. P. Bello, "Who calls the shots? Rethinking Few-Shot Learning for Audio," in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2021.
