Source code for "MusCaps: Generating Captions for Music Audio" (IJCNN 2021)

Last update: Dec 07, 2022

Overview

MusCaps: Generating Captions for Music Audio

Ilaria Manco¹ ², Emmanouil Benetos¹, Elio Quinton², Gyorgy Fazekas¹
¹ Queen Mary University of London, ² Universal Music Group

This repository is the official implementation of "MusCaps: Generating Captions for Music Audio" (IJCNN 2021). In this work, we propose an encoder-decoder model to generate natural language descriptions of music audio. We provide code to train our model on any dataset of (audio, caption) pairs, together with code to evaluate the generated descriptions on a set of automatic metrics (BLEU, METEOR, ROUGE, CIDEr, SPICE, SPIDEr).

Setup

The code was developed in Python 3.7 on Linux CentOS 7 and training was carried out on an RTX 2080 Ti GPU. Other GPUs and platforms have not been fully tested.

Clone the repo

git clone https://github.com/ilaria-manco/muscaps
cd muscaps

You'll need the libsndfile library installed on your system. All other requirements, including the code package itself, can be installed with:

pip install -r requirements.txt
pip install -e .
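
If you are unsure whether libsndfile is already available, a quick check along the following lines may help. This is a minimal sketch using only the Python standard library; `has_libsndfile` is a hypothetical helper and is not part of this repository. A negative result does not necessarily mean audio loading will fail, since some Python audio packages may ship their own copy of the library.

```python
from ctypes.util import find_library


def has_libsndfile() -> bool:
    """Return True if the system loader can locate a libsndfile shared library."""
    # find_library returns the library name or path as a string, or None if not found.
    return find_library("sndfile") is not None


if __name__ == "__main__":
    if has_libsndfile():
        print("libsndfile found")
    else:
        print("libsndfile not found; install it with your system package manager")
```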

Project structure

root
├─ configs                      # Config files
│   ├─ datasets
│   ├─ models  
│   └─ default.yaml              
├─ data                         # Folder to save data (input data, pretrained model weights, etc.)
│   ├─ audio_encoders   
│   ├─ datasets            
│   │   └─ dataset_name     
│   └─ ...
├─ muscaps
│   ├─ caption_evaluation_tools # Translation metrics eval on audio captioning
│   ├─ datasets                 # Dataset classes
│   ├─ models                   # Model code
│   ├─ modules                  # Model components
│   ├─ scripts                  # Python scripts for training, evaluation etc.
│   ├─ trainers                 # Trainer classes
│   └─ utils                    # Utils
└─ save                         # Saved model checkpoints, logs, configs, predictions    
    └─ experiments
        ├─ experiment_id1
        └─ ...
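
The data and save folders in the tree above can be created ahead of time with a few lines of Python. This is a sketch only: `make_data_dirs` is a hypothetical helper, `my_dataset` is a placeholder for your own dataset name, and the training scripts may well create some of these folders on their own.

```python
from pathlib import Path


def make_data_dirs(root: Path, dataset_name: str) -> list:
    """Create the data and save folders shown in the project tree, if missing."""
    dirs = [
        root / "data" / "audio_encoders",           # pretrained audio encoder weights
        root / "data" / "datasets" / dataset_name,  # input data for one dataset
        root / "save" / "experiments",              # checkpoints, logs, predictions
    ]
    for d in dirs:
        # parents=True builds intermediate folders; exist_ok=True makes this idempotent.
        d.mkdir(parents=True, exist_ok=True)
    return dirs


if __name__ == "__main__":
    # Run from the repository root; "my_dataset" is a placeholder name.
    for created in make_data_dirs(Path("."), "my_dataset"):
        print(created)
```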

Dataset

The dataset used in our experiments is private and cannot be shared, but details on how to prepare an equivalent music captioning dataset are provided in the data README.

Pre-trained audio feature extractors

For the audio feature extraction component, MusCaps uses CNN-based audio tagging models such as musicnn. In our experiments, we use @minzwon's implementation and pre-trained models, which you can download from the official repo. For example, to obtain the weights for the HCNN model trained on the MagnaTagATune dataset, run the following commands from the repository root:

mkdir -p data/audio_encoders
cd data/audio_encoders/
wget https://github.com/minzwon/sota-music-tagging-models/raw/master/models/mtat/hcnn/best_model.pth
mv best_model.pth mtt_hcnn.pth
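
The same download can be scripted in Python if `wget` is not available. The sketch below uses only the standard library and the URL from the commands above. The helper names are hypothetical, and the `<pretrained_model>_<feature_extractor>.pth` naming pattern is an assumption generalised from the single `mtt_hcnn.pth` example; check the configs before relying on it for other models.

```python
from pathlib import Path
from urllib.request import urlretrieve

# URL taken from the commands above (HCNN trained on MagnaTagATune).
HCNN_MTT_URL = (
    "https://github.com/minzwon/sota-music-tagging-models/"
    "raw/master/models/mtat/hcnn/best_model.pth"
)


def weights_path(root: Path, pretrained_model: str, feature_extractor: str) -> Path:
    """Build the local path for a weights file, e.g. data/audio_encoders/mtt_hcnn.pth.

    Assumption: files are named <pretrained_model>_<feature_extractor>.pth.
    """
    filename = f"{pretrained_model}_{feature_extractor}.pth"
    return root / "data" / "audio_encoders" / filename


def download_hcnn_mtt(root: Path) -> Path:
    """Download the HCNN/MagnaTagATune weights unless they are already present."""
    target = weights_path(root, "mtt", "hcnn")
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():
        urlretrieve(HCNN_MTT_URL, str(target))
    return target
```

Calling `download_hcnn_mtt(Path("."))` from the repository root should leave the file at `data/audio_encoders/mtt_hcnn.pth`, matching the result of the shell commands.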

Training

Dataset, model and training configurations are set in the respective YAML files in configs. Some of the fields can be overridden by command-line arguments (for more details, refer to the training script).
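
To illustrate the general idea of command-line arguments taking precedence over file-based settings, here is a minimal sketch. It is not the repository's actual logic, which lives in the training script and may differ: the flat dictionary stands in for a loaded YAML file, its values are made up for illustration, and `apply_overrides` is a hypothetical helper.

```python
import argparse
import copy


def apply_overrides(config: dict, overrides: dict) -> dict:
    """Return a copy of config in which every non-None override replaces its key."""
    merged = copy.deepcopy(config)
    for key, value in overrides.items():
        if value is not None:
            merged[key] = value
    return merged


def parse_args(argv=None) -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    # Flag names mirror the training command in this README. A default of None
    # means "keep whatever the config file says".
    parser.add_argument("--feature_extractor", default=None)
    parser.add_argument("--pretrained_model", default=None)
    parser.add_argument("--device_num", type=int, default=None)
    return parser.parse_args(argv)


# Hypothetical values standing in for the contents of a YAML config file.
file_config = {"feature_extractor": "musicnn", "pretrained_model": "msd", "device_num": 0}

# Only --feature_extractor is given, so the other two fields keep their file values.
args = parse_args(["--feature_extractor", "hcnn"])
config = apply_overrides(file_config, vars(args))
print(config)
```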

To train the model with the default configs, run:

cd muscaps/scripts/
python train.py <baseline/attention> --feature_extractor <musicnn/hcnn> --pretrained_model <msd/mtt> --device_num <gpu_number>

This will generate an experiment_id and create a new folder in save/experiments where the output will be saved.

If you wish to resume training from a saved checkpoint, run:

python train.py <baseline/attention> --experiment_id <experiment_id> --device_num <gpu_number>
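
When launching several runs from another script, it can be convenient to assemble these commands programmatically. The sketch below only builds the argument lists for the two invocations shown in this section; `build_train_command` is a hypothetical helper, and it does not check whether a given combination of feature extractor and pretrained model is actually supported.

```python
VALID_MODELS = ("baseline", "attention")


def build_train_command(model, device_num, feature_extractor=None,
                        pretrained_model=None, experiment_id=None):
    """Build the argument list for a fresh or a resumed training run."""
    if model not in VALID_MODELS:
        raise ValueError(f"model must be one of {VALID_MODELS}, got {model!r}")
    cmd = ["python", "train.py", model]
    if experiment_id is not None:
        # Resuming: the saved experiment is identified by its ID.
        cmd += ["--experiment_id", str(experiment_id)]
    else:
        # Fresh run: both the feature extractor and its pretrained weights are needed.
        if feature_extractor is None or pretrained_model is None:
            raise ValueError("a fresh run needs feature_extractor and pretrained_model")
        cmd += ["--feature_extractor", feature_extractor,
                "--pretrained_model", pretrained_model]
    cmd += ["--device_num", str(device_num)]
    return cmd


fresh = build_train_command("attention", 0, feature_extractor="hcnn", pretrained_model="mtt")
resume = build_train_command("baseline", 1, experiment_id="my_experiment_id")
print(" ".join(fresh))
print(" ".join(resume))
```

The resulting list can be passed to `subprocess.run(cmd, cwd="muscaps/scripts", check=True)` from the repository root. Here `my_experiment_id` is a placeholder for an ID generated by an earlier run.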

Evaluation

To evaluate a model saved under <experiment_id> on the captioning task, run

cd muscaps/scripts/
python caption.py <experiment_id> --metrics True
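
For intuition about what the n-gram overlap metrics measure, here is a deliberately simplified, sentence-level BLEU-1 sketch against a single reference. It is not the implementation used by this repository, which relies on the caption evaluation tools listed in the project structure; full BLEU combines n-gram orders up to 4 and is usually computed at corpus level. The function name `bleu1` and the whitespace tokenisation are assumptions made for this example.

```python
import math
from collections import Counter


def bleu1(candidate: str, reference: str) -> float:
    """Clipped unigram precision times a brevity penalty, for one reference."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0
    cand_counts = Counter(cand)
    ref_counts = Counter(ref)
    # Clip each candidate word count by how often the word occurs in the reference,
    # so repeating a correct word does not inflate the score.
    overlap = sum(min(count, ref_counts[word]) for word, count in cand_counts.items())
    precision = overlap / len(cand)
    # Brevity penalty: candidates shorter than the reference are scaled down.
    if len(cand) >= len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))
    return brevity_penalty * precision


if __name__ == "__main__":
    print(bleu1("a slow piano ballad", "a slow and sad piano ballad"))
```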

Cite

@misc{manco2021muscaps,
      title={MusCaps: Generating Captions for Music Audio}, 
      author={Ilaria Manco and Emmanouil Benetos and Elio Quinton and Gyorgy Fazekas},
      year={2021},
      eprint={2104.11984},
      archivePrefix={arXiv}
}

Acknowledgements

This repo reuses some code from the following repos:

sota-music-tagging-models by @minzwon
caption-evaluation-tools by @audio-captioning
mmf by @facebookresearch
a-PyTorch-Tutorial-to-Image-Captioning by @sgrvinod
allennlp by @allenai

Contact

If you have any questions, please get in touch: [email protected].
