Source code for models described in the paper "AudioCLIP: Extending CLIP to Image, Text and Audio" (https://arxiv.org/abs/2106.13043)

Last update: Jan 02, 2023

AudioCLIP

Extending CLIP to Image, Text and Audio

This repository contains an implementation of the models described in the paper arXiv:2106.13043. This work is based on our previous works:

ESResNe(X)t-fbsp: Learning Robust Time-Frequency Transformation of Audio (2021).
ESResNet: Environmental Sound Classification Based on Visual Domain Models (2020).

Abstract

In the past, the rapidly evolving field of sound classification greatly benefited from the application of methods from other domains. Today, we observe the trend to fuse domain-specific tasks and approaches together, which provides the community with new outstanding models.

In this work, we present an extension of the CLIP model that handles audio in addition to text and images. Our proposed model incorporates the ESResNeXt audio-model into the CLIP framework using the AudioSet dataset. Such a combination enables the proposed model to perform bimodal and unimodal classification and querying, while keeping CLIP's ability to generalize to unseen datasets in a zero-shot inference fashion.

AudioCLIP achieves new state-of-the-art results in the Environmental Sound Classification (ESC) task, outperforming other approaches by reaching accuracies of 90.07% on the UrbanSound8K and 97.15% on the ESC-50 datasets. Furthermore, it sets new baselines in the zero-shot ESC task on the same datasets (68.78% and 69.40%, respectively).

Finally, we also assess the cross-modal querying performance of the proposed model as well as the influence of full and partial training on the results. For the sake of reproducibility, our code is published.

Downloading Pre-Trained Weights

The pre-trained model can be downloaded from the releases.

# AudioCLIP trained on AudioSet (text-, image- and audio-head simultaneously)
wget https://github.com/AndreyGuzhov/AudioCLIP/releases/download/v0.1/AudioCLIP-Full-Training.pt
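The other assets of the v0.1 release appear to follow the same URL pattern as the wget command above. A minimal Python sketch of fetching them, assuming the asset names listed under Releases; the helper names `release_url` and `download` and the `assets` target directory are hypothetical and not part of the repository:

```python
import urllib.request
from pathlib import Path

# Base URL taken from the wget command above.
RELEASE_BASE = "https://github.com/AndreyGuzhov/AudioCLIP/releases/download/v0.1"

# Asset names as listed in the v0.1 release.
ASSETS = (
    "AudioCLIP-Full-Training.pt",
    "AudioCLIP-Partial-Training.pt",
    "bpe_simple_vocab_16e6.txt.gz",
    "CLIP.pt",
    "ESRNXFBSP.pt",
)


def release_url(asset: str) -> str:
    """Return the download URL for a known v0.1 release asset."""
    if asset not in ASSETS:
        raise ValueError(f"unknown release asset: {asset!r}")
    return f"{RELEASE_BASE}/{asset}"


def download(asset: str, target_dir: str = "assets") -> Path:
    """Download an asset into target_dir (an assumed location), skipping existing files."""
    destination = Path(target_dir) / asset
    destination.parent.mkdir(parents=True, exist_ok=True)
    if not destination.exists():
        urllib.request.urlretrieve(release_url(asset), str(destination))
    return destination
```

Calling `download("AudioCLIP-Partial-Training.pt")` would fetch the audio-head-only weights in the same way; where the training scripts expect these files to live is not stated here, so adjust `target_dir` as needed.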

How to Run the Model

The required Python version is >= 3.7.
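The version requirement can be checked before launching a run. A small sketch using only the standard library; the function name `python_is_supported` is an illustrative assumption, not something the repository provides:

```python
import sys

# Minimum version stated in this README.
MIN_PYTHON = (3, 7)


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True if the given (major, minor, ...) version is at least 3.7."""
    return tuple(version_info[:2]) >= MIN_PYTHON


if not python_is_supported():
    raise SystemExit("AudioCLIP requires Python >= 3.7")
```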

AudioCLIP

On the ESC-50 dataset

python main.py --config protocols/audioclip-esc50.json --Dataset.args.root /path/to/ESC50

On the UrbanSound8K dataset

python main.py --config protocols/audioclip-us8k.json --Dataset.args.root /path/to/UrbanSound8K
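Both commands differ only in the protocol file and the dataset root, so they can be wrapped in a small launcher. The following is a sketch under the assumption that it is executed from the repository root, next to main.py; the dictionary keys `esc50` and `us8k` and the function names are hypothetical shorthand:

```python
import subprocess
import sys

# Protocol files as used in the two commands above.
PROTOCOLS = {
    "esc50": "protocols/audioclip-esc50.json",
    "us8k": "protocols/audioclip-us8k.json",
}


def build_command(dataset: str, root: str) -> list:
    """Assemble the main.py invocation for a dataset key and its root directory."""
    if dataset not in PROTOCOLS:
        raise ValueError(f"unknown dataset key: {dataset!r}")
    return [
        sys.executable, "main.py",
        "--config", PROTOCOLS[dataset],
        "--Dataset.args.root", root,
    ]


def run(dataset: str, root: str) -> int:
    """Run main.py for the given dataset; raises CalledProcessError on failure."""
    return subprocess.run(build_command(dataset, root), check=True).returncode
```

For example, `run("esc50", "/path/to/ESC50")` would issue the same call as the first command above, using the current Python interpreter.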

Cite Us

@misc{guzhov2021audioclip,
      title={AudioCLIP: Extending CLIP to Image, Text and Audio}, 
      author={Andrey Guzhov and Federico Raue and Jörn Hees and Andreas Dengel},
      year={2021},
      eprint={2106.13043},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}

Comments

Make project usable by other python projects: remove git lfs and move files into an audioclip folder

Git lfs was giving problems, so I removed all assets files from it - the files can be found in the "Release" anyways.

Also it was a bit problematic to use this project in other projects because the folder structure was lacking. I moved all files into an "audioclip" folder to fix python pathing for external projects.

I renamed master to main, but I doubt that this change is going to stay once this pull request is merged.

opened by NotNANtoN 0

Releases(v0.1)

v0.1(Jun 29, 2021)
Text-embedding vocabulary and PyTorch state_dicts containing the weights of the AudioCLIP model trained on AudioSet:

bpe_simple_vocab_16e6.txt.gz – CLIP's vocabulary (origin)

CLIP.pt – vanilla CLIP (text Transformer & ResNet-50 image-head, origin)

ESRNXFBSP.pt – ESResNeXt trained on AudioSet (standalone)

AudioCLIP trained on AudioSet (+ video frames)

AudioCLIP-Full-Training.pt – training of all three heads (text, image and audio)

AudioCLIP-Partial-Training.pt – training of the audio-head only

Source code(tar.gz)
Source code(zip)
AudioCLIP-Full-Training.pt(512.41 MB)
AudioCLIP-Partial-Training.pt(512.41 MB)
bpe_simple_vocab_16e6.txt.gz(1.29 MB)
CLIP.pt(389.49 MB)
ESRNXFBSP.pt(119.01 MB)
