Official code for Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset

Overview

Official code for our Interspeech 2021 - Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset [1]*.

Visually-grounded spoken language datasets can enable models to learn cross-modal correspondences with very weak supervision. However, modern audio-visual datasets contain biases that undermine the real-world performance of models trained on that data. We introduce Spoken ObjectNet, which is designed to remove some of these biases and provide a way to better evaluate how effectively models will perform in real-world scenarios. This dataset expands upon ObjectNet, which is a bias-controlled image dataset that features similar image classes to those present in ImageNet.

*Note: please see the ArXiv version for additional results on the test set.

Setup

  1. Clone this module and any submodules: git clone --recurse-submodules [email protected]:iapalm/Spoken-ObjectNet.git
  2. Follow the directions in data.md to set up ObjectNet images and the Spoken ObjectNet-50k corpus
  3. This code was tested with PyTorch 1.9 with CUDA 10.2 and Python 3.8.8.
  4. To train the models with the code as-is, we use 2 GPUs with 11 Gb of memory. A single GPU can be used, but the batch size or other parameters should be reduced.
  5. Note about the speed of this code: This code will work as-is on the Spoken ObjectNet audio captions, but the speed could be greatly improved. A main bottleneck is the resampling of the audio wav files from 48 kHz to 16 kHz, which is done with librosa here. We suggest to pre-process the audio files into the desired format first, and then remove this line or the on-the-fly spectrogram conversion entirely. We estimate the speed will improve 5x.
  6. On our servers, the zero-shot evaluation takes around 20-30 minutes and training takes around 4-5 days. As mentioned in the previous point, this could be improved with audio pre-processing.

Running Experiments

We support 3 experiments that can be used as baselines for future work:

  • (1) Zero-shot evaluation of the ResDAVEnet-VQ model trained on Places-400k [2].
  • (2) Fine-tuning the ResDAVEnet-VQ model trained on Places-400k on Spoken ObjectNet with a frozen image branch .
  • (3) Training the ResDAVEnet-VQ model from scratch on Spoken ObjectNet with a frozen image branch.
  • Note: fine-tuning the image branch on Spoken ObjectNet is not permitted, but fine-tuning the audio branch is allowed.

Zero-shot transfer from Places-400k

  • Download and extract the directory containing the model weights from this link. Keep the folder named RDVQ_00000 and move it to the exps directory.
  • In scripts/train.sh, change data_dt to data/Spoken-ObjectNet-50k/metadata/SON-test.json to evaluate on the test set instead of the validation set.
  • Run the following command for zero-shot evaluation: source scripts/train.sh 00000 RDVQ_00000 "--resume True --mode eval"
  • The results are printed in exps/RDVQ_00000_transfer/train.out

Fine-tune the model from Places-400k

  • Download and extract the directory containing the args.pkl file which specifies the fine-tuning arguments. The directory at this link contains the args.pkl file as well as the model weights.
  • The model weights of the fine-tuned model are provided for easier evaluation. Run the following command to evaluate the model using those weights: source scripts/train.sh 00000 RDVQ_00000_finetune "--resume True --mode eval"
  • Otherwise, to fine-tune the model yourself, first move the model weights to a new folder model_dl, then make a new folder model to save the new weights, and then run the following command: source scripts/train.sh 00000 RDVQ_00000_finetune "--resume True". This still require the args.pkl file mentioned previously.
  • Plese note the value of data_dt in scripts/train.sh. The code saves the best performing model during training, which is why it should be set to the validation set during training. During evaluation, it loads the best performing model, which is why it should be set to the test set during evaluation.

Train the model from scratch on Spoken ObjectNet

  • Run the following command to train the model from scratch: source scripts/train.sh 00000 RDVQ_scratch_frozen "--lr 0.001 --freeze-image-model True"
  • The model weights can be evaulated with source scripts/train.sh 00000 RDVQ_scratch_frozen "--resume True --mode eval"
  • We also provide the trained model weights at this link.
  • Plese note the value of data_dt in scripts/train.sh. The code saves the best performing model during training, which is why it should be set to the validation set during training. During evaluation, it loads the best performing model, which is why it should be set to the test set during evaluation.

Contact

If You find any problems or have any questions, please open an issue and we will try to respond as soon as possible. You can also try emailing the first corresponding author.

References

[1] Palmer, I., Rouditchenko, A., Barbu, A., Katz, B., Glass, J. (2021) Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset. Proc. Interspeech 2021, 3650-3654, doi: 10.21437/Interspeech.2021-245

[2] David Harwath*, Wei-Ning Hsu*, and James Glass. Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech. Proc. International Conference on Learning Representations (ICLR), 2020

Spoken ObjectNet - Bibtex:

@inproceedings{palmer21_interspeech,
  author={Ian Palmer and Andrew Rouditchenko and Andrei Barbu and Boris Katz and James Glass},
  title={{Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={3650--3654},
  doi={10.21437/Interspeech.2021-245}
}
Owner
Ian Palmer
Ian Palmer
A simple word search made in python

Word Search Puzzle A simple word search made in python Usage $ python3 main.py -h usage: main.py [-h] [-c] [-f FILE] Generates a word s

Magoninho 16 Mar 10, 2022
Comprehensive-E2E-TTS - PyTorch Implementation

A Non-Autoregressive End-to-End Text-to-Speech (text-to-wav), supporting a family of SOTA unsupervised duration modelings. This project grows with the research community, aiming to achieve the ultima

Keon Lee 114 Nov 13, 2022
Applied Natural Language Processing in the Enterprise - An O'Reilly Media Publication

Applied Natural Language Processing in the Enterprise This is the companion repo for Applied Natural Language Processing in the Enterprise, an O'Reill

Applied Natural Language Processing in the Enterprise 95 Jan 05, 2023
Universal End2End Training Platform, including pre-training, classification tasks, machine translation, and etc.

背景 安装教程 快速上手 (一)预训练模型 (二)机器翻译 (三)文本分类 TenTrans 进阶 1. 多语言机器翻译 2. 跨语言预训练 背景 TrenTrans是一个统一的端到端的多语言多任务预训练平台,支持多种预训练方式,以及序列生成和自然语言理解任务。 安装教程 git clone git

Tencent Minority-Mandarin Translation Team 42 Dec 20, 2022
Deep learning for NLP crash course at ABBYY.

Deep NLP Course at ABBYY Deep learning for NLP crash course at ABBYY. Suggested textbook: Neural Network Methods in Natural Language Processing by Yoa

Dan Anastasyev 597 Dec 18, 2022
Help you discover excellent English projects and get rid of disturbing by other spoken language

GitHub English Top Charts 「Help you discover excellent English projects and get

GrowingGit 544 Jan 09, 2023
Code for lyric-section-to-comment generation based on huggingface transformers.

CommentGeneration Code for lyric-section-to-comment generation based on huggingface transformers. Migrate Guyu model and code (both 12-layers and 24-l

Yawei Sun 8 Sep 04, 2021
What are the best Systems? New Perspectives on NLP Benchmarking

What are the best Systems? New Perspectives on NLP Benchmarking In Machine Learning, a benchmark refers to an ensemble of datasets associated with one

Pierre Colombo 12 Nov 03, 2022
Words_And_Phrases - Just a repo for useful words and phrases that might come handy in some scenarios. Feel free to add yours

Words_And_Phrases Just a repo for useful words and phrases that might come handy in some scenarios. Feel free to add yours Abbreviations Abbreviation

Subhadeep Mandal 1 Feb 01, 2022
A highly sophisticated sequence-to-sequence model for code generation

CoderX A proof-of-concept AI system by Graham Neubig (June 30, 2021). About CoderX CoderX is a retrieval-based code generation AI system reminiscent o

Graham Neubig 39 Aug 03, 2021
Grapheme-to-phoneme (G2P) conversion is the process of generating pronunciation for words based on their written form.

Neural G2P to portuguese language Grapheme-to-phoneme (G2P) conversion is the process of generating pronunciation for words based on their written for

fluz 11 Nov 16, 2022
COVID-19 Chatbot with Rasa 2.0: open source conversational AI

COVID-19 chatbot implementation with Rasa open source 2.0, conversational AI framework.

Aazim Parwaz 1 Dec 23, 2022
Kurumi ChatBot

KurumiChatBot Just another Telegram AI chat bot written in Python using Pyrogram. A public running instance can be found on telegram as @TokisakiChatB

Yoga Pranata 3 Jun 28, 2022
Implementing SimCSE(paper, official repository) using TensorFlow 2 and KR-BERT.

KR-BERT-SimCSE Implementing SimCSE(paper, official repository) using TensorFlow 2 and KR-BERT. Training Unsupervised python train_unsupervised.py --mi

Jeong Ukjae 27 Dec 12, 2022
Black for Python docstrings and reStructuredText (rst).

Style-Doc Style-Doc is Black for Python docstrings and reStructuredText (rst). It can be used to format docstrings (Google docstring format) in Python

Telekom Open Source Software 13 Oct 24, 2022
NeoDays-based tileset for the roguelike CDDA (Cataclysm Dark Days Ahead)

NeoDaysPlus Reduced contrast, expanded, and continuously developed version of the CDDA tileset NeoDays that's being completed with new sprites for mis

0 Nov 12, 2022
Creating a Feed of MISP Events from ThreatFox (by abuse.ch)

ThreatFox2Misp Creating a Feed of MISP Events from ThreatFox (by abuse.ch) What will it do? This will fetch IOCs from ThreatFox by Abuse.ch, convert t

17 Nov 22, 2022
Repository of the Code to Chatbots, developed in Python

Description In this repository you will find the Code to my Chatbots, developed in Python. I'll explain the structure of this Repository later. Requir

Li-am K. 0 Oct 25, 2022
👄 The most accurate natural language detection library for Python, suitable for long and short text alike

1. What does this library do? Its task is simple: It tells you which language some provided textual data is written in. This is very useful as a prepr

Peter M. Stahl 334 Dec 30, 2022
WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.

WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.

Google Research Datasets 740 Dec 24, 2022