Code for ACL 2022 main conference paper "STEMM: Self-learning with Speech-text Manifold Mixup for Speech Translation".

Related tags

Text Data & NLPSTEMM
Overview

STEMM: Self-learning with Speech-Text Manifold Mixup for Speech Translation

This is a PyTorch implementation for the ACL 2022 main conference paper STEMM: Self-learning with Speech-text Manifold Mixup for Speech Translation.

Training a Model on MuST-C

Let's first take a look at training an En-De model as an example.

Enviroment Configuration

  1. Clone this repository:
git clone [email protected]:ictnlp/STEMM.git
cd STEMM/
  1. Install Montreal Forced Aligner following the official guidance. Please also download the pertained models and dictionary for MFA.

  2. Please make sure you have installed PyTorch, and then install fairseq and other packages as follows:

pip install --editable ./
python3 setup.py install --user
python3 setup.py build_ext --inplace
pip install inflect sentencepiece soundfile textgrid pandas

Data Preparation

  1. First make a directory to store the dataset:
TGT_LANG=de
MUSTC_ROOT=data/mustc/
mkdir -p $MUSTC_ROOT
  1. Download the MuST-C v1.0 archive MUSTC_v1.0_en-de.tar.gz to the $MUSTC_ROOT path, and uncompress it:
cd $MUSTC_ROOT
tar -xzvf MUSTC_v1.0_en-de.tar.gz
  1. Return to the root directory, run the preprocess script preprocess.sh, which will perform forced alignment and organize the raw data and alignment information into .tsv format for using:
sh preprocess.sh $TGT_LANG
  1. Finally, the directory $MUSTC_ROOT should look like this:
.
├── en-de
│   ├── config_raw.yaml
│   ├── data
│   ├── dev_raw_seg_plus.tsv
│   ├── docs
│   ├── segment
│   ├── spm_unigram10000_raw.model
│   ├── spm_unigram10000_raw.txt
│   ├── spm_unigram10000_raw.vocab
│   ├── train_raw_seg_plus.tsv
│   ├── tst-COMMON_raw_seg_plus.tsv
│   ├── tst-HE_raw_seg_plus.tsv
└── MUSTC_v1.0_en-de.tar.gz

Pretrain the MT Module

[OPTIONAL] Use External MT Corpus

If you want to use external MT corpus, please first pretrain a MT model on this corpus following these steps:

  1. Perform BPE on external corpus with the sentencepiece model learned on MuST-C. As we mentioned in our paper, we use WMT for En-De, En-Fr, En-Ru, En-Es, En-Ro, and OPUS100 for En-Pt, En-It, En-Nl as external corpus. You can download them from the internet and put them in the data/ext_en${TGT_LANG}/ directory. Run the following command and replace $input_file with the path of raw text to perform BPE. You should apply BPE to texts in both source and target language of all subset (train/valid/test).
python3 data/scripts/apply_spm.py --input-file $input_file --output-file $output_file --model data/mustc/en-${TGT_LANG}/spm_unigram10000_raw.model
  1. Use fairseq-preprocess command to convert the BPE texts into fairseq formats. Make sure to use the sentencepiece dictionary learned on MuST-C.
$spm_dict=data/mustc/en-${TGT_LANG}/spm_unigram10000_raw.txt
fairseq-preprocess --source-lang en --target-lang $TGT_LANG --trainpref data/ext_en${TGT_LANG}/train --validpref data/ext_en${TGT_LANG}/valid --testpref data/ext_en${TGT_LANG}/test --destdir data/ext_en${TGT_LANG}/binary --joined-dictionary --srcdict $spm_dict --tgtdict $spm_dict --workers=20 --nwordssrc 10000 --nwordstgt 10000
  1. Train the model using the following command:
sh pretrain_mt_ext.sh $TGT_LANG

Pretrain the MT module on MuST-C

  1. Run the following script to pretrain the MT module. The argument --load-pretrained-mt-encoder-decoder-from indicates the path of MT model pretrained on external corpus obtained in the last step.
sh pretrain_mt.sh $TGT_LANG
  1. To ensure consistent performance, we have released our checkpoints of pretrained MT modules. You can download them and directly use them do initialize the MT module in our model for the following experiments.
Direction Link
En-De https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_ende_mt.pt
En-Fr https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_enfr_mt.pt
En-Es https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_enes_mt.pt
En-Ro https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_enro_mt.pt
En-Ru https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_enru_mt.pt
En-Nl https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_ennl_mt.pt
En-It https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_enit_mt.pt
En-Pt https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_enpt_mt.pt

Training

  1. Download the pretrained wav2vec2.0 model from the official link, and put it in the checkpoints/ directory.
  2. Just run the training scripts:
sh train.sh $TGT_LANG

Evaluate

  1. Run the following script to average the last 10 checkpoints and evaluate on the tst-COMMON set:
sh test.sh mustc_en${TGT_LANG}_stmm_self_learning $TGT_LANG
  1. We also released our checkpoints as follows. You can download and evaluate them directly.
Direction Link
En-De https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_ende_stmm_self_learning.pt
En-Fr https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_enfr_stmm_self_learning.pt
En-Es https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_enes_stmm_self_learning.pt
En-Ro https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_enro_stmm_self_learning.pt
En-Ru https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_enru_stmm_self_learning.pt
En-Nl https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_ennl_stmm_self_learning.pt
En-It https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_enit_stmm_self_learning.pt
En-Pt https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_enpt_stmm_self_learning.pt

Citation

In this repository is useful for you, please cite as:

@inproceedings{fang-etal-2022-STEMM,
	title = {STEMM: Self-learning with Speech-text Manifold Mixup for Speech Translation},
	author = {Fang, Qingkai and Ye, Rong and Li, Lei and Feng, Yang and Wang, Mingxuan},
	booktitle = {Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics},
	year = {2022},
}

Contact

If you have any questions, feel free to contact me at [email protected].

Owner
ICTNLP
Natural Language Processing Group, Institute of Computing Technology, Chinese Academy of Sciences
ICTNLP
Reading Wikipedia to Answer Open-Domain Questions

DrQA This is a PyTorch implementation of the DrQA system described in the ACL 2017 paper Reading Wikipedia to Answer Open-Domain Questions. Quick Link

Facebook Research 4.3k Jan 01, 2023
A Python script which randomly chooses and prints a file from a directory.

___ ____ ____ _ __ ___ / _ \ | _ \ | _ \ ___ _ __ | '__| / _ \ | |_| || | | || | | | / _ \| '__| | | | __/ | _ || |_| || |_| || __

yesmaybenookay 0 Aug 06, 2021
A pytorch implementation of the ACL2019 paper "Simple and Effective Text Matching with Richer Alignment Features".

RE2 This is a pytorch implementation of the ACL 2019 paper "Simple and Effective Text Matching with Richer Alignment Features". The original Tensorflo

286 Jan 02, 2023
GCRC: A Gaokao Chinese Reading Comprehension dataset for interpretable Evaluation

GCRC GCRC: A New Challenging MRC Dataset from Gaokao Chinese for Explainable Eva

Yunxiao Zhao 5 Nov 04, 2022
【原神】自动演奏风物之诗琴的程序

疯物之诗琴 读取midi并自动演奏原神风物之诗琴。 可以自定义配置文件自动调整音符来适配风物之诗琴。 (原神1.4直播那天就开始做了!到现在才能放出来。。) 如何使用 在Release页面中下载打包好的程序和midi压缩包并解压。 双击运行“疯物之诗琴.exe”。 在原神中打开风物之诗琴,软件内输入

435 Jan 04, 2023
PyTorch Language Model for 1-Billion Word (LM1B / GBW) Dataset

PyTorch Large-Scale Language Model A Large-Scale PyTorch Language Model trained on the 1-Billion Word (LM1B) / (GBW) dataset Latest Results 39.98 Perp

Ryan Spring 114 Nov 04, 2022
Code for the Python code smells video on the ArjanCodes channel.

7 Python code smells This repository contains the code for the Python code smells video on the ArjanCodes channel (watch the video here). The example

55 Dec 29, 2022
A fast and lightweight python-based CTC beam search decoder for speech recognition.

pyctcdecode A fast and feature-rich CTC beam search decoder for speech recognition written in Python, providing n-gram (kenlm) language model support

Kensho 315 Dec 21, 2022
This repository contains the code, data, and models of the paper titled "CrossSum: Beyond English-Centric Cross-Lingual Abstractive Text Summarization for 1500+ Language Pairs".

CrossSum This repository contains the code, data, and models of the paper titled "CrossSum: Beyond English-Centric Cross-Lingual Abstractive Text Summ

BUET CSE NLP Group 29 Nov 19, 2022
NLP techniques such as named entity recognition, sentiment analysis, topic modeling, text classification with Python to predict sentiment and rating of drug from user reviews.

This file contains the following documents sumbited for Baruch CIS9665 group 9 fall 2021. 1. Dataset: drug_reviews.csv 2. python codes for text classi

Aarif Munwar Jahan 2 Jan 04, 2023
Sapiens is a human antibody language model based on BERT.

Sapiens: Human antibody language model ____ _ / ___| __ _ _ __ (_) ___ _ __ ___ \___ \ / _` | '_ \| |/ _ \ '

Merck Sharp & Dohme Corp. a subsidiary of Merck & Co., Inc. 13 Nov 20, 2022
Residual2Vec: Debiasing graph embedding using random graphs

Residual2Vec: Debiasing graph embedding using random graphs This repository contains the code for S. Kojaku, J. Yoon, I. Constantino, and Y.-Y. Ahn, R

SADAMORI KOJAKU 5 Oct 12, 2022
An open source library for deep learning end-to-end dialog systems and chatbots.

DeepPavlov is an open-source conversational AI library built on TensorFlow, Keras and PyTorch. DeepPavlov is designed for development of production re

Neural Networks and Deep Learning lab, MIPT 6k Dec 30, 2022
Implementation of Natural Language Code Search in the project CodeBERT: A Pre-Trained Model for Programming and Natural Languages.

CodeBERT-Implementation In this repo we have replicated the paper CodeBERT: A Pre-Trained Model for Programming and Natural Languages. We are interest

Tanuj Sur 4 Jul 01, 2022
A large-scale (194k), Multiple-Choice Question Answering (MCQA) dataset designed to address realworld medical entrance exam questions.

MedMCQA MedMCQA : A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering A large-scale, Multiple-Choice Question Answe

MedMCQA 24 Nov 30, 2022
FactSumm: Factual Consistency Scorer for Abstractive Summarization

FactSumm: Factual Consistency Scorer for Abstractive Summarization FactSumm is a toolkit that scores Factualy Consistency for Abstract Summarization W

devfon 83 Jan 09, 2023
Natural Language Processing

NLP Natural Language Processing apps Multilingual_NLP.py start #This script is demonstartion of Mul

Ritesh Sharma 1 Oct 31, 2021
A 30000+ Chinese MRC dataset - Delta Reading Comprehension Dataset

Delta Reading Comprehension Dataset 台達閱讀理解資料集 Delta Reading Comprehension Dataset (DRCD) 屬於通用領域繁體中文機器閱讀理解資料集。 本資料集期望成為適用於遷移學習之標準中文閱讀理解資料集。 本資料集從2,108篇

272 Dec 15, 2022
A text augmentation tool for named entity recognition.

neraug This python library helps you with augmenting text data for named entity recognition. Augmentation Example Reference from An Analysis of Simple

Hiroki Nakayama 48 Oct 11, 2022