A minimal Conformer ASR implementation adapted from ESPnet.

Overview

Conformer ASR

A minimal Conformer ASR implementation adapted from ESPnet.

Introduction

I want to use the pre-trained English ASR model provided by ESPnet. However, ESPnet is relatively heavy for me. So here I try to extract only the conformer ASR part from ESPnet so that I can do better customization. Let's do it.

There are bunch of models available for ASR listed here. I choose the one with name:

kamo-naoyuki/librispeech_asr_train_asr_conformer6_n_fft512_hop_length256_raw_en_bpe5000_scheduler_confwarmup_steps40000_optim_conflr0.0025_sp_valid.acc.ave
Its performance can be found [here](https://zenodo.org/record/4604066#.YbxsX5FByV4), toggle me to see.
  • WER
dataset Snt Wrd Corr Sub Del Ins Err S.Err
decode_asr_asr_model_valid.acc.ave/dev_clean 2703 54402 97.9 1.9 0.2 0.2 2.3 28.6
decode_asr_asr_model_valid.acc.ave/dev_other 2864 50948 94.5 5.1 0.5 0.6 6.1 48.3
decode_asr_asr_model_valid.acc.ave/test_clean 2620 52576 97.7 2.1 0.2 0.3 2.6 31.4
decode_asr_asr_model_valid.acc.ave/test_other 2939 52343 94.7 4.9 0.5 0.7 6.0 49.0
decode_asr_lm_lm_train_lm_transformer2_bpe5000_scheduler_confwarmup_steps25000_batch_bins500000000_accum_grad2_use_amptrue_valid.loss.ave_asr_model_valid.acc.ave/dev_clean 2703 54402 98.3 1.5 0.2 0.2 1.9 25.2
decode_asr_lm_lm_train_lm_transformer2_bpe5000_scheduler_confwarmup_steps25000_batch_bins500000000_accum_grad2_use_amptrue_valid.loss.ave_asr_model_valid.acc.ave/dev_other 2864 50948 95.8 3.7 0.4 0.5 4.6 40.0
decode_asr_lm_lm_train_lm_transformer2_bpe5000_scheduler_confwarmup_steps25000_batch_bins500000000_accum_grad2_use_amptrue_valid.loss.ave_asr_model_valid.acc.ave/test_clean 2620 52576 98.1 1.7 0.2 0.3 2.1 26.2
decode_asr_lm_lm_train_lm_transformer2_bpe5000_scheduler_confwarmup_steps25000_batch_bins500000000_accum_grad2_use_amptrue_valid.loss.ave_asr_model_valid.acc.ave/test_other 2939 52343 95.8 3.7 0.5 0.5 4.7 42.4
  • CER
dataset Snt Wrd Corr Sub Del Ins Err S.Err
decode_asr_asr_model_valid.acc.ave/dev_clean 2703 288456 99.4 0.3 0.2 0.2 0.8 28.6
decode_asr_asr_model_valid.acc.ave/dev_other 2864 265951 98.0 1.2 0.8 0.7 2.7 48.3
decode_asr_asr_model_valid.acc.ave/test_clean 2620 281530 99.4 0.3 0.3 0.3 0.9 31.4
decode_asr_asr_model_valid.acc.ave/test_other 2939 272758 98.2 1.0 0.7 0.7 2.5 49.0
decode_asr_lm_lm_train_lm_transformer2_bpe5000_scheduler_confwarmup_steps25000_batch_bins500000000_accum_grad2_use_amptrue_valid.loss.ave_asr_model_valid.acc.ave/dev_clean 2703 288456 99.5 0.3 0.2 0.2 0.7 25.2
decode_asr_lm_lm_train_lm_transformer2_bpe5000_scheduler_confwarmup_steps25000_batch_bins500000000_accum_grad2_use_amptrue_valid.loss.ave_asr_model_valid.acc.ave/dev_other 2864 265951 98.3 1.0 0.7 0.5 2.2 40.0
decode_asr_lm_lm_train_lm_transformer2_bpe5000_scheduler_confwarmup_steps25000_batch_bins500000000_accum_grad2_use_amptrue_valid.loss.ave_asr_model_valid.acc.ave/test_clean 2620 281530 99.5 0.3 0.3 0.2 0.7 26.2
decode_asr_lm_lm_train_lm_transformer2_bpe5000_scheduler_confwarmup_steps25000_batch_bins500000000_accum_grad2_use_amptrue_valid.loss.ave_asr_model_valid.acc.ave/test_other 2939 272758 98.5 0.8 0.7 0.5 2.1 42.4
  • TER
dataset Snt Wrd Corr Sub Del Ins Err S.Err
decode_asr_asr_model_valid.acc.ave/dev_clean 2703 68010 97.5 1.9 0.7 0.4 2.9 28.6
decode_asr_asr_model_valid.acc.ave/dev_other 2864 63110 93.4 5.0 1.6 1.0 7.6 48.3
decode_asr_asr_model_valid.acc.ave/test_clean 2620 65818 97.2 2.0 0.8 0.4 3.3 31.4
decode_asr_asr_model_valid.acc.ave/test_other 2939 65101 93.7 4.5 1.8 0.9 7.2 49.0
decode_asr_lm_lm_train_lm_transformer2_bpe5000_scheduler_confwarmup_steps25000_batch_bins500000000_accum_grad2_use_amptrue_valid.loss.ave_asr_model_valid.acc.ave/dev_clean 2703 68010 97.8 1.5 0.7 0.3 2.5 25.2
decode_asr_lm_lm_train_lm_transformer2_bpe5000_scheduler_confwarmup_steps25000_batch_bins500000000_accum_grad2_use_amptrue_valid.loss.ave_asr_model_valid.acc.ave/dev_other 2864 63110 94.6 3.8 1.6 0.7 6.1 40.0
decode_asr_lm_lm_train_lm_transformer2_bpe5000_scheduler_confwarmup_steps25000_batch_bins500000000_accum_grad2_use_amptrue_valid.loss.ave_asr_model_valid.acc.ave/test_clean 2620 65818 97.6 1.6 0.8 0.3 2.7 26.2
decode_asr_lm_lm_train_lm_transformer2_bpe5000_scheduler_confwarmup_steps25000_batch_bins500000000_accum_grad2_use_amptrue_valid.loss.ave_asr_model_valid.acc.ave/test_other 2939 65101 94.7 3.5 1.8 0.7 6.0 42.4

ASR step by step

1. Setup code

pip install .

2. Download the model and unzip it

wget https://zenodo.org/record/4604066/files/asr_train_asr_conformer6_n_fft512_hop_length256_raw_en_bpe5000_scheduler_confwarmup_steps40000_optim_conflr0.0025_sp_valid.acc.ave.zip?download=1 -o conformer.zip
unzip conformer.zip

3. Run an example

import torch
import librosa
from mmds.utils.spectrogram import MelSpectrogram
from conformer_asr import Conformer, Tokenizer

sample_rate = 16000
cfg_path = "./exp_unnorm/asr_train_asr_conformer6_n_fft512_hop_length256_raw_en_unnorm_bpe5000/config.yaml"
bpe_path = "./data/en_unnorm_token_list/bpe_unigram5000/bpe.model"
ckpt_path = "./exp_unnorm/asr_train_asr_conformer6_n_fft512_hop_length256_raw_en_unnorm_bpe5000/valid.acc.ave_10best.pth"

tokenizer = Tokenizer(cfg_path, bpe_path)
conformer = Conformer(tokenizer, ckpt_path=ckpt_path)
conformer.eval()

spec_fn = MelSpectrogram(
    sample_rate,
    hop_length=256,
    f_min=0,
    f_max=8000,
    win_length=512,
    power=2,
)

w0, _ = librosa.load("./example.m4a", sample_rate)
w0 = torch.from_numpy(w0)
m0 = spec_fn(w0).t()

l = len(m0)

# create batch with different length audio (yes, supported)
x = [m0, m0[: l // 2], m0[: l // 4]]

ref = "This is a test video for youtube-dl. For more information, contact [email protected]".lower()
hyps = conformer.decode(x, beam_width=20)

print("REF", ref)
for hyp in hyps:
    print("HYP", hyp.lower())
  • Results
REF this is a test video for youtube-dl. for more information, contact [email protected]
HYP this is a test video for you do bl for more information -- contact the hih aging at the hihaging, not the
HYP this is a test for you d bl for more information
HYP this is a testim for you to

Features

Supported

  • Batched decoding

Not supported yet

  • Transformer language model
  • Other checkpoints
Owner
Niu Zhe
Niu Zhe
Text-to-Speech for Belarusian language

title emoji colorFrom colorTo sdk app_file pinned Belarusian TTS 🐸 green green gradio app.py false Belarusian TTS 📢 🤖 Belarusian TTS (text-to-speec

Yurii Paniv 1 Nov 27, 2021
Summarization module based on KoBART

KoBART-summarization Install KoBART pip install git+https://github.com/SKT-AI/KoBART#egg=kobart Requirements pytorch==1.7.0 transformers==4.0.0 pytor

seujung hwan, Jung 148 Dec 28, 2022
Semi-automated vocabulary generation from semantic vector models

vec2word Semi-automated vocabulary generation from semantic vector models This script generates a list of potential conlang word forms along with asso

9 Nov 25, 2022
An ActivityWatch watcher to pose questions to the user and record her answers.

aw-watcher-ask An ActivityWatch watcher to pose questions to the user and record her answers. This watcher uses Zenity to present dialog boxes to the

Bernardo Chrispim Baron 33 Dec 03, 2022
Cải thiện Elasticsearch trong bài toán semantic search sử dụng phương pháp Sentence Embeddings

Cải thiện Elasticsearch trong bài toán semantic search sử dụng phương pháp Sentence Embeddings Trong bài viết này mình sẽ sử dụng pretrain model SimCS

Vo Van Phuc 18 Nov 25, 2022
Test finetuning of XLSR (multilingual wav2vec 2.0) for other speech classification tasks

wav2vec_finetune Test finetuning of XLSR (multilingual wav2vec 2.0) for other speech classification tasks Initial test: gender recognition on this dat

8 Aug 11, 2022
中文生成式预训练模型

T5 PEGASUS 中文生成式预训练模型,以mT5为基础架构和初始权重,通过类似PEGASUS的方式进行预训练。 详情可见:https://kexue.fm/archives/8209 Tokenizer 我们将T5 PEGASUS的Tokenizer换成了BERT的Tokenizer,它对中文更

410 Jan 03, 2023
Code for EMNLP'21 paper "Types of Out-of-Distribution Texts and How to Detect Them"

Code for EMNLP'21 paper "Types of Out-of-Distribution Texts and How to Detect Them"

Udit Arora 19 Oct 28, 2022
An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition

CRNN paper:An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition 1. create your ow

Tsukinousag1 3 Apr 02, 2022
sangha, pronounced "suhng-guh", is a social networking, booking platform where students and teachers can share their practice.

Flask React Project This is the backend for the Flask React project. Getting started Clone this repository (only this branch) git clone https://github

Courtney Newcomer 17 Sep 29, 2021
FB ID CLONER WUTHOT CHECKPOINT, FACEBOOK ID CLONE FROM FILE

* MY SOCIAL MEDIA : Programming And Memes Want to contact Mr. Error ? CONTACT : [ema

Mr. Error 9 Jun 17, 2021
gaiic2021-track3-小布助手对话短文本语义匹配复赛rank3、决赛rank4

决赛答辩已经过去一段时间了,我们队伍ac milan最终获得了复赛第3,决赛第4的成绩。在此首先感谢一些队友的carry~ 经过2个多月的比赛,学习收获了很多,也认识了很多大佬,在这里记录一下自己的参赛体验和学习收获。

102 Dec 19, 2022
Kestrel Threat Hunting Language

Kestrel Threat Hunting Language What is Kestrel? Why we need it? How to hunt with XDR support? What is the science behind it? You can find all the ans

Open Cybersecurity Alliance 201 Dec 16, 2022
A Plover python dictionary allowing for consistent symbol input with specification of attachment and capitalisation in one stroke.

Emily's Symbol Dictionary Design This dictionary was created with the following goals in mind: Have a consistent method to type (pretty much) every sy

Emily 68 Jan 07, 2023
Practical Machine Learning with Python

Master the essential skills needed to recognize and solve complex real-world problems with Machine Learning and Deep Learning by leveraging the highly popular Python Machine Learning Eco-system.

Dipanjan (DJ) Sarkar 2k Jan 08, 2023
中文空间语义理解评测

中文空间语义理解评测 最新消息 2021-04-10 🚩 排行榜发布: Leaderboard 2021-04-05 基线系统发布: SpaCE2021-Baseline 2021-04-05 开放数据提交: 提交结果 2021-04-01 开放报名: 我要报名 2021-04-01 数据集 pa

40 Jan 04, 2023
DeBERTa: Decoding-enhanced BERT with Disentangled Attention

DeBERTa: Decoding-enhanced BERT with Disentangled Attention This repository is the official implementation of DeBERTa: Decoding-enhanced BERT with Dis

Microsoft 1.2k Jan 03, 2023
A number of methods in order to perform Natural Language Processing on live data derived from Twitter

A number of methods in order to perform Natural Language Processing on live data derived from Twitter

1 Nov 24, 2021
MicBot - MicBot uses Google Translate to speak everyone's chat messages

MicBot MicBot uses Google Translate to speak everyone's chat messages. It can al

2 Mar 09, 2022
Implementation of Fast Transformer in Pytorch

Fast Transformer - Pytorch Implementation of Fast Transformer in Pytorch. This only work as an encoder. Yannic video AI Epiphany Install $ pip install

Phil Wang 167 Dec 27, 2022