Model for recasing and repunctuating ASR transcripts

Last update: Dec 29, 2022

Overview

Recasing and punctuation model based on BERT

Benoit Favre 2021

This system converts a sequence of lowercase tokens without punctuation to a sequence of cased tokens with punctuation.

It is trained to predict both aspects at the token level in a multitask fashion, from fine-tuned BERT representations.

The model predicts the following recasing labels:

lower: keep lowercase
upper: convert to upper case
capitalize: set first letter as upper case
other: left as is
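
As a minimal sketch of how the four recasing labels could be applied to a lowercase token (the function name `apply_case` and the example tokens are illustrative assumptions, not taken from recasepunc.py):

```python
def apply_case(token: str, label: str) -> str:
    """Apply one of the four recasing labels to a token."""
    if label == "lower":
        return token.lower()
    if label == "upper":
        return token.upper()
    if label == "capitalize":
        # Upper-case only the first character and leave the rest untouched.
        return token[:1].upper() + token[1:]
    if label == "other":
        # Mixed or unusual casing: the token is passed through unchanged.
        return token
    raise ValueError(f"unknown case label: {label!r}")


tokens = ["the", "nasa", "launch"]
labels = ["capitalize", "upper", "lower"]
recased = " ".join(apply_case(t, l) for t, l in zip(tokens, labels))
print(recased)  # The NASA launch
```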

And the following punctuation labels:

o: no punctuation
period: .
comma: ,
question: ?
exclamation: !
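
The punctuation labels above can be turned back into text with a simple lookup. This sketch assumes that the predicted mark is attached directly after its token, with no language-specific spacing rules; the labels in the example are hand-written for illustration and are not actual model output.

```python
# Mapping from punctuation label to the character it stands for.
PUNCTUATION = {
    "o": "",
    "period": ".",
    "comma": ",",
    "question": "?",
    "exclamation": "!",
}


def attach_punctuation(tokens, labels):
    """Append the labelled punctuation mark (if any) to each token."""
    return " ".join(tok + PUNCTUATION[lab] for tok, lab in zip(tokens, labels))


tokens = ["j'aime", "les", "fleurs", "les", "olives", "et", "la", "raclette"]
labels = ["o", "o", "comma", "o", "o", "o", "o", "period"]
print(attach_punctuation(tokens, labels))
# j'aime les fleurs, les olives et la raclette.
```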

Input tokens are batched as sequences of length 256 that are processed independently without overlap.
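
A rough sketch of non-overlapping windowing, assuming that token ids are held in a plain list and that the last, shorter window is padded with a hypothetical padding id of 0 (how recasepunc.py handles the final window is not stated here):

```python
def chunk(ids, max_length=256, pad_id=0):
    """Split ids into consecutive, non-overlapping windows of max_length."""
    windows = []
    for start in range(0, len(ids), max_length):
        window = ids[start:start + max_length]
        # Pad the final, shorter window up to max_length (assumed behaviour).
        window = window + [pad_id] * (max_length - len(window))
        windows.append(window)
    return windows


windows = chunk(list(range(1, 601)))
print(len(windows), [len(w) for w in windows])  # 3 [256, 256, 256]
```

Because the windows do not overlap, a token near a window boundary sees context on only one side, which is the price of processing each window independently.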

In training, batches containing fewer than 256 tokens are simulated by drawing a length uniformly at random and replacing all tokens and labels after that point with padding (a technique called Cut-drop).

Changelog:
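
The idea can be sketched as follows. The function name, the padding values, the bounds of the uniform draw and the choice to apply the cut per sequence are all assumptions made for illustration; the default probability of 0.1 matches the value listed for the French model.

```python
import random


def cut_drop(tokens, labels, pad_token=0, pad_label=0, probability=0.1, rng=random):
    """With the given probability, keep a uniformly drawn prefix and pad the rest."""
    if rng.random() >= probability:
        return list(tokens), list(labels)
    # Draw the kept length uniformly; the bounds 1..len(tokens) are an assumption.
    cut = rng.randint(1, len(tokens))
    n_pad = len(tokens) - cut
    return (
        list(tokens[:cut]) + [pad_token] * n_pad,
        list(labels[:cut]) + [pad_label] * n_pad,
    )


rng = random.Random(0)
x, y = cut_drop(list(range(1, 257)), [1] * 256, probability=1.0, rng=rng)
kept = sum(1 for value in x if value != 0)
print(len(x), kept)  # total length stays 256; only `kept` positions carry real tokens
```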

Fix generation when input is smaller than max length

Installation

Use your favourite method for installing Python requirements. For example:

python -mvenv env
. env/bin/activate
pip3 install -r requirements.txt -f https://download.pytorch.org/whl/torch_stable.html

Prediction

Predict from raw text:

python recasepunc.py predict checkpoint/path.iteration < input.txt > output.txt
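
The same command can be driven from Python. The helper names below are hypothetical, and the sketch assumes that recasepunc.py reads and writes UTF-8; passing an explicit `encoding` avoids relying on the platform's default, which differs between systems.

```python
import subprocess
import sys


def build_predict_command(checkpoint, script="recasepunc.py"):
    """Return the argv list for the documented `predict` action."""
    return [sys.executable, script, "predict", checkpoint]


def predict(text, checkpoint, script="recasepunc.py"):
    """Pipe text through the predict action and return what it prints."""
    result = subprocess.run(
        build_predict_command(checkpoint, script),
        input=text,
        capture_output=True,
        text=True,
        encoding="utf-8",  # explicit, instead of the platform default
        check=True,        # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout


print(build_predict_command("checkpoint/path.iteration")[1:])
# ['recasepunc.py', 'predict', 'checkpoint/path.iteration']
```

Note that every call to `predict` starts a new process and reloads the checkpoint, so this is only practical for occasional use.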

Models

French: fr-txt.large.19000 trained on 160M tokens from Common Crawl
- Iterations: 19000
- Batch size: 16
- Max length: 256
- Seed: 871253
- Cut-drop probability: 0.1
- Train loss: 0.021128975618630648
- Valid loss: 0.015684964135289192
- Recasing accuracy: 96.73
- Punctuation accuracy: 95.02
  - All punctuation F-score: 67.79
  - Comma F-score: 67.94
  - Period F-score: 72.91
  - Question F-score: 57.57
  - Exclamation mark F-score: 15.78
- Training data: First 100M words from Common Crawl
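
Punctuation accuracy is much higher than the punctuation F-scores because most tokens carry the "o" label, so predicting "no punctuation" is usually correct. The sketch below illustrates this with made-up labels, assuming the F-score is the standard per-label F1 (the README does not say how the scores were computed).

```python
def f_score(reference, predicted, label):
    """F1 for one label: harmonic mean of precision and recall."""
    pairs = list(zip(reference, predicted))
    tp = sum(r == label and p == label for r, p in pairs)
    fp = sum(r != label and p == label for r, p in pairs)
    fn = sum(r == label and p != label for r, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# 20 tokens, 18 of them without punctuation; the comma is missed.
reference = ["o"] * 18 + ["comma", "period"]
predicted = ["o"] * 18 + ["o", "period"]

accuracy = sum(r == p for r, p in zip(reference, predicted)) / len(reference)
print(accuracy)                                  # 0.95
print(f_score(reference, predicted, "comma"))    # 0.0
print(f_score(reference, predicted, "period"))   # 1.0
```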

Training

Note: Adjust the file names in the commands below to match your data. Training tensors are precomputed and loaded into CPU memory.

Stage 0: download text data

Stage 1: tokenize and normalize text with Moses tokenizer, and extract recasing and repunctuation labels

python recasepunc.py preprocess < input.txt > input.case+punc

Stage 2: sub-tokenize with FlauBERT tokenizer, and generate PyTorch tensors

python recasepunc.py tensorize input.case+punc input.case+punc.x input.case+punc.y

Stage 3: train model

python recasepunc.py train train.x train.y valid.x valid.y checkpoint/path

Stage 4: evaluate performance on a test set

python recasepunc.py eval checkpoint/path.iteration test.x test.y
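
The stage commands use placeholder file names, so each data split needs its own set of files. One possible naming scheme is sketched below; the split names, the `.txt` input names and the helper functions are assumptions for illustration, and the functions only build command strings without running anything.

```python
def preprocessing_plan(split, script="recasepunc.py"):
    """Commands for stages 1 and 2 applied to one data split."""
    labelled = f"{split}.case+punc"
    return [
        f"python {script} preprocess < {split}.txt > {labelled}",
        f"python {script} tensorize {labelled} {split}.x {split}.y",
    ]


def training_plan(checkpoint="checkpoint/path", script="recasepunc.py"):
    """Commands for stages 3 and 4, using the files produced above."""
    return [
        f"python {script} train train.x train.y valid.x valid.y {checkpoint}",
        # `.iteration` is the README's placeholder for the saved update number.
        f"python {script} eval {checkpoint}.iteration test.x test.y",
    ]


for split in ("train", "valid", "test"):
    for command in preprocessing_plan(split):
        print(command)
for command in training_plan():
    print(command)
```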

Comments

Is it possible to customize for new language?

Dear Benoit Favre,

Your project is really important! Is it possible to customize it for a new language? If yes, could you give some short hints on how to do it?

Thank you in advance!

opened by ican24 5

Can't get attribute 'WordpieceTokenizer'

Hi, thanks for your effort on developing recasepunc! I know that you can't provide help for models not trained by you, but maybe you have an idea what's going wrong here:

I'm loading the model vosk-recasepunc-de-0.21 from https://alphacephei.com/vosk/models. When I do so, torch tells me that it can't find WordpieceTokenizer. Do you know why? Is the model incompatible?

Punc predict path: C:\Users\admin\meety\vosk-recasepunc-de-0.21\checkpoint
Traceback (most recent call last):
  File "main2.py", line 120, in <module>
    t = transcriber()
  File "main2.py", line 32, in __init__
    self.casePuncPredictor = CasePuncPredictor(punc_predict_path, lang="de")
  File "C:\Users\admin\meety\recasepunc.py", line 273, in __init__
    loaded = torch.load(checkpoint_path, map_location=device if torch.cuda.is_available() else 'cpu')
  File "C:\Users\admin\Anaconda3\envs\meety\lib\site-packages\torch\serialization.py", line 607, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "C:\Users\admin\Anaconda3\envs\meety\lib\site-packages\torch\serialization.py", line 882, in _load
    result = unpickler.load()
  File "C:\Users\admin\Anaconda3\envs\meety\lib\site-packages\torch\serialization.py", line 875, in find_class
    return super().find_class(mod_name, name)
AttributeError: Can't get attribute 'WordpieceTokenizer' on <module '__main__' from 'main2.py'>

opened by padmalcom 4

Can't do inference

Hello, I'm trying to use example.py with a French model (fr.22000 or fr-txt.large.19000), but I get this error:

    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for Model: Unexpected key(s) in state_dict: "bert.position_ids".

I also tried the following command, with the same error in the output:

python recasepunc.py predict fr.22000 < toto.txt > output.txt

Do you have any advice? Thanks

opened by MatFrancois 3

Memory usage

Hi, at startup the punctuation app uses about 9 GB of RAM, but only for a moment (while loading the model). After that it needs about 1.5 GB. Can the 9 GB at startup be reduced? Maybe the model is checked at startup, and that feature could be turned off?

opened by gubri 1

Russian model doesn't work, while English does

When I use the Russian model, it gives me this error:

WARNING: reverting to cpu as cuda is not available
Some weights of the model checkpoint at DeepPavlov/rubert-base-cased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight']

 File "C:\pypy\rus\recasepunc.py", line 741, in <module>
    main(config, config.action, config.action_args)
  File "C:\pypy\rus\recasepunc.py", line 715, in main
    generate_predictions(config, *args)
  File "C:\pypy\rus\recasepunc.py", line 349, in generate_predictions
    for line in sys.stdin:
  File "C:\Users\Xenia\AppData\Local\Programs\Python\Python39\lib\codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xef in position 0: invalid continuation byte

  File "C:\Users\Xenia\AppData\Local\Programs\Python\Python39\lib\site-packages\flask\app.py", line 1796, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
  File "C:\pypy\app.py", line 32, in process_audio
    cased = subprocess.check_output('python rus/recasepunc.py predict rus/checkpoint', shell=True, text=True, input=text)
  File "C:\Users\Xenia\AppData\Local\Programs\Python\Python39\lib\subprocess.py", line 420, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "C:\Users\Xenia\AppData\Local\Programs\Python\Python39\lib\subprocess.py", line 524, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'python rus/recasepunc.py predict rus/checkpoint' returned non-zero exit status 1.

Sorry for the long message, I'm not sure which of these messages is the most important. Should I use another version of transformers? I use transformers==4.16.2 and it works fine with the English model.

opened by xenia19 0

Export model to be used in C++

Is it possible to export the model to something that can be used in C++ with libtorch?

- export an existing model (a checkpoint provided in this repo)
- export a model after I train it with my own data

Which of the options above is possible, or both?

opened by leohuang2013 0

While running pretrained German model: AttributeError: Can't get attribute 'Trie' on

I am trying to use pretrained German model:

https://alphacephei.com/vosk/models/vosk-recasepunc-de-0.21.zip

and, as mentioned in the readme file, I run:

python example.py de-test.txt

but I keep getting the following error:

AttributeError: Can't get attribute 'Trie' on <module 'transformers.tokenization_utils' from '/home/ali/ali_initos_work/internal/data_science/speech_to_text/vosk/vosk_env/lib/python3.7/site-packages/transformers/tokenization_utils.py'>

Any idea if the model itself is wrong?

opened by alihashaam 2

RuntimeError when predicting with the French models

I tried to use the French models (both fr.22000 and fr-txt.large.19000) on a very simple text:

j'aime les fleurs les olives et la raclette

When running python3 recasepunc.py predict fr.22000 < input.txt > output.txt (or with the other model), I get the following RuntimeError:

Traceback (most recent call last):
  File "/home/mael/charly/recasepunc/recasepunc.py", line 733, in <module>
    main(config, config.action, config.action_args)
  File "/home/mael/charly/recasepunc/recasepunc.py", line 707, in main
    generate_predictions(config, *args)
  File "/home/mael/charly/recasepunc/recasepunc.py", line 336, in generate_predictions
    model.load_state_dict(loaded['model_state_dict'])
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1497, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for Model: Unexpected key(s) in state_dict: "bert.position_ids".

I tried the same with the English model, and it worked perfectly. Looks like something is broken with the French ones?

opened by maelchiotti 2

Parameters like --dab_rate can't be set from the command line because they are bool

Look at the parameters below. They really became bool; I found this bug while debugging.

'''
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("action", help="train|eval|predict|tensorize|preprocess", type=str)
    ...
    parser.add_argument("--updates", help="number of training updates to perform", default=default_config.updates, type=bool)
    parser.add_argument("--period", help="validation period in updates", default=default_config.period, type=bool)
    parser.add_argument("--lr", help="learning rate", default=default_config.lr, type=bool)
    parser.add_argument("--dab-rate", help="drop at boundaries rate", default=default_config.dab_rate, type=bool)
    config = Config(**parser.parse_args().__dict__)

    main(config, config.action, config.action_args)
'''
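
The effect of `type=bool` can be reproduced in isolation. argparse calls the `type` callable on the raw command-line string, and `bool()` of any non-empty string is `True`, so the numeric value is lost. The option names in this sketch are hypothetical and chosen only for the demonstration.

```python
import argparse

parser = argparse.ArgumentParser()
# Same pattern as in the report: a numeric option declared with type=bool.
parser.add_argument("--as-bool", default=0.1, type=bool)
# The same option declared with the type of the value it should hold.
parser.add_argument("--as-float", default=0.1, type=float)

args = parser.parse_args(["--as-bool", "0.5", "--as-float", "0.5"])
print(args.as_bool, args.as_float)  # True 0.5
```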
opened by al-zatv 0
Cannot use trained model for validation or prediction

Hi, thank you for this repo! I'm trying to reproduce the results for a different language, so I'm using multilingual BERT fine-tuned on a dataset in my language. Everything goes well during preprocessing and training, and the results are comparable with those for English and French (97-99% for case and punctuation).

But when I try to use the trained model, it gives very poor results, even for sentences from the training dataset. It works in the sense that it sometimes adds capital letters or periods, but that is rare, and mostly the model can't handle the input. Also, when I try to evaluate the model with the command from the README (I also tried it on the validation sets already used, for instance with the command python recasepunc.py eval bertugan_casepunc.24000 valid.case+punc.x valid.case+punc.y), it gives this error:

  File "recasepunc.py", line 220, in batchify
    x = x[:(len(x) // max_length) * max_length].reshape(-1, max_length)
TypeError: unhashable type: 'slice'

Sorry for raising two different problems in one issue, but I thought there might be one common mistake behind both.

opened by khusainovaidar 5

Releases (0.3)

0.3 (Feb 3, 2022)

Checkpoint release
Source code (tar.gz)
Source code (zip)
en.23000 (1249.49 MB)
fr-txt.large.19000 (523.93 MB)
fr.22000 (1575.50 MB)
zh.24000 (1166.63 MB)

0.2 (Sep 26, 2021)

Fix predictions when input is shorter than max length
Source code (tar.gz)
Source code (zip)

0.1 (Sep 20, 2021)

First French model trained on 160M tokens from Common Crawl.
Source code (tar.gz)
Source code (zip)
fr-txt.large.19000 (1571.78 MB)

Owner

Benoit Favre
