Model for recasing and repunctuating ASR transcripts

Overview

Recasing and punctuation model based on BERT

Benoit Favre 2021

This system converts a sequence of lowercase tokens without punctuation to a sequence of cased tokens with punctuation.

It is trained to predict both aspects at the token level in a multitask fashion, from fine-tuned BERT representations.

The model predicts the following recasing labels:

  • lower: keep lowercase
  • upper: convert to upper case
  • capitalize: set first letter as upper case
  • other: left as is
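
As a rough illustration of how these four labels could be applied to a lowercase token, here is a minimal sketch. The function name `apply_case` is hypothetical and is not taken from recasepunc.py.

```python
def apply_case(token: str, label: str) -> str:
    """Apply a recasing label to a single token (illustrative only)."""
    if label == "lower":
        return token.lower()
    if label == "upper":
        return token.upper()
    if label == "capitalize":
        # Upper-case only the first character, leave the rest untouched.
        return token[:1].upper() + token[1:]
    # "other": the original mixed-case pattern is unknown, keep the token as is.
    return token


print(apply_case("paris", "capitalize"))  # Paris
print(apply_case("nasa", "upper"))        # NASA
```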

And the following punctuation labels:

  • o: no punctuation
  • period: .
  • comma: ,
  • question: ?
  • exclamation: !
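
The punctuation labels can be read as "which mark follows this token". A minimal sketch of that reading, where the `PUNCTUATION` table and the `attach_punctuation` helper are hypothetical names chosen for this example:

```python
# Hypothetical mapping from punctuation label to the symbol placed after a token.
PUNCTUATION = {"o": "", "period": ".", "comma": ",", "question": "?", "exclamation": "!"}


def attach_punctuation(tokens, labels):
    """Append the predicted punctuation mark (if any) to each token."""
    return " ".join(tok + PUNCTUATION[lab] for tok, lab in zip(tokens, labels))


print(attach_punctuation(["hello", "how", "are", "you"],
                         ["comma", "o", "o", "question"]))
# hello, how are you?
```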

Input tokens are batched as sequences of length 256 that are processed independently without overlap.
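
A minimal sketch of splitting a token stream into non-overlapping sequences of fixed length. The helper name `split_into_sequences` is hypothetical, and padding the final partial sequence is an assumption made for this example; the actual handling in recasepunc.py may differ.

```python
def split_into_sequences(tokens, max_length=256, pad=0):
    """Split a flat token list into consecutive, non-overlapping sequences."""
    sequences = []
    for start in range(0, len(tokens), max_length):
        chunk = tokens[start:start + max_length]
        # Assumption: pad the last chunk so that every sequence has the same length.
        chunk = chunk + [pad] * (max_length - len(chunk))
        sequences.append(chunk)
    return sequences


batches = split_into_sequences(list(range(1, 601)))
print(len(batches), len(batches[-1]))  # 3 256
```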

In training, batches containing fewer than 256 tokens are simulated by drawing a length uniformly at random and replacing all tokens and labels after that point with padding (a technique called Cut-drop).
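
The idea can be sketched as follows. This is not the code from recasepunc.py: the function name `cut_drop`, the padding values, and the exact range of the drawn length are assumptions. The default probability of 0.1 mirrors the "Cut-drop probability" listed for the French model.

```python
import random


def cut_drop(tokens, labels, pad_token=0, pad_label=0, probability=0.1, rng=random):
    """With the given probability, keep a uniformly drawn prefix and pad the rest."""
    if rng.random() >= probability:
        return list(tokens), list(labels)
    # Assumed range: keep between 1 and len(tokens) tokens.
    cut = rng.randint(1, len(tokens))
    tail = len(tokens) - cut
    return (list(tokens[:cut]) + [pad_token] * tail,
            list(labels[:cut]) + [pad_label] * tail)


rng = random.Random(0)
tokens, labels = cut_drop(list(range(1, 257)), [1] * 256, probability=1.0, rng=rng)
print(len(tokens), len(labels))  # 256 256
```

The sequence length stays at 256, so the model sees inputs that look like short transcripts padded to the full length.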

Changelog:

  • Fix generation when input is smaller than max length

Installation

Use your favourite method for installing Python requirements. For example:

python -m venv env
. env/bin/activate
pip3 install -r requirements.txt -f https://download.pytorch.org/whl/torch_stable.html

Prediction

Predict from raw text:

python recasepunc.py predict checkpoint/path.iteration < input.txt > output.txt
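
When the command above is called from another Python program, it may help to set the text encoding explicitly, since the platform default is not always UTF-8 (for example on Windows). A minimal sketch: the helper name `run_filter` is hypothetical, and a stand-in command is used so that the example runs without a checkpoint.

```python
import subprocess
import sys


def run_filter(command, text):
    """Send text to a command's stdin as UTF-8 and return its stdout as a string."""
    result = subprocess.run(
        command,
        input=text,
        capture_output=True,
        encoding="utf-8",  # explicit encoding instead of the locale default
        check=True,
    )
    return result.stdout


# Stand-in command that upper-cases its input.
# With a real checkpoint the command would be, for example:
#   [sys.executable, "recasepunc.py", "predict", "checkpoint/path.iteration"]
demo = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read().upper())"]
print(run_filter(demo, "hello world"))  # HELLO WORLD
```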

Models

  • French: fr-txt.large.19000 trained on 160M tokens from Common Crawl
    • Iterations: 19000
    • Batch size: 16
    • Max length: 256
    • Seed: 871253
    • Cut-drop probability: 0.1
    • Train loss: 0.021128975618630648
    • Valid loss: 0.015684964135289192
    • Recasing accuracy: 96.73
    • Punctuation accuracy: 95.02
      • All punctuation F-score: 67.79
      • Comma F-score: 67.94
      • Period F-score: 72.91
      • Question F-score: 57.57
      • Exclamation mark F-score: 15.78
    • Training data: First 100M words from Common Crawl
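
This README does not define how the scores above are computed. Assuming the F-scores are the usual per-label F1 over aligned token labels (an assumption, not confirmed by the source), a computation could look like the sketch below; `f_score` is a hypothetical helper.

```python
def f_score(reference, predicted, label):
    """Precision, recall and F1 (in percent) for one label over aligned sequences."""
    pairs = list(zip(reference, predicted))
    tp = sum(r == label and p == label for r, p in pairs)
    fp = sum(r != label and p == label for r, p in pairs)
    fn = sum(r == label and p != label for r, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return 100 * precision, 100 * recall, 100 * f1


ref = ["o", "comma", "o", "period", "o", "period"]
hyp = ["o", "comma", "o", "o", "comma", "period"]
print(f_score(ref, hyp, "period"))
```

Under this reading, token-level accuracy can be high while F-scores stay much lower, because most tokens carry the `o` label.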

Training

Notes: Adapt the file names below to your own data. Training tensors are precomputed and loaded into CPU memory.

Stage 0: download text data

Stage 1: tokenize and normalize text with the Moses tokenizer, and extract recasing and repunctuation labels

python recasepunc.py preprocess < input.txt > input.case+punc
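
To illustrate what "extracting labels" means, here is a minimal sketch that turns cased, punctuated tokens into lowercase words with a case label and a punctuation label. It is not the implementation in recasepunc.py, it does not reproduce the `.case+punc` file format (which this README does not describe), and all names in it are hypothetical.

```python
PUNCTUATION_LABELS = {".": "period", ",": "comma", "?": "question", "!": "exclamation"}


def case_label(token):
    """Map a cased token to one of the four recasing labels."""
    if token.islower() or not any(c.isalpha() for c in token):
        return "lower"
    if token.isupper() and len(token) > 1:
        return "upper"
    if token[0].isupper() and token[1:] == token[1:].lower():
        return "capitalize"
    return "other"


def extract_labels(tokens):
    """Turn tokenized text into (lowercase word, case label, punctuation label) triples."""
    examples = []
    for token in tokens:
        if token in PUNCTUATION_LABELS:
            if examples:
                # Attach the mark to the preceding word.
                word, case, _ = examples[-1]
                examples[-1] = (word, case, PUNCTUATION_LABELS[token])
            continue
        examples.append((token.lower(), case_label(token), "o"))
    return examples


print(extract_labels(["Hello", ",", "how", "are", "you", "?"]))
```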

Stage 2: sub-tokenize with the FlauBERT tokenizer, and generate PyTorch tensors

python recasepunc.py tensorize input.case+punc input.case+punc.x input.case+punc.y

Stage 3: train model

python recasepunc.py train train.x train.y valid.x valid.y checkpoint/path

Stage 4: evaluate performance on a test set

python recasepunc.py eval checkpoint/path.iteration test.x test.y
Comments
  • Is it possible to customize for new language?

    Dear Benoit Favre,

    Your project is really important! Is it possible to customize it for a new language? If so, could you give a few short hints on how to do it?

    Thank you in advance!

    opened by ican24 5
  • Can't get attribute 'WordpieceTokenizer'

    Hi thanks for your effort on developing recasepunc! I know that you can't provide help for models not trained by you, but maybe you have an idea what's going wrong here:

    I'm loading the model vosk-recasepunc-de-0.21 from https://alphacephei.com/vosk/models. When I do so, torch tells me that it can't find WordpieceTokenizer. Do you know why? Is the model incompatible?

    Punc predict path: C:\Users\admin\meety\vosk-recasepunc-de-0.21\checkpoint
    Traceback (most recent call last):
      File "main2.py", line 120, in <module>
        t = transcriber()
      File "main2.py", line 32, in __init__
        self.casePuncPredictor = CasePuncPredictor(punc_predict_path, lang="de")
      File "C:\Users\admin\meety\recasepunc.py", line 273, in __init__
        loaded = torch.load(checkpoint_path, map_location=device if torch.cuda.is_available() else 'cpu')
      File "C:\Users\admin\Anaconda3\envs\meety\lib\site-packages\torch\serialization.py", line 607, in load
        return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
      File "C:\Users\admin\Anaconda3\envs\meety\lib\site-packages\torch\serialization.py", line 882, in _load
        result = unpickler.load()
      File "C:\Users\admin\Anaconda3\envs\meety\lib\site-packages\torch\serialization.py", line 875, in find_class
        return super().find_class(mod_name, name)
    AttributeError: Can't get attribute 'WordpieceTokenizer' on <module '__main__' from 'main2.py'>

    opened by padmalcom 4
  • Can't do inference

    Hello, I'm trying to use example.py on a French model (fr.22000 or fr-txt.large.19000), but I get this error:

        raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
    RuntimeError: Error(s) in loading state_dict for Model:
    	Unexpected key(s) in state_dict: "bert.position_ids".

    I also tried the following command, with the same error in the output:

    python recasepunc.py predict fr.22000 < toto.txt > output.txt

    Do you have any advice? Thanks

    opened by MatFrancois 3
  • Memory usage

    Hi, at startup the punctuation app uses about 9 GB of RAM, but only for a moment (while the model is loading). After that it needs about 1.5 GB. Can the 9 GB peak at startup be reduced? Maybe some check performed on the model at startup is a feature that can be turned off?

    opened by gubri 1
  • Russian model doesn't work, while English does

    When I use the Russian model, it gives me this error:

    WARNING: reverting to cpu as cuda is not available
    Some weights of the model checkpoint at DeepPavlov/rubert-base-cased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight']
    
     File "C:\pypy\rus\recasepunc.py", line 741, in <module>
        main(config, config.action, config.action_args)
      File "C:\pypy\rus\recasepunc.py", line 715, in main
        generate_predictions(config, *args)
      File "C:\pypy\rus\recasepunc.py", line 349, in generate_predictions
        for line in sys.stdin:
      File "C:\Users\Xenia\AppData\Local\Programs\Python\Python39\lib\codecs.py", line 322, in decode
        (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xef in position 0: invalid continuation byte
    
     File "C:\Users\Xenia\AppData\Local\Programs\Python\Python39\lib\site-packages\flask\app.py", line 1796, in dispatch_request
        return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)      
      File "C:\pypy\app.py", line 32, in process_audio
        cased = subprocess.check_output('python rus/recasepunc.py predict rus/checkpoint', shell=True, text=True, input=text)
      File "C:\Users\Xenia\AppData\Local\Programs\Python\Python39\lib\subprocess.py", line 420, in check_output
        return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
      File "C:\Users\Xenia\AppData\Local\Programs\Python\Python39\lib\subprocess.py", line 524, in run
        raise CalledProcessError(retcode, process.args,
    subprocess.CalledProcessError: Command 'python rus/recasepunc.py predict rus/checkpoint' returned non-zero exit status 1.
    

    Sorry for the long message; I'm not sure which of these messages are the most important. Should I use another version of transformers? I use transformers==4.16.2, and it works fine with the English model.

    opened by xenia19 0
  • Export model to be Used in C++

    Is it possible to export the model to a format that can be used in C++ with libtorch?

    1. Export an existing model (a checkpoint provided in this repo)
    2. Export a model after training it with my own data

    Which of the options above is possible, or are both?
    opened by leohuang2013 0
  • RuntimeError when predicting with the french models

    I tried to use the French models (both fr.22000 and fr-txt.large.19000) on a very simple text:

    j'aime les fleurs les olives et la raclette

    When running python3 recasepunc.py predict fr.22000 < input.txt > output.txt (or with the other model), I get the following RuntimeError:

    Traceback (most recent call last):
      File "/home/mael/charly/recasepunc/recasepunc.py", line 733, in <module>
        main(config, config.action, config.action_args)
      File "/home/mael/charly/recasepunc/recasepunc.py", line 707, in main
        generate_predictions(config, *args)
      File "/home/mael/charly/recasepunc/recasepunc.py", line 336, in generate_predictions
        model.load_state_dict(loaded['model_state_dict'])
      File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1497, in load_state_dict
        raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
    RuntimeError: Error(s) in loading state_dict for Model:
    	Unexpected key(s) in state_dict: "bert.position_ids".
    

    I tried the same with the English model, and it worked perfectly. It looks like something is broken with the French ones?

    opened by maelchiotti 2
  • parameters like --dab_rate can't be set from cmd line bc they are bool

    Look at the parameters below. They really become bool; I found this bug while debugging.

    if __name__ == '__main__':
        parser = argparse.ArgumentParser()
        parser.add_argument("action", help="train|eval|predict|tensorize|preprocess", type=str)
        ...
        parser.add_argument("--updates", help="number of training updates to perform", default=default_config.updates, type=bool)
        parser.add_argument("--period", help="validation period in updates", default=default_config.period, type=bool)
        parser.add_argument("--lr", help="learning rate", default=default_config.lr, type=bool)
        parser.add_argument("--dab-rate", help="drop at boundaries rate", default=default_config.dab_rate, type=bool)
        config = Config(**parser.parse_args().__dict__)

        main(config, config.action, config.action_args)
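
For illustration, the behaviour described in this report can be reproduced with a small stand-alone script. The option names and the default value below are made up for the example; only the `type=bool` pattern comes from the report.

```python
import argparse

parser = argparse.ArgumentParser()
# Problematic declaration: argparse calls bool() on the raw string.
parser.add_argument("--lr-as-bool", default=2e-5, type=bool)
# Likely intended declaration for a numeric option.
parser.add_argument("--lr-as-float", default=2e-5, type=float)

args = parser.parse_args(["--lr-as-bool", "0.001", "--lr-as-float", "0.001"])
print(args.lr_as_bool)   # True, because bool() of any non-empty string is True
print(args.lr_as_float)  # 0.001
```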

    opened by al-zatv 0
  • Cannot use trained model for validation or prediction

    Hi, thank you for this repo! I'm trying to reproduce the results for a different language, so I'm using multilingual BERT fine-tuned on a dataset in my language. Everything goes well during preprocessing and training, and the results are comparable with those for English and French (97-99% for case and punctuation).

    But when I try to use the trained model, it gives very poor results, even for sentences from the training dataset. It runs, and sometimes it produces capital letters or periods, but only rarely; mostly the model fails. Also, when I try to evaluate the model with the command from the README (I also tried it on the validation sets already used, for instance with the command python recasepunc.py eval bertugan_casepunc.24000 valid.case+punc.x valid.case+punc.y), it gives this error:

      File "recasepunc.py", line 220, in batchify
        x = x[:(len(x) // max_length) * max_length].reshape(-1, max_length)
    TypeError: unhashable type: 'slice'

    Sorry for raising two different problems in one issue, but I thought there might be one common mistake behind both.

    opened by khusainovaidar 5
Releases(0.3)
Owner
Benoit Favre