PyTorch impelementations of BERT-based Spelling Error Correction Models.

Overview

BertBasedCorrectionModels

基于BERT的文本纠错模型,使用PyTorch实现

数据准备

  1. http://nlp.ee.ncu.edu.tw/resource/csc.html下载SIGHAN数据集
  2. 解压上述数据集并将文件夹中所有 ''.sgml'' 文件复制至 datasets/csc/ 目录
  3. 复制 ''SIGHAN15_CSC_TestInput.txt'' 和 ''SIGHAN15_CSC_TestTruth.txt'' 至 datasets/csc/ 目录
  4. 下载 https://github.com/wdimmy/Automatic-Corpus-Generation/blob/master/corpus/train.sgml 至 datasets/csc 目录
  5. 请确保以下文件在 datasets/csc 中
    train.sgml
    B1_training.sgml
    C1_training.sgml  
    SIGHAN15_CSC_A2_Training.sgml  
    SIGHAN15_CSC_B2_Training.sgml  
    SIGHAN15_CSC_TestInput.txt
    SIGHAN15_CSC_TestTruth.txt
    

环境准备

  1. 使用已有编码环境或通过 conda create -n python=3.7 创建一个新环境(推荐)
  2. 克隆本项目并进入项目根目录
  3. 安装所需依赖 pip install -r requirements.txt
  4. 如果出现报错 GLIBC 版本过低的问题(GLIBC 的版本更迭容易出事故,不推荐更新),openCC 改为安装较低版本(例如 1.1.0)
  5. 在当前终端将此目录加入环境变量 export PYTHONPATH=.

训练

运行以下命令以训练模型,首次运行会自动处理数据。

python tools/train_csc.py --config_file csc/train_SoftMaskedBert.yml

可选择不同配置文件以训练不同模型,目前支持以下配置文件:

  • train_bert4csc.yml
  • train_macbert4csc.yml
  • train_SoftMaskedBert.yml

如有其他需求,可根据需要自行调整配置文件中的参数。

实验结果

SoftMaskedBert

component sentence level acc p r f
Detection 0.5045 0.8252 0.8416 0.8333
Correction 0.8055 0.9395 0.8748 0.9060

Bert类

char level

MODEL p r f
BERT4CSC 0.9269 0.8651 0.8949
MACBERT4CSC 0.9380 0.8736 0.9047

sentence level

model acc p r f
BERT4CSC 0.7990 0.8482 0.7214 0.7797
MACBERT4CSC 0.8027 0.8525 0.7251 0.7836

推理

方法一,使用inference脚本:

python inference.py --ckpt_fn epoch=0-val_loss=0.03.ckpt --texts "我今天很高心"
# 或给出line by line格式的文本地址
python inference.py --ckpt_fn epoch=0-val_loss=0.03.ckpt --text_file /ml/data/text.txt

其中/ml/data/text.txt文本如下:

我今天很高心
你这个辣鸡模型只能做错别字纠正

方法二,直接调用

texts = ['今天我很高心', '测试', '继续测试']
model.predict(texts)

方法三、导出bert权重,使用transformers或pycorrector调用

  1. 使用convert_to_pure_state_dict.py导出bert权重
  2. 后续步骤参考https://github.com/shibing624/pycorrector/blob/master/pycorrector/macbert/README.md

引用

如果你在研究中使用了本项目,请按如下格式引用:

@article{cai2020pre,
  title={BERT Based Correction Models},
  author={Cai, Heng and Chen, Dian},
  journal={GitHub. Note: https://github.com/gitabtion/BertBasedCorrectionModels},
  year={2020}
}

License

本源代码的授权协议为 Apache License 2.0,可免费用做商业用途。请在产品说明中附加本项目的链接和授权协议。本项目受版权法保护,侵权必究。

更新记录

20210618

  1. 修复数据处理的编码报错问题

20210518

  1. 将BERT4CSC检错任务改为使用FocalLoss
  2. 更新修改后的模型实验结果
  3. 降低数据处理时保留原文的概率

20210517

  1. 对BERT4CSC模型新增检错任务
  2. 新增基于LineByLine文件的inference

References

  1. Spelling Error Correction with Soft-Masked BERT
  2. http://ir.itc.ntnu.edu.tw/lre/sighan8csc.html
  3. https://github.com/wdimmy/Automatic-Corpus-Generation
  4. transformers
  5. https://github.com/sunnyqiny/Confusionset-guided-Pointer-Networks-for-Chinese-Spelling-Check
  6. SoftMaskedBert-PyTorch
  7. Deep-Learning-Project-Template
  8. https://github.com/lonePatient/TorchBlocks
  9. https://github.com/shibing624/pycorrector
Owner
Heng Cai
NLPer
Heng Cai
Training code for Korean multi-class sentiment analysis

KoSentimentAnalysis Bert implementation for the Korean multi-class sentiment analysis 왜 한국어 감정 다중분류 모델은 거의 없는 것일까?에서 시작된 프로젝트 Environment: Pytorch, Da

Donghoon Shin 3 Dec 02, 2022
Official code for Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset

Official code for our Interspeech 2021 - Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset [1]*. Visually-grounded spoken language datasets c

Ian Palmer 3 Jan 26, 2022
Use AutoModelForSeq2SeqLM in Huggingface Transformers to train COMET

Training COMET using seq2seq setting Use AutoModelForSeq2SeqLM in Huggingface Transformers to train COMET. The codes are modified from run_summarizati

tqfang 9 Dec 17, 2022
Simple GUI where you can enter an article and get a crisp summarized version.

Text-Summarization-using-TextRank-BART Simple GUI where you can enter an article and get a crisp summarized version. How to run: Clone the repo Instal

Rohit P 4 Sep 28, 2022
Text-Based zombie apocalyptic decision-making game in Python

Inspiration We shared university first year game coursework.[to gauge previous experience and start brainstorming] Adapted a particular nuclear fallou

Amin Sabbagh 2 Feb 17, 2022
LSTM based Sentiment Classification using Tensorflow - Amazon Reviews Rating

LSTM based Sentiment Classification using Tensorflow - Amazon Reviews Rating (Dataset) The dataset is from Amazon Review Data (2018)

Immanuvel Prathap S 1 Jan 16, 2022
Python port of Google's libphonenumber

phonenumbers Python Library This is a Python port of Google's libphonenumber library It supports Python 2.5-2.7 and Python 3.x (in the same codebase,

David Drysdale 3.1k Dec 29, 2022
Data loaders and abstractions for text and NLP

torchtext This repository consists of: torchtext.data: Generic data loaders, abstractions, and iterators for text (including vocabulary and word vecto

3.2k Dec 30, 2022
A python package for deep multilingual punctuation prediction.

This python library predicts the punctuation of English, Italian, French and German texts. We developed it to restore the punctuation of transcribed spoken language.

Oliver Guhr 27 Dec 22, 2022
中文生成式预训练模型

T5 PEGASUS 中文生成式预训练模型,以mT5为基础架构和初始权重,通过类似PEGASUS的方式进行预训练。 详情可见:https://kexue.fm/archives/8209 Tokenizer 我们将T5 PEGASUS的Tokenizer换成了BERT的Tokenizer,它对中文更

410 Jan 03, 2023
PIZZA - a task-oriented semantic parsing dataset

The PIZZA dataset continues the exploration of task-oriented parsing by introducing a new dataset for parsing pizza and drink orders, whose semantics cannot be captured by flat slots and intents.

17 Dec 14, 2022
SGMC: Spectral Graph Matrix Completion

SGMC: Spectral Graph Matrix Completion Code for AAAI21 paper "Scalable and Explainable 1-Bit Matrix Completion via Graph Signal Learning". Data Format

Chao Chen 8 Dec 12, 2022
Contains analysis of trends from Fitbit Dataset (source: Kaggle) to see how the trends can be applied to Bellabeat customers and Bellabeat products

Contains analysis of trends from Fitbit Dataset (source: Kaggle) to see how the trends can be applied to Bellabeat customers and Bellabeat products.

Leah Pathan Khan 2 Jan 12, 2022
A desktop GUI providing an audio interface for GPT3.

Jabberwocky neil_degrasse_tyson_with_audio.mp4 Project Description This GUI provides an audio interface to GPT-3. My main goal was to provide a conven

16 Nov 27, 2022
Implementation of TF-IDF algorithm to find documents similarity with cosine similarity

NLP learning Trying to learn NLP to use in my projects! Table of Contents About The Project Built With Getting Started Requirements Run Usage License

Faraz Farangizadeh 3 Aug 25, 2022
Using Bert as the backbone model for lime, designed for NLP task explanation (sentence pair text classification task)

Lime Comparing deep contextualized model for sentences highlighting task. In addition, take the classic explanation model "LIME" with bert-base model

JHJu 2 Jan 18, 2022
Almost State-of-the-art Text Generation library

Ps: we are adding transformer model soon Text Gen 🐐 Almost State-of-the-art Text Generation library Text gen is a python library that allow you build

Emeka boris ama 63 Jun 24, 2022
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility. Main features: Train new vocabularies and tok

Hugging Face 6.2k Dec 31, 2022
This is a modification of the OpenAI-CLIP repository of moein-shariatnia

This is a modification of the OpenAI-CLIP repository of moein-shariatnia

Sangwon Beak 2 Mar 04, 2022
Sentello is python script that simulates the anti-evasion and anti-analysis techniques used by malware.

sentello Sentello is a python script that simulates the anti-evasion and anti-analysis techniques used by malware. For techniques that are difficult t

Malwation 62 Oct 02, 2022