A Multi-modal Model Chinese Spell Checker Released on ACL2021.

Overview

ReaLiSe

ReaLiSe is a multi-modal Chinese spell checking model.

This the office code for the paper Read, Listen, and See: Leveraging Multimodal Information Helps Chinese Spell Checking.

The paper has been accepted in ACL Findings 2021.

Environment

  • Python: 3.6
  • Cuda: 10.0
  • Packages: pip install -r requirements.txt

Data

Raw Data

SIGHAN Bake-off 2013: http://ir.itc.ntnu.edu.tw/lre/sighan7csc.html
SIGHAN Bake-off 2014: http://ir.itc.ntnu.edu.tw/lre/clp14csc.html
SIGHAN Bake-off 2015: http://ir.itc.ntnu.edu.tw/lre/sighan8csc.html
Wang271K: https://github.com/wdimmy/Automatic-Corpus-Generation

Data Processing

The code and cleaned data are in the data_process directory.

You can also directly download the processed data from this and put them in the data directory. The data directory would look like this:

data
|- trainall.times2.pkl
|- test.sighan15.pkl
|- test.sighan15.lbl.tsv
|- test.sighan14.pkl
|- test.sighan14.lbl.tsv
|- test.sighan13.pkl
|- test.sighan13.lbl.tsv

Pretrain

  • BERT: chinese-roberta-wwm-ext

    Huggingface hfl/chinese-roberta-wwm-ext: https://huggingface.co/hfl/chinese-roberta-wwm-ext
    Local: /data/dobby_ceph_ir/neutrali/pretrained_models/roberta-base-ch-for-csc/

  • Phonetic Encoder: pretrain_pho.sh

  • Graphic Encoder: pretrain_res.sh

  • Merge: merge.py

You can also directly download the pretrained and merged BERT, Phonetic Encoder, and Graphic Encoder from this, and put them in the pretrained directory:

pretrained
|- pytorch_model.bin
|- vocab.txt
|- config.json

Train

After preparing the data and pretrained model, you can train ReaLiSe by executing the train.sh script. Note that you should set up the PRETRAINED_DIR, DATE_DIR, and OUTPUT_DIR in it.

sh train.sh

Test

Test ReaLiSe using the test.sh script. You should set up the DATE_DIR, CKPT_DIR, and OUTPUT_DIR in it. CKPT_DIR is the OUTPUT_DIR of the training process.

sh test.sh

Well-trained Model

You can also download well-trained model from this direct using. The performance scores of RealiSe and some baseline models on the SIGHAN13, SIGHAN14, SIGHAN15 test set are here:

Methods

Metrics

  • "D" means "Detection Level", "C" means "Correction Level".
  • "A", "P", "R", "F" means "Accuracy", "Precision", "Recall", and "F1" respectively.

SIGHAN15

Method D-A D-P D-R D-F C-A C-P C-R C-F
FASpell 74.2 67.6 60.0 63.5 73.7 66.6 59.1 62.6
Soft-Masked BERT 80.9 73.7 73.2 73.5 77.4 66.7 66.2 66.4
SpellGCN - 74.8 80.7 77.7 - 72.1 77.7 75.9
BERT 82.4 74.2 78.0 76.1 81.0 71.6 75.3 73.4
ReaLiSe 84.7 77.3 81.3 79.3 84.0 75.9 79.9 77.8

SIGHAN14

Method D-A D-P D-R D-F C-A C-P C-R C-F
Pointer Network - 63.2 82.5 71.6 - 79.3 68.9 73.7
SpellGCN - 65.1 69.5 67.2 - 63.1 67.2 65.3
BERT 75.7 64.5 68.6 66.5 74.6 62.4 66.3 64.3
ReaLiSe 78.4 67.8 71.5 69.6 77.7 66.3 70.0 68.1

SIGHAN13

Method D-A D-P D-R D-F C-A C-P C-R C-F
FASpell 63.1 76.2 63.2 69.1 60.5 73.1 60.5 66.2
SpellGCN 78.8 85.7 78.8 82.1 77.8 84.6 77.8 81.0
BERT 77.0 85.0 77.0 80.8 77.4 83.0 75.2 78.9
ReaLiSe 82.7 88.6 82.5 85.4 81.4 87.2 81.2 84.1

Citation

@misc{xu2021read,
      title={Read, Listen, and See: Leveraging Multimodal Information Helps Chinese Spell Checking}, 
      author={Heng-Da Xu and Zhongli Li and Qingyu Zhou and Chao Li and Zizhen Wang and Yunbo Cao and Heyan Huang and Xian-Ling Mao},
      year={2021},
      eprint={2105.12306},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
Owner
DaDa
A student majoring in Computer Science in BIT.
DaDa
Transformers4Rec is a flexible and efficient library for sequential and session-based recommendation, available for both PyTorch and Tensorflow.

Transformers4Rec is a flexible and efficient library for sequential and session-based recommendation, available for both PyTorch and Tensorflow.

730 Jan 09, 2023
Traditional Chinese Text Recognition Dataset: Synthetic Dataset and Labeled Data

Traditional Chinese Text Recognition Dataset: Synthetic Dataset and Labeled Data Authors: Yi-Chang Chen, Yu-Chuan Chang, Yen-Cheng Chang and Yi-Ren Ye

Yi-Chang Chen 5 Dec 15, 2022
Research Code for NeurIPS 2020 Spotlight paper "Large-Scale Adversarial Training for Vision-and-Language Representation Learning": UNITER adversarial training part

VILLA: Vision-and-Language Adversarial Training This is the official repository of VILLA (NeurIPS 2020 Spotlight). This repository currently supports

Zhe Gan 109 Dec 31, 2022
A simple command line tool for text to image generation, using OpenAI's CLIP and a BigGAN

artificial intelligence cosmic love and attention fire in the sky a pyramid made of ice a lonely house in the woods marriage in the mountains lantern

Phil Wang 2.3k Jan 01, 2023
Write Alphabet, Words and Sentences with your eyes.

The-Next-Gen-AI-Eye-Writer The Eye tracking Technique has become one of the most popular techniques within the human and computer interaction era, thi

Rohan Kasabe 2 Apr 05, 2022
pyupbit 라이브러리를 활용하여 upbit에서 비트코인을 자동매매하는 코드입니다. 조코딩 유튜브 채널에서 자세한 강의 영상을 보실 수 있습니다.

파이썬 비트코인 투자 자동화 강의 코드 by 유튜브 조코딩 채널 pyupbit 라이브러리를 활용하여 upbit 거래소에서 비트코인 자동매매를 하는 코드입니다. 파일 구성 test.py : 잔고 조회 (1강) backtest.py : 백테스팅 코드 (2강) bestK.p

조코딩 JoCoding 186 Dec 29, 2022
Lightweight utility tools for the detection of multiple spellings, meanings, and language-specific terminology in British and American English

Breame ( British English and American English) Breame is a lightweight Python package with a number of utility tools to aid in the detection of words

Charles 8 Oct 10, 2022
🚀 RocketQA, dense retrieval for information retrieval and question answering, including both Chinese and English state-of-the-art models.

In recent years, the dense retrievers based on pre-trained language models have achieved remarkable progress. To facilitate more developers using cutt

475 Jan 04, 2023
Korean stereoypte detector with TUNiB-Electra and K-StereoSet

Korean Stereotype Detector Korean stereotype sentence classifier using K-StereoSet with TUNiB-Electra Web demo you can test this model easily in demo

Sae_Chan_Oh 11 Feb 18, 2022
Ongoing research training transformer language models at scale, including: BERT & GPT-2

Megatron (1 and 2) is a large, powerful transformer developed by the Applied Deep Learning Research team at NVIDIA.

NVIDIA Corporation 3.5k Dec 30, 2022
Perform sentiment analysis on textual data that people generally post on websites like social networks and movie review sites.

Sentiment Analyzer The goal of this project is to perform sentiment analysis on textual data that people generally post on websites like social networ

Madhusudan.C.S 53 Mar 01, 2022
Model parallel transformers in JAX and Haiku

Table of contents Mesh Transformer JAX Updates Pretrained Models GPT-J-6B Links Acknowledgments License Model Details Zero-Shot Evaluations Architectu

Ben Wang 4.9k Jan 04, 2023
PRAnCER is a web platform that enables the rapid annotation of medical terms within clinical notes.

PRAnCER (Platform enabling Rapid Annotation for Clinical Entity Recognition) is a web platform that enables the rapid annotation of medical terms within clinical notes. A user can highlight spans of

Sontag Lab 39 Nov 14, 2022
UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language

UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language This repository contains UA-GEC data and an accompanying Python lib

Grammarly 227 Jan 02, 2023
An open-source NLP research library, built on PyTorch.

An Apache 2.0 NLP research library, built on PyTorch, for developing state-of-the-art deep learning models on a wide variety of linguistic tasks. Quic

AI2 11.4k Jan 01, 2023
Faster, modernized fork of the language identification tool langid.py

py3langid py3langid is a fork of the standalone language identification tool langid.py by Marco Lui. Original license: BSD-2-Clause. Fork license: BSD

Adrien Barbaresi 12 Nov 05, 2022
The Internet Archive Research Assistant - Daily search Internet Archive for new items matching your keywords

The Internet Archive Research Assistant - Daily search Internet Archive for new items matching your keywords

Kay Savetz 60 Dec 25, 2022
Perform sentiment analysis and keyword extraction on Craigslist listings

craiglist-helper synopsis Perform sentiment analysis and keyword extraction on Craigslist listings Background I love Craigslist. I've found most of my

Mark Musil 1 Nov 08, 2021
A Lightweight NLP Data Loader for All Deep Learning Frameworks in Python

LineFlow: Framework-Agnostic NLP Data Loader in Python LineFlow is a simple text dataset loader for NLP deep learning tasks. LineFlow was designed to

TofuNLP 177 Jan 04, 2023
This is an incredibly powerful calculator that is capable of many useful day-to-day functions.

Description 💻 This is an incredibly powerful calculator that is capable of many useful day-to-day functions. Such functions include solving basic ari

Jordan Leich 37 Nov 19, 2022