Transformer Based Korean Sentence Spacing Corrector

Last update: Apr 18, 2022

Overview

TKOrrector

Transformer Based Korean Sentence Spacing Corrector

License Summary

This solution is made available under the Apache 2 license. See the LICENSE file.

Minimum Requirements

It is recommended that you run training on a machine with an NVIDIA GPU that has the drivers and CUDA installed.
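
One quick way to check whether a machine meets this recommendation is to look for `nvidia-smi`, which is installed along with the NVIDIA driver. The snippet below is a minimal sketch; `gpu_status` is a hypothetical helper name and not part of this repo.

```shell
#!/bin/sh
# gpu_status is a hypothetical helper, not part of this repo.
# It prints "gpu" when nvidia-smi exists and runs successfully, otherwise "cpu".
gpu_status() {
    if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi >/dev/null 2>&1; then
        echo "gpu"
    else
        echo "cpu"
    fi
}

echo "Detected device class: $(gpu_status)"
```

If this reports `cpu`, the demo with the pretrained model may still be usable, but training is likely to be slow.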

Prerequisites

Clone this repo and cd into it.
Install dependencies, preferably in a virtual env.

a. Optional: Create new virtual env. Conda example below.
conda create --name TKOrrector python=3.9 -y
conda activate TKOrrector

b. Install PyTorch with CUDA
conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c nvidia

or

b. Install PyTorch without GPU
conda install pytorch torchvision torchaudio cpuonly -c pytorch

c. Install dependencies
pip install -r requirements.txt
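
After the install steps above, you may want to confirm that PyTorch imports and whether it can see CUDA. This is a minimal sketch, assuming `python` on your PATH is the interpreter from the environment you installed into; `check_torch` is a hypothetical helper name.

```shell
#!/bin/sh
# check_torch is a hypothetical helper, not part of this repo.
# Prints "cuda" or "cpu" when PyTorch imports, "missing" when it does not.
check_torch() {
    python -c 'import torch; print("cuda" if torch.cuda.is_available() else "cpu")' 2>/dev/null \
        || echo "missing"
}

echo "PyTorch backend: $(check_torch)"
```

A result of `cpu` after the CUDA install usually points at a driver or toolkit mismatch rather than at this repo.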

Run

You can run the pretrained model without training it yourself.

Download the pretrained model and extract it into the current directory (`tar zxvf TKOrrector.tar.gz`).

sh demo.sh

Example demo run screen and results.
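
The extract step above can be wrapped so that it fails early when the archive has not been downloaded yet. This is a sketch only; `extract_model` is a hypothetical helper and not part of this repo.

```shell
#!/bin/sh
# extract_model ARCHIVE [DEST_DIR]
# Hypothetical helper: extracts a .tar.gz archive into DEST_DIR (default: the
# current directory) and returns non-zero if the archive is missing.
extract_model() {
    archive="$1"
    dest="${2:-.}"
    if [ ! -f "$archive" ]; then
        echo "Archive not found: $archive" >&2
        return 1
    fi
    mkdir -p "$dest" && tar -xzf "$archive" -C "$dest"
}

# Usage from the repository root:
#   extract_model TKOrrector.tar.gz . && sh demo.sh
```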

Train

Download the Corpus

Go to the NIKL Corpus Download Site and apply for a new license.

The corpus is free, but you need to sign an agreement. Consider uploading the corpus file to object storage such as GCS so that you can download it quickly onto additional machines, for example a GCP GCE VM with a GPU that you start only when you need to train. This avoids a large upfront cost. Edit src/download_corpus.sh so that it downloads the corpus file and expands it into the designated directory.

cd src
sh download_corpus.sh
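
The staging approach described above could look roughly like the sketch below, which uses `gsutil cp` to copy the archive to and from a bucket. The bucket URL, the archive name, and both helper names are placeholders chosen for illustration; they do not come from this repo or from download_corpus.sh.

```shell
#!/bin/sh
# Hypothetical helpers, not part of this repo. Set DRY_RUN=1 to print the
# gsutil command instead of running it.

# stage_corpus LOCAL_FILE BUCKET_URL: upload the corpus archive once.
stage_corpus() {
    ${DRY_RUN:+echo} gsutil cp "$1" "$2/$(basename "$1")"
}

# fetch_corpus BUCKET_URL FILE_NAME DEST_DIR: download it onto a training VM.
fetch_corpus() {
    ${DRY_RUN:+echo} gsutil cp "$1/$2" "$3/"
}

# Example with placeholder names (bucket and archive name are assumptions):
#   stage_corpus ./corpus.zip gs://my-bucket/nikl
#   fetch_corpus gs://my-bucket/nikl corpus.zip ../data
```

Running with `DRY_RUN=1` first is a cheap way to confirm the paths before any data is transferred.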

Run the data prep stage

Change lines 51 and 53 in prepare_corpus_with_tokenizer.sh to increase the training dataset size.  
The second argument is the number of files to include in the training set, plus 1.  
`get_corpus "../data/$CORPUS1/*" 10`  
The command above includes 9 files from the Newspaper corpus (the manual PDF file is skipped).
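
To make the "number of files + 1" arithmetic concrete, here is a small sketch that selects files the same way the description reads: an argument of N yields N - 1 data files, and PDF files are never included. `select_corpus_files` is a hypothetical function written for illustration; it is not the `get_corpus` implementation from prepare_corpus_with_tokenizer.sh, which may work differently internally.

```shell
#!/bin/sh
# select_corpus_files DIR LIMIT
# Hypothetical illustration only: prints up to LIMIT - 1 regular files from
# DIR, skipping any file whose name ends in .pdf.
select_corpus_files() {
    dir="$1"
    wanted=$(( $2 - 1 ))
    count=0
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        case "$f" in *.pdf) continue ;; esac
        [ "$count" -lt "$wanted" ] || break
        printf '%s\n' "$f"
        count=$((count + 1))
    done
}

# Usage: select_corpus_files ../data/some_corpus_dir 10   # prints at most 9 paths
```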

Run the data prep command.

sh prepare_corpus_with_tokenizer.sh

Run the training stage

Run the training command.

sh train.sh

Run the Evaluation

After training is done, you can evaluate the model on the test dataset using batch translations by running the command below.

sh calculate_metrics.sh
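
The metrics that calculate_metrics.sh reports are not described here. As a separate, simple sanity check on batch output, sentence-level exact match can be computed from two text files with one sentence per line. This is a sketch under that assumption; `exact_match_rate` and the file layout are hypothetical and not part of this repo.

```shell
#!/bin/sh
# exact_match_rate PREDICTIONS_FILE REFERENCES_FILE
# Hypothetical helper: prints the fraction of lines (two decimals) where the
# predicted sentence is identical to the reference, spacing included.
# Assumes one sentence per line and no tab characters inside sentences.
exact_match_rate() {
    paste "$1" "$2" | awk -F '\t' '
        { total++; if ($1 "" == $2 "") hits++ }   # "" forces string comparison
        END { if (total > 0) printf "%.2f\n", hits / total; else print "0.00" }'
}

# Usage: exact_match_rate predictions.txt references.txt
```

Exact match is strict: a single misplaced space makes the whole sentence count as wrong, so it tends to understate quality compared with per-space measures.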

Owner

Paul Hyung Yuel Kim
