This repository contains code accompanying the paper "An End-to-End Chinese Text Normalization Model based on Rule-Guided Flat-Lattice Transformer"

Last update: Nov 28, 2022

FlatTN

This repository contains code accompanying the paper "An End-to-End Chinese Text Normalization Model based on Rule-Guided Flat-Lattice Transformer", published at ICASSP 2022.

Requirements

Python: 3.7.3
PyTorch: 1.2.0
FastNLP: 0.5.0
Numpy: 1.16.4
fitlog

For more information about FastNLP and fitlog, please refer to their respective project documentation.

Dataset download

We release a large-scale Chinese Text Normalization (TN) dataset in cooperation with Databaker (Beijing) Technology Co., Ltd.

To download the dataset, please visit https://www.data-baker.com/en/#/data/index/TNtts.

(For Chinese version of the download page, please visit https://www.data-baker.com/data/index/TNtts.)

Data preprocessing

The raw dataset, in jsonl format, is saved at: dataset/processed/CN_TN_epoch-01-28645_2.jsonl

We preprocessed the data into the BMES format and divided it into train, dev, and test sets at a ratio of 8:1:1:

dataset/processed/shuffled_BMES
                      ├── train.char.bmes
                      ├── dev.char.bmes
                      └── test.char.bmes
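
For readers who want to see what an 8:1:1 division looks like in practice, here is a minimal sketch in Python. It is not the repository's preprocess.py; the function name `split_8_1_1` and the fixed seed are assumptions made for illustration only.

```python
import random


def split_8_1_1(sentences, seed=0):
    """Shuffle the items and split them into train/dev/test at roughly 8:1:1.

    Hypothetical helper for illustration; the real preprocess.py may differ.
    """
    items = list(sentences)
    # A seeded generator keeps the shuffle reproducible across runs.
    random.Random(seed).shuffle(items)
    n_train = len(items) * 8 // 10
    n_dev = len(items) // 10
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    # Whatever remains after train and dev goes to test.
    test = items[n_train + n_dev:]
    return train, dev, test


train, dev, test = split_8_1_1(range(100))
print(len(train), len(dev), len(test))
```

Integer arithmetic is used for the split sizes so that rounding never drops an item: every input sentence ends up in exactly one of the three parts.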

An example of the processed data in BMES format is as follows:

2 B-DIGIT
0 M-DIGIT
1 M-DIGIT
5 E-DIGIT
年 S-SELF
， S-PUNC
只 S-SELF
剩 S-SELF
3 B-CARDINAL
9 E-CARDINAL
天 S-SELF
。 S-PUNC
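
To make the tagging scheme concrete, the following sketch groups character-level BMES tags back into labeled spans. It assumes each line holds one character and one tag separated by whitespace, as in the example above; the helper name `bmes_to_spans` is hypothetical and not part of this repository.

```python
def bmes_to_spans(pairs):
    """Group (char, tag) pairs into (text, category) spans.

    Tags have the form "<position>-<category>", where position is
    B (begin), M (middle), E (end), or S (single character).
    Hypothetical helper for illustration only.
    """
    spans = []
    buffer = []
    for char, tag in pairs:
        position, category = tag.split("-", 1)
        if position == "S":
            spans.append((char, category))
        elif position == "B":
            buffer = [char]
        elif position == "M":
            buffer.append(char)
        elif position == "E":
            buffer.append(char)
            spans.append(("".join(buffer), category))
            buffer = []
    return spans


example = """\
2 B-DIGIT
0 M-DIGIT
1 M-DIGIT
5 E-DIGIT
年 S-SELF
， S-PUNC
只 S-SELF
剩 S-SELF
3 B-CARDINAL
9 E-CARDINAL
天 S-SELF
。 S-PUNC
"""

pairs = [line.split() for line in example.strip().splitlines()]
spans = bmes_to_spans(pairs)
for text, category in spans:
    print(text, category)
```

In this example, "2015" is recovered as a single DIGIT span and "39" as a single CARDINAL span, which reflects that the same digit characters can belong to different normalization categories depending on context.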

You can re-run our code to preprocess and divide the raw dataset again:

cd dataset/processed
python preprocess.py

You can also use the following code to get statistics of all NSW categories in the data:

cd dataset/processed
python stat.py
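
As a rough illustration of what such category statistics involve, the sketch below counts one span per B- or S- tag in a sequence of BMES tags. This is not the repository's stat.py, whose exact behavior and output format are not described here; the function name `count_categories` is an assumption, and the sketch counts every category (including SELF and PUNC) rather than deciding which ones qualify as NSW.

```python
from collections import Counter


def count_categories(tags):
    """Count spans per category in a sequence of BMES tags.

    Each span starts with exactly one B- or S- tag, so counting those
    tags gives the number of spans. Hypothetical helper for illustration.
    """
    counts = Counter()
    for tag in tags:
        position, category = tag.split("-", 1)
        if position in ("B", "S"):
            counts[category] += 1
    return counts


tags = [
    "B-DIGIT", "M-DIGIT", "M-DIGIT", "E-DIGIT",
    "S-SELF", "S-PUNC", "S-SELF", "S-SELF",
    "B-CARDINAL", "E-CARDINAL", "S-SELF", "S-PUNC",
]
counts = count_categories(tags)
for category, n in counts.most_common():
    print(category, n)
```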

Training

Our code is in the V1 directory. To start training, run:

cd V1
python flat_main.py --dataset databaker

Our proposed rule base is saved in a Python file: V1/add_rule.py

Acknowledgement

Our code is based on Flat-Lattice-Transformer (FLAT) from LeeSureman.

For more information about FLAT, please refer to LeeSureman/Flat-Lattice-Transformer.

Owner

THUHCSI
