Context-Sensitive Misspelling Correction of Clinical Text via Conditional Independence, CHIL 2022

Last update: Dec 19, 2022

Overview

cim-misspelling

PyTorch implementation of Context-Sensitive Spelling Correction of Clinical Text via Conditional Independence, CHIL 2022.

This model (CIM) corrects misspellings with a character-based language model and a corruption model (edit distance). The model is pre-trained and evaluated on a clinical corpus and clinical misspelling datasets. Please see the paper for a more detailed explanation.
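As a rough illustration of how a language model and an edit-distance corruption model can be combined, the sketch below ranks candidate corrections by the sum of a context score and a corruption score. It is not the repository's code: the function names, the linear edit-distance penalty, and the toy language-model scores are all assumptions made for this example, and the real model scores candidates with a character-based LM rather than a lookup table.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def corruption_log_prob(typo: str, candidate: str, penalty: float = 2.0) -> float:
    """Hypothetical corruption model: log P(typo | candidate) = -penalty * distance."""
    return -penalty * edit_distance(typo, candidate)


def rank_candidates(typo: str, lm_log_probs: dict, penalty: float = 2.0) -> list:
    """Sort candidates by log P(candidate | context) + log P(typo | candidate)."""
    scored = [
        (cand, lm_lp + corruption_log_prob(typo, cand, penalty))
        for cand, lm_lp in lm_log_probs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up context scores standing in for the language model's output.
toy_lm_log_probs = {"patient": -1.0, "patent": -6.0, "parent": -5.0}
ranking = rank_candidates("patiet", toy_lm_log_probs)
print(ranking[0][0])
```

With these toy numbers, "patient" ranks first because it is both the most likely word in context and the closest to the typo.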

Requirements

Python 3.8 and packages in requirements.txt
The MIMIC-III dataset (v1.4): PhysioNet link
BlueBERT: GitHub link
The SPECIALIST Lexicon of UMLS: LSG website
English dictionary (DWYL): GitHub link

How to Run

Clone the repo

$ git clone --recursive https://github.com/dalgu90/cim-misspelling.git

Data preparation

Download the MIMIC-III dataset from PhysioNet, in particular NOTEEVENTS.csv, and put it under data/mimic3.
Download LRWD and prevariants of the SPECIALIST Lexicon from the LSG website (2018AB version) and put them under data/umls.
Download the English dictionary english.txt from the DWYL GitHub repository (commit 7cb484d) and put it under data/english_words.
Run scripts/build_vocab_corpus.ipynb to build the dictionary and split the MIMIC-III notes into files.
Run the Jupyter notebook for the dataset that you want to download/pre-process:
- MIMIC-III misspelling dataset, or ClinSpell (Fivez et al., 2017): scripts/preprocess_clinspell.ipynb
- CSpell dataset (Lu et al., 2019): scripts/preprocess_cspell.ipynb
- Synthetic misspelling dataset from the MIMIC-III: scripts/synthetic_dataset.ipynb
Download the BlueBERT model from the BlueBERT GitHub repository and put it under bert/ncbi_bert_{base|large}.
- For CIM-Base, please download "BlueBERT-Base, Uncased, PubMed+MIMIC-III"
- For CIM-Large, please download "BlueBERT-Large, Uncased, PubMed+MIMIC-III"
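Before running the notebooks, it can help to confirm that the downloads landed where the steps above expect them. The following is a minimal sketch, not part of the repository: the helper name missing_paths is hypothetical, and only the locations named in the steps above are checked (the file names inside data/umls and the BlueBERT directory are not verified).

```python
from pathlib import Path

# Locations taken from the data preparation steps; contents are not inspected.
EXPECTED_PATHS = [
    "data/mimic3/NOTEEVENTS.csv",
    "data/umls",
    "data/english_words/english.txt",
]


def missing_paths(root: str, bert_size: str = "base") -> list:
    """Return the expected paths (relative to root) that do not exist yet."""
    root_dir = Path(root)
    expected = EXPECTED_PATHS + [f"bert/ncbi_bert_{bert_size}"]
    return [p for p in expected if not (root_dir / p).exists()]


if __name__ == "__main__":
    # Run from the repository root; prints nothing when everything is in place.
    for path in missing_paths("."):
        print(f"missing: {path}")
```

Pass bert_size="large" to check the directory for CIM-Large instead of CIM-Base.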

Pre-training the char-based LM on MIMIC-III

Please run pretrain_cim_base.sh (CIM-Base) or pretrain_cim_large.sh (CIM-Large) to pre-train the character language model of CIM. Pre-training periodically evaluates the LM by correcting synthetic misspellings generated from the MIMIC-III data. You may need 2~4 GPUs (XXGB+ GPU memory for CIM-Base and YYGB+ for CIM-Large) to pre-train with a batch size of 256. There are several options you may want to configure:

num_gpus: number of GPUs
batch_size: batch size
training_step: total number of steps to train
init_ckpt/init_step: the checkpoint file/step from which to resume pre-training
num_beams: beam search width for evaluation
mimic_csv_dir: directory of the MIMIC-III csv splits
bert_dir: directory of the BlueBERT files
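The periodic evaluation during pre-training relies on synthetic misspellings. To give a sense of what such data can look like, here is a minimal sketch that corrupts a word with one random character edit. It is an illustrative assumption, not the procedure used in scripts/synthetic_dataset.ipynb: the function name corrupt_word, the uniform choice among four edit types, and the lowercase ASCII alphabet are all hypothetical.

```python
import random
import string


def corrupt_word(word: str, rng: random.Random) -> str:
    """Apply one random edit (insert, delete, substitute, or transpose) to word."""
    if len(word) < 2:
        return word  # too short to corrupt meaningfully in this sketch
    op = rng.choice(["insert", "delete", "substitute", "transpose"])
    i = rng.randrange(len(word))
    if op == "insert":
        return word[:i] + rng.choice(string.ascii_lowercase) + word[i:]
    if op == "delete":
        return word[:i] + word[i + 1:]
    if op == "substitute":
        # Exclude the current character so the substitution always changes it.
        choices = [c for c in string.ascii_lowercase if c != word[i]]
        return word[:i] + rng.choice(choices) + word[i + 1:]
    # Transpose: swap the character at i with its right neighbour.
    i = min(i, len(word) - 2)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


rng = random.Random(0)  # fixed seed so runs are reproducible
samples = [corrupt_word("pneumonia", rng) for _ in range(5)]
print(samples)
```

Each output is within one edit of the original word (counting an adjacent swap as a single edit), which is the kind of error the corruption model is meant to undo.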

You can also download the pre-trained LMs and put them under model/.

Misspelling Correction with CIM

Please specify the dataset directory and the file to evaluate in the evaluation script (eval_cim_base.sh or eval_cim_large.sh), and run the script.
You may want to set init_step to specify the checkpoint to load.

Cite this work

@InProceedings{juyong2022context,
  title = {Context-Sensitive Spelling Correction of Clinical Text via Conditional Independence},
  author = {Kim, Juyong and Weiss, Jeremy C and Ravikumar, Pradeep},
  booktitle = {Proceedings of the Conference on Health, Inference, and Learning},
  pages = {234--247},
  year = {2022},
  volume = {174},
  series = {Proceedings of Machine Learning Research},
  month = {07--08 Apr},
  publisher = {PMLR}
}

Owner

Juyong Kim
