Natural Language Processing for Adverse Drug Reaction (ADR) Detection

This repo contains code from a project to identify ADRs in discharge summaries at Austin Health. The model is built with the HuggingFace Transformers library, starting from the pretrained DeBERTa model. Further MLM pre-training is performed on a large corpus of unannotated discharge summaries. Finally, fine-tuning is performed on a corpus of discharge summaries annotated using Prodigy. The model performs NER, but final performance is measured at the document level, using the maximum token-level score as each document's score.
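To make the document-level scoring concrete, here is a minimal sketch of reducing token-level scores to a single document score by taking the maximum. The function names, the 0.5 threshold and the handling of empty documents are assumptions for illustration and are not taken from the repo's code.

```python
from typing import Sequence


def document_score(token_scores: Sequence[float]) -> float:
    """Return the highest token-level ADR score in a document."""
    # Assumption: a document with no tokens carries no evidence of an ADR,
    # so it is scored 0.0 rather than raising an error.
    return max(token_scores, default=0.0)


def document_prediction(token_scores: Sequence[float], threshold: float = 0.5) -> bool:
    """Flag the document if any single token reaches the threshold."""
    # The 0.5 default is a placeholder, not a value from the project.
    return document_score(token_scores) >= threshold


if __name__ == "__main__":
    scores = [0.02, 0.10, 0.91, 0.30]
    print(document_score(scores), document_prediction(scores))
```

Taking the maximum means one confidently tagged token is enough to flag a whole discharge summary, which suits a screening task where a single ADR mention matters.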

We used Weights and Biases for experiment tracking.

The pretrain script takes a folder of discharge summaries stored as CSV files, tokenizes the text, and continues MLM training from deberta-base.
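As a rough sketch of the first step only (gathering text from a folder of CSV files before tokenization), something like the following could be used. The helper name iter_summaries and the column name "text" are hypothetical; the actual pretrain script may read its input differently.

```python
import csv
from pathlib import Path
from typing import Iterator


def iter_summaries(folder: str, text_column: str = "text") -> Iterator[str]:
    """Yield non-empty summaries from every CSV file in a folder."""
    # Sort the paths so the order is reproducible between runs.
    for csv_path in sorted(Path(folder).glob("*.csv")):
        with csv_path.open(newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                # "text" is an assumed column name; rows without it are skipped.
                text = (row.get(text_column) or "").strip()
                if text:
                    yield text
```

The resulting strings would then be tokenized and passed to the MLM training loop.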

Fine-tuning is then run from the command line with the finetune script. The script assumes the data is either a JSONL file of annotated text exported from Prodigy (--datafile example.jsonl) or a saved HuggingFace Dataset. If you run the script once on a JSONL file of annotations, you can choose to save the Dataset into a folder (--save_data_dir "save_to_here") and point to that folder in subsequent training runs (--datafile "save_to_here").
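For readers unfamiliar with the input format, the sketch below reads a Prodigy-style JSONL export into plain Python records. It assumes the usual layout of Prodigy NER exports (one JSON object per line with "text", "spans" and "answer" fields, and character offsets in "start" and "end"); the function name is hypothetical, and the finetune script's own loader may work differently.

```python
import json
from typing import Dict, List


def load_prodigy_jsonl(path: str) -> List[Dict]:
    """Read accepted annotations from a Prodigy-style JSONL export."""
    examples = []
    # The file must be UTF-8 encoded.
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            # Assumption: keep only examples the annotator accepted.
            if record.get("answer", "accept") != "accept":
                continue
            spans = [
                (span["start"], span["end"], span["label"])
                for span in record.get("spans", [])
            ]
            examples.append({"text": record["text"], "spans": spans})
    return examples
```

Each returned record pairs the raw text with (start, end, label) character spans, which can then be aligned to tokens for NER training.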

Example usage:

python .\finetune.py --folds 5 --epochs 15 --lr 5e-5 --wandb_on --hub_off --project 'CLI Tests' --run_name cross-validation --datafile 'data'

Note: you might find that your exported annotations (JSONL file) are not encoded as UTF-8, which will prevent this code from working. There are various ways to change the encoding, all of which can be found with a quick Google search. On a Windows machine, for example, run the following in PowerShell, substituting your own file names:

Get-Content .\name_of_file.jsonl -Encoding Unicode | Set-Content -Encoding UTF8 .\name_of_new_file.jsonl
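On other platforms, the same conversion can be sketched in Python. This assumes the export is UTF-16 (which is what PowerShell calls "Unicode"); the function name and file names are placeholders.

```python
def reencode_to_utf8(src: str, dst: str, src_encoding: str = "utf-16") -> None:
    """Copy a text file, decoding it as src_encoding and writing UTF-8."""
    # The "utf-16" codec reads the byte-order mark to pick the endianness
    # and drops it, so the output file has no BOM.
    with open(src, encoding=src_encoding) as infile, \
            open(dst, "w", encoding="utf-8", newline="") as outfile:
        for line in infile:
            outfile.write(line)


if __name__ == "__main__":
    import sys

    # Usage: python reencode.py name_of_file.jsonl name_of_new_file.jsonl
    if len(sys.argv) == 3:
        reencode_to_utf8(sys.argv[1], sys.argv[2])
```

If the source file uses a different encoding, pass it as src_encoding instead.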

Owner

Medicines Optimisation Service - Austin Health
