Named Entity Recognition with Small Strongly Labeled and Large Weakly Labeled Data

Overview

Named Entity Recognition with Small Strongly Labeled and Large Weakly Labeled Data

arXiv

This is the code base for weakly supervised NER.

We provide a three stage framework:

  • Stage I: Domain continual pre-training;
  • Stage II: Noise-aware weakly supervised pre-training;
  • Stage III: Fine-tuning.

In this code base, we actually provide basic building blocks which allow arbitrary combination of different stages. We also provide examples scripts for reproducing our results in BioMedical NER.

See details in arXiv.

Performance Benchmark

BioMedical NER

Method (F1) BC5CDR-chem BC5CDR-disease NCBI-disease
BERT 89.99 79.92 85.87
bioBERT 92.85 84.70 89.13
PubMedBERT 93.33 85.62 87.82
Ours 94.17 90.69 92.28

See more in bio_script/README.md

Dependency

pytorch==1.6.0
transformers==3.3.1
allennlp==1.1.0
flashtool==0.0.10
ray==0.8.7

Install requirements

pip install -r requirements.txt

(If the allennlp and transformers are incompatible, install allennlp first and then update transformers. Since we only use some small functions of allennlp, it should works fine. )

File Structure:

├── bert-ner          #  Python Code for Training NER models
│   └── ...
└── bio_script        #  Shell Scripts for Training BioMedical NER models
    └── ...

Usage

See examples in bio_script

Hyperparameter Explaination

Here we explain hyperparameters used the scripts in ./bio_script.

Training Scripts:

Scripts

  • roberta_mlm_pretrain.sh
  • weak_weighted_selftrain.sh
  • finetune.sh

Hyperparameter

  • GPUID: Choose the GPU for training. It can also be specified by xxx.sh 0,1,2,3.
  • MASTER_PORT: automatically constructed (avoid conflicts) for distributed training.
  • DISTRIBUTE_GPU: use distributed training or not
  • PROJECT_ROOT: automatically detected, the root path of the project folder.
  • DATA_DIR: Directory of the training data, where it contains train.txt test.txt dev.txt labels.txt weak_train.txt (weak data) aug_train.txt (optional).
  • USE_DA: if augment training data by augmentation, i.e., combine train.txt + aug_train.txt in DATA_DIR for training.
  • BERT_MODEL: the model backbone, e.g., roberta-large. See transformers for details.
  • BERT_CKP: see BERT_MODEL_PATH.
  • BERT_MODEL_PATH: the path of the model checkpoint that you want to load as the initialization. Usually used with BERT_CKP.
  • LOSSFUNC: nll the normal loss function, corrected_nll noise-aware risk (i.e., add weighted log-unlikelihood regularization: wei*nll + (1-wei)*null ).
  • MAX_WEIGHT: The maximum weight of a sample in the loss.
  • MAX_LENGTH: max sentence length.
  • BATCH_SIZE: batch size per GPU.
  • NUM_EPOCHS: number of training epoches.
  • LR: learning rate.
  • WARMUP: learning rate warmup steps.
  • SAVE_STEPS: the frequency of saving models.
  • EVAL_STEPS: the frequency of testing on validation.
  • SEED: radnom seed.
  • OUTPUT_DIR: the directory for saving model and code. Some parameters will be automatically appended to the path.
    • roberta_mlm_pretrain.sh: It's better to manually check where you want to save the model.]
    • finetune.sh: It will be save in ${BERT_MODEL_PATH}/finetune_xxxx.
    • weak_weighted_selftrain.sh: It will be save in ${BERT_MODEL_PATH}/selftrain/${FBA_RULE}_xxxx (see FBA_RULE below)

There are some addition parameters need to be set for weakly supervised learning (weak_weighted_selftrain.sh).

Profiling Script

Scripts

  • profile.sh

Profiling scripts also use the same entry as the training script: bert-ner/run_ner.py but only do evaluation.

Hyperparameter Basically the same as training script.

  • PROFILE_FILE: can be train,dev,test or a specific path to a txt data. E.g., using Weak by

    PROFILE_FILE=weak_train_100.txt PROFILE_FILE=$DATA_DIR/$PROFILE_FILE

  • OUTPUT_DIR: It will be saved in OUTPUT_DIR=${BERT_MODEL_PATH}/predict/profile

Weakly Supervised Data Refinement Script

Scripts

  • profile2refinedweakdata.sh

Hyperparameter

  • BERT_CKP: see BERT_MODEL_PATH.
  • BERT_MODEL_PATH: the path of the model checkpoint that you want to load as the initialization. Usually used with BERT_CKP.
  • WEI_RULE: rule for generating weight for each weak sample.
    • uni: all are 1
    • avgaccu: confidence estimate for new labels generated by all_overwrite
    • avgaccu_weak_non_O_promote: confidence estimate for new labels generated by non_O_overwrite
  • PRED_RULE: rule for generating new weak labels.
    • non_O_overwrite: non-entity ('O') is overwrited by prediction
    • all_overwrite: all use prediction, i.e., self-training
    • no: use original weak labels
    • non_O_overwrite_all_overwrite_over_accu_xx: non_O_overwrite + if confidence is higher than xx all tokens use prediction as new labels

The generated data will be saved in ${BERT_MODEL_PATH}/predict/weak_${PRED_RULE}-WEI_${WEI_RULE} WEAK_RULE specified in weak_weighted_selftrain.sh is essential the name of folder weak_${PRED_RULE}-WEI_${WEI_RULE}.

More Rounds of Training, Try Different Combination

  1. To do training with weakly supervised data from any model checkpoint directory:
  • i) Set BERT_CKP appropriately;
  • ii) Create profile data, e.g., run ./bio_script/profile.sh for dev set and weak set
  • iii) Generate data with weak labels from profile data, e.g., run ./bio_script/profile2refinedweakdata.sh. You can use different rules to generate weights for each sample (WEI_RULE) and different rules to refine weak labels (PRED_RULE). See more details in ./ber-ner/profile2refinedweakdata.py
  • iv) Do training with ./bio_script/weak_weighted_selftrain.sh.
  1. To do fine-tuning with human labeled data from any model checkpoint directory:
  • i) Set BERT_CKP appropriately;
  • ii) Run ./bio_script/finetune.sh.

Reference

@inproceedings{Jiang2021NamedER,
  title={Named Entity Recognition with Small Strongly Labeled and Large Weakly Labeled Data},
  author={Haoming Jiang and Danqing Zhang and Tianyue Cao and Bing Yin and T. Zhao},
  booktitle={ACL/IJCNLP},
  year={2021}
}

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.

Owner
Amazon
Amazon
Official code repository for the EMNLP 2021 paper

Integrating Visuospatial, Linguistic and Commonsense Structure into Story Visualization PyTorch code for the EMNLP 2021 paper "Integrating Visuospatia

Adyasha Maharana 23 Dec 19, 2022
A simple Tensorflow based library for deep and/or denoising AutoEncoder.

libsdae - deep-Autoencoder & denoising autoencoder A simple Tensorflow based library for Deep autoencoder and denoising AE. Library follows sklearn st

Rajarshee Mitra 147 Nov 18, 2022
Implementation for paper "Towards the Generalization of Contrastive Self-Supervised Learning"

Contrastive Self-Supervised Learning on CIFAR-10 Paper "Towards the Generalization of Contrastive Self-Supervised Learning", Weiran Huang, Mingyang Yi

Weiran Huang 13 Nov 30, 2022
Torchyolo - Yolov3 ve Yolov4 modellerin Pytorch uygulamasıdır

TORCHYOLO : Yolo Modellerin Pytorch Uygulaması Yapılacaklar: Yolov3 model.py ve

Kadir Nar 3 Aug 22, 2022
Unofficial PyTorch implementation of Neural Additive Models (NAM) by Agarwal, et al.

nam-pytorch Unofficial PyTorch implementation of Neural Additive Models (NAM) by Agarwal, et al. [abs, pdf] Installation You can access nam-pytorch vi

Rishabh Anand 11 Mar 14, 2022
An official implementation of "Exploiting a Joint Embedding Space for Generalized Zero-Shot Semantic Segmentation" (ICCV 2021) in PyTorch.

Exploiting a Joint Embedding Space for Generalized Zero-Shot Semantic Segmentation This is an official implementation of the paper "Exploiting a Joint

CV Lab @ Yonsei University 35 Oct 26, 2022
PyTorch implementation for the visual prior component (i.e. perception module) of the Visually Grounded Physics Learner [Li et al., 2020].

VGPL-Visual-Prior PyTorch implementation for the visual prior component (i.e. perception module) of the Visually Grounded Physics Learner (VGPL). Give

Toru 8 Dec 29, 2022
Code for the ECIR'22 paper "Evaluating the Robustness of Retrieval Pipelines with Query Variation Generators"

Query Variation Generators This repository contains the code and annotation data for the ECIR'22 paper "Evaluating the Robustness of Retrieval Pipelin

Gustavo Penha 12 Nov 20, 2022
DLWP: Deep Learning Weather Prediction

DLWP: Deep Learning Weather Prediction DLWP is a Python project containing data-

Kushal Shingote 3 Aug 14, 2022
Group Activity Recognition with Clustered Spatial Temporal Transformer

GroupFormer Group Activity Recognition with Clustered Spatial-TemporalTransformer Backbone Style Action Acc Activity Acc Config Download Inv3+flow+pos

28 Dec 12, 2022
A PyTorch implementation of "Graph Wavelet Neural Network" (ICLR 2019)

Graph Wavelet Neural Network ⠀⠀ A PyTorch implementation of Graph Wavelet Neural Network (ICLR 2019). Abstract We present graph wavelet neural network

Benedek Rozemberczki 490 Dec 16, 2022
a pytorch implementation of auto-punctuation learned character by character

Learning Auto-Punctuation by Reading Engadget Articles Link to Other of my work 🌟 Deep Learning Notes: A collection of my notes going from basic mult

Ge Yang 137 Nov 09, 2022
Supplemental Code for "ImpressionNet :A Multi view Approach to Predict Socio Facial Impressions"

Supplemental Code for "ImpressionNet :A Multi view Approach to Predict Socio Facial Impressions" Environment requirement This code is based on Python

Rohan Kumar Gupta 1 Dec 19, 2021
Implementation of Memformer, a Memory-augmented Transformer, in Pytorch

Memformer - Pytorch Implementation of Memformer, a Memory-augmented Transformer, in Pytorch. It includes memory slots, which are updated with attentio

Phil Wang 60 Nov 06, 2022
A fast implementation of bss_eval metrics for blind source separation

fast_bss_eval Do you have a zillion BSS audio files to process and it is taking days ? Is your simulation never ending ? Fear no more! fast_bss_eval i

Robin Scheibler 99 Dec 13, 2022
A PyTorch Implementation of Gated Graph Sequence Neural Networks (GGNN)

A PyTorch Implementation of GGNN This is a PyTorch implementation of the Gated Graph Sequence Neural Networks (GGNN) as described in the paper Gated G

Ching-Yao Chuang 427 Dec 13, 2022
Source code of the paper Meta-learning with an Adaptive Task Scheduler.

ATS About Source code of the paper Meta-learning with an Adaptive Task Scheduler. If you find this repository useful in your research, please cite the

Huaxiu Yao 16 Dec 26, 2022
A selection of State Of The Art research papers (and code) on human locomotion (pose + trajectory) prediction (forecasting)

A selection of State Of The Art research papers (and code) on human trajectory prediction (forecasting). Papers marked with [W] are workshop papers.

Karttikeya Manglam 40 Nov 18, 2022
Compares various time-series feature sets on computational performance, within-set structure, and between-set relationships.

feature-set-comp Compares various time-series feature sets on computational performance, within-set structure, and between-set relationships. Reposito

Trent Henderson 7 May 25, 2022
Yolo Traffic Light Detection With Python

Yolo-Traffic-Light-Detection This project is based on detecting the Traffic light. Pretained data is used. This application entertained both real time

Ananta Raj Pant 2 Aug 08, 2022