This repository provides the official code for GeNER (an automated dataset Generation framework for NER).

Overview

GeNER

This repository provides the official code for GeNER (an automated dataset Generation framework for NER).

Overview of GeNER

GeNER allows you to build NER models for specific entity types of interest without human-labeled data and and rich dictionaries. The core idea is to ask simple natural language questions to an open-domain question answering (QA) system and then retrieve phrases and sentences, as shown in the query formulation and retrieval stages in the figure below. Please see our paper (Simple Questions Generate Named Entity Recognition Datasets) for details.

Requirements

Please follow the instructions below to set up your environment and install GeNER.

# Create a conda virtual environment
conda create -n GeNER python=3.8
conda activate GeNER

# Install PyTorch
conda install pytorch=1.9.0 cudatoolkit=11.1 -c pytorch -c conda-forge

# Install GeNER
git clone https://github.com/dmis-lab/GeNER.git
cd GeNER
pip install -r requirements.txt

NER Benchmarks

Run unzip data/benchmarks.zip -d ./data to unpack (pre-processed) NER benchmarks.

QA Model and Phrase Index: DensePhrases

We use DensePhrases and a Wikipedia index precomputed by DensePhrases in order to automatically generate NER datasets. After installing DensePhrases v1.0.0, please download the DensePhrases model (densephrases-multi-query-multi) and the phrase index (densephrases-multi_wiki-20181220) in the official DensePhrases repository.

AutoPhrase (Optional)

Using AutoPhrase in the dictionary matching stage usually improves final NER performance. If you are using AutoPhrase to apply Rule 10 (i.e., refining entity boundaries), please check the system requirements in the AutoPhrase repository. If you are not using AutoPhrase, set refine_boundary to false in a configuration file in the configs directory.

Computational Resource

Please see the resource requirement of DensePhrases and self-training, and check available resources of your machine.

  • 100GB RAM and a single 11G GPU to run DensePhrases
  • Single 9G GPU to perform self-training (based on batch size 16)

Reproducing Experiments

GeNER is implemented as a pipeline of DensePhrases, dictionary matching, and AutoPhrase. The entire pipeline is controlled by configuration files located in the configs directory. Please see configs/README.md for details.

We have already set up configuration files and optimal hyperparameters for all benchmarks and experiments so that you can easily reproduce similar or better performance to those presented in our paper. Just follow the instructions below for reproduction!

Example: low-resource NER (CoNLL-2003)

This example is intended to reproduce the experiment in the low-resource NER setting on the CoNLL-2003 benchmark. If you want to reproduce other experiments, you will need to change some arguments including --gener_config_path according to the target benchmark.

Retrieval

Running retrieve.py will create *.json and *.raw files in the data/retrieved/conll-2003 directory.

export CUDA_VISIBLE_DEVICES=0
export DENSEPHRASES_PATH={enter your densephrases path here}
export CONFIG_PATH=./configs/conll_config.json

python retrieve.py \
      --run_mode eval \
      --model_type bert \
      --cuda \
      --aggregate \
      --truecase \
      --return_sent \
      --pretrained_name_or_path SpanBERT/spanbert-base-cased \
      --dump_dir $DENSEPHRASES_PATH/outputs/densephrases-multi_wiki-20181220/dump/ \
      --index_name start/1048576_flat_OPQ96  \
      --load_dir $DENSEPHRASES_PATH/outputs/densephrases-multi-query-multi/  \
      --gener_config_path $CONFIG_PATH

Applying AutoPhrase (optional)

apply_autophrase.sh takes as input all *.raw files in the data/retrieved/conll-2003 directory and outputs *.autophrase files in the same directory.

bash autophrase/apply_autophrase.sh data/retrieved/conll-2003

Dictionary matching

Running annotate.py will create train.json and train_hf.json files in the data/annotated/conll-2003 directory. The first JSON file is used in this repository, especially in the self-training stage. The second one has the same data format as the Hugging Face Transformers library and is provided for your convenience.

python annotate.py --gener_config_path $CONFIG_PATH

Self-training

Finally, you can get the final NER model and see its performance. The model and training logs are stored in the ./outputs directory. See the Makefile file for running experiments on other benchmarks.

make conll-low

Fine-tuning GeNER

While GeNER performs well without any human-labeled data, you can further boost GeNER's performance using some training examples. The way to do this is very simple: load a trained GeNER model from the ./outputs directory and fine-tune it on training examples you have by a standard NER objective (i.e., token classification). We provide a fine-tuning script in this repository (self-training/run_ner.py) and datasets to reproduce fine-grained and few-shot NER experiments (data/fine-grained and data/few-shot directories).

export CUDA_VISIBLE_DEVICES=0

python self-training/run_ner.py \
      --data_dir data/few-shot/conll-2003/conll-2003_0 \
      --model_type bert \
      --model_name_or_path outputs/{enter GeNER model path here} \
      --output_dir outputs/{enter GeNER model path here} \
      --num_train_epochs 100 \
      --per_gpu_train_batch_size 64 \
      --per_gpu_eval_batch_size 64 \
      --learning_rate 1e-5 \
      --do_train \
      --do_eval \
      --do_test \
      --evaluate_during_training

# Note that this hyperparameter setup may not be optimal. It is recommended to search for more effective hyperparameters, especially the learning rate.

Building NER Models for Your Specific Needs

The main benefit of GeNER is that you can create NER datasets of new and different entity types you want to extract. Suppose you want to extract fighter aircraft names. The first thing you have to do is to formulate your needs as natural language questions such as "Which fighter aircraft?." At this stage, we recommend using the DensePhrases demo to manually check the feasibility of your questions. If relevant phrases are retrieved well, you can proceed to the next step.

Next, you should make a configuration file (e.g., fighter_aircraft_config.json) and set up its values. You can reflect questions you made in the configuration file as follows: "subtype": "fighter aircraft". Also, you can fine-tune some hyperparameters such as top_k and normalization rules. See configs/README.md for detailed descriptions of configuration files.

{
    "retrieved_path": "data/retrieved/{file name}",
    "annotated_path": "data/annotated/{file name}",
    "add_abbreviation": true,
    "refine_boundary" : true,
    "subquestion_configs": [
        {
            "type": "{the name of pre-defined entity type}",
            "subtype" : "fighter aircraft",
            "top_k" : 5000,
            "split_composite_mention": true,
            "remove_lowercase_phrase": true,
            "remove_the": false,
            "skip_lowercase_ngram": 1
        }
    ]
}

For subsequent steps (i.e., retrieval, dictionary matching, and self-training), refer to the CoNLL-2003 example described above.

References

Please cite our paper if you consider GeNER to be related to your work. Thanks!

@article{kim2021simple,
      title={Simple Questions Generate Named Entity Recognition Datasets}, 
      author={Hyunjae Kim and Jaehyo Yoo and Seunghyun Yoon and Jinhyuk Lee and Jaewoo Kang},
      year={2021},
      eprint={2112.08808},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contact

Feel free to email Hyunjae Kim ([email protected]) if you have any questions.

License

See the LICENSE file for details.

Owner
DMIS Laboratory - Korea University
Data Mining & Information Systems Laboratory @ Korea University
DMIS Laboratory - Korea University
Orbivator AI - To Determine which features of data (measurements) are most important for diagnosing breast cancer and find out if breast cancer occurs or not.

Orbivator_AI Breast Cancer Wisconsin (Diagnostic) GOAL To Determine which features of data (measurements) are most important for diagnosing breast can

anurag kumar singh 1 Jan 02, 2022
PyTorch implementation for the visual prior component (i.e. perception module) of the Visually Grounded Physics Learner [Li et al., 2020].

VGPL-Visual-Prior PyTorch implementation for the visual prior component (i.e. perception module) of the Visually Grounded Physics Learner (VGPL). Give

Toru 8 Dec 29, 2022
Python PID Tuner - Based on a FOPDT model obtained using a Open Loop Process Reaction Curve

PythonPID_Tuner Step 1: Takes a Process Reaction Curve in csv format - assumes data at 100ms interval (column names CV and PV) Step 2: Makes a rough e

6 Jan 14, 2022
PyTorch META-DATASET (Few-shot classification benchmark)

PyTorch META-DATASET (Few-shot classification benchmark) This repo contains a PyTorch implementation of meta-dataset and a unified implementation of s

Malik Boudiaf 39 Oct 31, 2022
Empowering journalists and whistleblowers

Onymochat Empowering journalists and whistleblowers Onymochat is an end-to-end encrypted, decentralized, anonymous chat application. You can also host

Samrat Dutta 19 Sep 02, 2022
GAN-generated image detection based on CNNs

GAN-image-detection This repository contains a GAN-generated image detector developed to distinguish real images from synthetic ones. The detector is

Image and Sound Processing Lab 17 Dec 15, 2022
Pytorch version of VidLanKD: Improving Language Understanding viaVideo-Distilled Knowledge Transfer

VidLanKD Implementation of VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer by Zineng Tang, Jaemin Cho, Hao Tan, Mohi

Zineng Tang 54 Dec 20, 2022
disentanglement_lib is an open-source library for research on learning disentangled representations.

disentanglement_lib disentanglement_lib is an open-source library for research on learning disentangled representation. It supports a variety of diffe

Google Research 1.3k Dec 28, 2022
Repositório criado para abrigar os notebooks com a listas de exercícios propostos pelo professor Gustavo Guanabara do canal Curso em Vídeo do YouTube durante o Curso de Python 3

Curso em Vídeo - Exercícios de Python 3 Sobre o repositório Este repositório contém os notebooks com a listas de exercícios propostos pelo professor G

João Pedro Pereira 9 Oct 15, 2022
Sample Code for "Pessimism Meets Invariance: Provably Efficient Offline Mean-Field Multi-Agent RL"

Sample Code for "Pessimism Meets Invariance: Provably Efficient Offline Mean-Field Multi-Agent RL" This is the official codebase for Pessimism Meets I

3 Sep 19, 2022
TensorFlow implementation of "TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?"

TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? Source: Improving Vision Transformer Efficiency and Accuracy by Learning to Tokenize

Aritra Roy Gosthipaty 23 Dec 24, 2022
KaziText is a tool for modelling common human errors.

KaziText KaziText is a tool for modelling common human errors. It estimates probabilities of individual error types (so called aspects) from grammatic

ÚFAL 3 Nov 24, 2022
A robust camera and Lidar fusion based velocity estimator to undistort the pointcloud.

Lidar with Velocity A robust camera and Lidar fusion based velocity estimator to undistort the pointcloud. related paper: Lidar with Velocity : Motion

ISEE Research Group 164 Dec 30, 2022
RepVGG: Making VGG-style ConvNets Great Again

RepVGG: Making VGG-style ConvNets Great Again (PyTorch) This is a super simple ConvNet architecture that achieves over 80% top-1 accuracy on ImageNet

2.8k Jan 04, 2023
In the case of your data having only 1 channel while want to use timm models

timm_custom Description In the case of your data having only 1 channel while want to use timm models (with or without pretrained weights), run the fol

2 Nov 26, 2021
DeRF: Decomposed Radiance Fields

DeRF: Decomposed Radiance Fields Daniel Rebain, Wei Jiang, Soroosh Yazdani, Ke Li, Kwang Moo Yi, Andrea Tagliasacchi Links Paper Project Page Abstract

UBC Computer Vision Group 24 Dec 02, 2022
Augmented CLIP - Training simple models to predict CLIP image embeddings from text embeddings, and vice versa.

Train aug_clip against laion400m-embeddings found here: https://laion.ai/laion-400-open-dataset/ - note that this used the base ViT-B/32 CLIP model. S

Peter Baylies 55 Sep 13, 2022
Source codes for the paper "Local Additivity Based Data Augmentation for Semi-supervised NER"

LADA This repo contains codes for the following paper: Jiaao Chen*, Zhenghui Wang*, Ran Tian, Zichao Yang, Diyi Yang: Local Additivity Based Data Augm

GT-SALT 36 Dec 02, 2022
Build and run Docker containers leveraging NVIDIA GPUs

NVIDIA Container Toolkit Introduction The NVIDIA Container Toolkit allows users to build and run GPU accelerated Docker containers. The toolkit includ

NVIDIA Corporation 15.6k Jan 01, 2023
Pytorch implementation of the paper DocEnTr: An End-to-End Document Image Enhancement Transformer.

DocEnTR Description Pytorch implementation of the paper DocEnTr: An End-to-End Document Image Enhancement Transformer. This model is implemented on to

Mohamed Ali Souibgui 74 Jan 07, 2023