[EMNLP 2021] Distantly-Supervised Named Entity Recognition with Noise-Robust Learning and Language Model Augmented Self-Training

Overview

RoSTER

The source code used for Distantly-Supervised Named Entity Recognition with Noise-Robust Learning and Language Model Augmented Self-Training, published in EMNLP 2021.

Requirements

At least one GPU is required to run the code.

Before running, you need to first install the required packages by typing following commands:

$ pip3 install -r requirements.txt

Python 3.6 or above is strongly recommended; using older python versions might lead to package incompatibility issues.

Reproducing the Results

The three datasets used in the paper can be found under the data directory. We provide three bash scripts run_conll.sh, run_onto.sh and run_wikigold.sh for running the model on the three datasets.

Note: Our model does not use any ground truth training/valid/test set labels but only distant labels; we provide the ground truth label files only for completeness and evaluation.

The training bash scripts assume you use one GPU for training (a GPU with around 20GB memory would be sufficient). If your GPUs have smaller memory sizes, try increasing gradient_accumulation_steps or using more GPUs (by setting the CUDA_VISIBLE_DEVICES environment variable). However, the train_batch_size should be always kept as 32.

Command Line Arguments

The meanings of the command line arguments will be displayed upon typing

python src/train.py -h

The following arguments are important and need to be set carefully:

  • train_batch_size: The effective training batch size after gradient accumulation. Usually 32 is good for different datasets.
  • gradient_accumulation_steps: Increase this value if your GPU cannot hold the training batch size (while keeping train_batch_size unchanged).
  • eval_batch_size: This argument only affects the speed of the algorithm; use as large evaluation batch size as your GPUs can hold.
  • max_seq_length: This argument controls the maximum length of sequence fed into the model (longer sequences will be truncated). Ideally, max_seq_length should be set to the length of the longest document (max_seq_length cannot be larger than 512 under RoBERTa architecture), but using larger max_seq_length also consumes more GPU memory, resulting in smaller batch size and longer training time. Therefore, you can trade model accuracy for faster training by reducing max_seq_length.
  • noise_train_epochs, ensemble_train_epochs, self_train_epochs: They control how many epochs to train the model for noise-robust training, ensemble model trianing and self-training, respectively. Their default values will be a good starting point for most datasets, but you may increase them if your dataset is small (e.g., Wikigold dataset) and decrease them if your dataset is large (e.g., OntoNotes dataset).
  • q, tau: Hyperparameters used for noise-robust training. Their default values will be a good starting point for most datasets, but you may use higher values if your dataset is more noisy and use lower values if your dataset is cleaner.
  • noise_train_update_interval, self_train_update_interval: They control how often to update training label weights in noise-robust training and compute soft labels in soft-training, respectively. Their default values will be a good starting point for most datasets, but you may use smaller values (more frequent updates) if your dataset is small (e.g., Wikigold dataset).

Other arguments can be kept as their default values.

Running on New Datasets

To execute the code on a new dataset, you need to

  1. Create a directory named your_dataset under data.
  2. Prepare a training corpus train_text.txt (one sequence per line; words separated by whitespace) and the corresponding distant label train_label_dist.txt (one sequence per line; labels separated by whitespace) under your_dataset for training the NER model.
  3. Prepare an entity type file types.txt under your_dataset (each line contains one entity type; no need to include O class; no need to prepend I-/B- to type names). The entity type names need to be consistant with those in train_label_dist.txt.
  4. (Optional) You can choose to provide a test corpus test_text.txt (one sequence per line) with ground truth labels test_label_true.txt (one sequence per line; labels separated by whitespace). If the test corpus is provided and the command line argument do_eval is turned on, the code will display evaluation results on the test set during training, which is useful for tuning hyperparameters and monitoring the training progress.
  5. Run the code with appropriate command line arguments (I recommend creating a new bash script by referring to the three example scripts).
  6. The final trained classification model will be saved as final_model.pt under the output directory specified by the command line argument output_dir.

You can always refer to the example datasets when preparing your own datasets.

Citations

Please cite the following paper if you find the code helpful for your research.

@inproceedings{meng2021distantly,
  title={Distantly-Supervised Named Entity Recognition with Noise-Robust Learning and Language Model Augmented Self-Training},
  author={Meng, Yu and Zhang, Yunyi and Huang, Jiaxin and Wang, Xuan and Zhang, Yu and Ji, Heng and Han, Jiawei},
  booktitle={Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing},
  year={2021},
}
Owner
Yu Meng
Ph.D. student, Text Mining
Yu Meng
Official implementation of NeurIPS 2021 paper "Contextual Similarity Aggregation with Self-attention for Visual Re-ranking"

CSA: Contextual Similarity Aggregation with Self-attention for Visual Re-ranking PyTorch training code for CSA (Contextual Similarity Aggregation). We

Hui Wu 19 Oct 21, 2022
NLG evaluation via Statistical Measures of Similarity: BaryScore, DepthScore, InfoLM

NLG evaluation via Statistical Measures of Similarity: BaryScore, DepthScore, InfoLM Automatic Evaluation Metric described in the papers BaryScore (EM

Pierre Colombo 28 Dec 28, 2022
😊 Python module for face feature changing

PyWarping Python module for face feature changing Installation pip install pywarping If you get an error: No such file or directory: 'cmake': 'cmake',

Dopevog 10 Sep 10, 2021
Memory-Augmented Model Predictive Control

Memory-Augmented Model Predictive Control This repository hosts the source code for the journal article "Composing MPC with LQR and Neural Networks fo

Fangyu Wu 1 Jun 19, 2022
PyTorch implementation of ICLR 2022 paper PiCO: Contrastive Label Disambiguation for Partial Label Learning

PiCO: Contrastive Label Disambiguation for Partial Label Learning This is a PyTorch implementation of ICLR 2022 Oral paper PiCO; also see our Project

王皓波 147 Jan 07, 2023
PySOT - SenseTime Research platform for single object tracking, implementing algorithms like SiamRPN and SiamMask.

PySOT is a software system designed by SenseTime Video Intelligence Research team. It implements state-of-the-art single object tracking algorit

STVIR 4.1k Dec 29, 2022
Implementation of the bachelor's thesis "Real-time stock predictions with deep learning and news scraping".

Real-time stock predictions with deep learning and news scraping This repository contains a partial implementation of my bachelor's thesis "Real-time

David Álvarez de la Torre 0 Feb 09, 2022
Official PyTorch implementation of "Evolving Search Space for Neural Architecture Search"

Evolving Search Space for Neural Architecture Search Usage Install all required dependencies in requirements.txt and replace all ..path/..to in the co

Yuanzheng Ci 10 Oct 24, 2022
PaRT: Parallel Learning for Robust and Transparent AI

PaRT: Parallel Learning for Robust and Transparent AI This repository contains the code for PaRT, an algorithm for training a base network on multiple

Mahsa 0 May 02, 2022
Node Editor Plug for Blender

NodeEditor Blender的程序化建模插件 Show Current 基本框架:自定义的tree-node-socket、tree中的node与socket采用字典查询、基于socket入度的拓扑排序 数据传递和处理依靠Tree中的字典,socket传递字典key TODO 增加更多的节点

Cuimi 11 Dec 03, 2022
LoL Runes Recommender With Python

LoL-Runes-Recommender Para ejecutar la aplicación se debe llamar a execute_app.p

Sebastián Salinas 1 Jan 10, 2022
git《USD-Seg:Learning Universal Shape Dictionary for Realtime Instance Segmentation》(2020) GitHub: [fig2]

USD-Seg This project is an implement of paper USD-Seg:Learning Universal Shape Dictionary for Realtime Instance Segmentation, based on FCOS detector f

Ruolin Ye 80 Nov 28, 2022
This repository contains the code for TABS, a 3D CNN-Transformer hybrid automated brain tissue segmentation algorithm using T1w structural MRI scans

This repository contains the code for TABS, a 3D CNN-Transformer hybrid automated brain tissue segmentation algorithm using T1w structural MRI scans. TABS relies on a Res-Unet backbone, with a Vision

6 Nov 07, 2022
ICCV2021 - Mining Contextual Information Beyond Image for Semantic Segmentation

Introduction The official repository for "Mining Contextual Information Beyond Image for Semantic Segmentation". Our full code has been merged into ss

55 Nov 09, 2022
Continuous Diffusion Graph Neural Network

We present Graph Neural Diffusion (GRAND) that approaches deep learning on graphs as a continuous diffusion process and treats Graph Neural Networks (GNNs) as discretisations of an underlying PDE.

Twitter Research 227 Jan 05, 2023
Flybirds - BDD-driven natural language automated testing framework, present by Trip Flight

Flybird | English Version 行为驱动开发(Behavior-driven development,缩写BDD),是一种软件过程的思想或者

Ctrip, Inc. 706 Dec 30, 2022
TDN: Temporal Difference Networks for Efficient Action Recognition

TDN: Temporal Difference Networks for Efficient Action Recognition Overview We release the PyTorch code of the TDN(Temporal Difference Networks).

Multimedia Computing Group, Nanjing University 326 Dec 13, 2022
Code to generate datasets used in "How Useful is Self-Supervised Pretraining for Visual Tasks?"

Synthetic dataset rendering Framework for producing the synthetic datasets used in: How Useful is Self-Supervised Pretraining for Visual Tasks? Alejan

Princeton Vision & Learning Lab 21 Apr 29, 2022
A unofficial pytorch implementation of PAN(PSENet2): Efficient and Accurate Arbitrary-Shaped Text Detection with Pixel Aggregation Network

Efficient and Accurate Arbitrary-Shaped Text Detection with Pixel Aggregation Network Requirements pytorch 1.1+ torchvision 0.3+ pyclipper opencv3 gcc

zhoujun 400 Dec 26, 2022
This project hosts the code for implementing the ISAL algorithm for object detection and image classification

Influence Selection for Active Learning (ISAL) This project hosts the code for implementing the ISAL algorithm for object detection and image classifi

25 Sep 11, 2022