CoNLL-English NER Task


Motivation

Course project
Review the PyTorch framework and the sequence-labeling task
Practice using the Hugging Face transformers library

Dataset Introduction

The data directory contains a training set, a validation set and a test set. Each file follows the CoNLL column layout:

-DOCSTART- -X- O O
-token- -pos- -chunk- -entity-
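
As a rough illustration of the four-column layout above, the sketch below groups such lines into sentences. The function name `read_conll` and the sample lines are hypothetical and are not taken from `dataTool.py`. It assumes whitespace-separated columns and a blank line between sentences.

```python
from typing import Iterable, List, Tuple

# One sentence is a list of (token, pos, chunk, entity) rows.
Sentence = List[Tuple[str, str, str, str]]


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Group CoNLL-style lines into sentences, skipping -DOCSTART- markers."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("-DOCSTART-"):
            # A blank line or a document marker closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        # Assumption: exactly four columns per data line.
        token, pos, chunk, entity = line.split()
        current.append((token, pos, chunk, entity))
    if current:
        sentences.append(current)
    return sentences


# Hypothetical sample in the same layout as the data files.
sample = """-DOCSTART- -X- O O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
""".splitlines()

sentences = read_conll(sample)
for sentence in sentences:
    print([(token, entity) for token, _, _, entity in sentence])
```

Only the first and last columns (token and entity tag) are needed for NER training. The POS and chunk columns are kept here so that nothing from the file is lost.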

Project Structure

-data  # source data
-emb # BERT model files

-util
    -dataTool.py  # data interface
    -model.py
    -trainer.py  # train and evaluate

config.py  # parameters in the project
run.py
requirement.txt

EDA.ipynb # exploratory data analysis,
          # used to settle the hyperparameters for the experiments

Coding Pattern

To keep experiments convenient and simple,
the model is decoupled into two units, an encoder and a tagger:

model ==> encoder + tagger

This way, the encoder extracts contextual and linguistic features,
which the tagger receives and turns into BIO tags.
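
A minimal PyTorch sketch of this split is shown below. It is not the code in `model.py`: the class names are hypothetical, and a small embedding plus BiLSTM stands in for the BERT encoder so that the example runs without downloading model files. The tagger here is the softmax variant; a CRF tagger would take the same features as input.

```python
import torch
from torch import nn


class ToyEncoder(nn.Module):
    """Stand-in for the BERT encoder: maps token ids to contextual features."""

    def __init__(self, vocab_size: int, hidden_size: int) -> None:
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        # Bidirectional, so each direction gets half of hidden_size.
        self.rnn = nn.LSTM(hidden_size, hidden_size // 2,
                           batch_first=True, bidirectional=True)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        features, _ = self.rnn(self.embed(input_ids))
        return features  # (batch, seq_len, hidden_size)


class SoftmaxTagger(nn.Module):
    """Scores every token against the tag set and returns raw logits."""

    def __init__(self, hidden_size: int, num_tags: int) -> None:
        super().__init__()
        self.proj = nn.Linear(hidden_size, num_tags)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.proj(features)  # (batch, seq_len, num_tags)


class NerModel(nn.Module):
    """model ==> encoder + tagger; either part can be swapped independently."""

    def __init__(self, encoder: nn.Module, tagger: nn.Module) -> None:
        super().__init__()
        self.encoder = encoder
        self.tagger = tagger

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        return self.tagger(self.encoder(input_ids))


# Hypothetical sizes: 100-word vocabulary, 16 hidden units, 9 BIO tags.
model = NerModel(ToyEncoder(vocab_size=100, hidden_size=16),
                 SoftmaxTagger(hidden_size=16, num_tags=9))
logits = model(torch.randint(0, 100, (2, 5)))  # 2 sentences, 5 tokens each
pred_tags = logits.argmax(dim=-1)              # one tag id per token
print(logits.shape, pred_tags.shape)
```

Because `NerModel` only relies on the shape of the features passed between the two parts, trying another backbone or another tagger means changing one constructor argument.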

Usage

chmod 755 deploy
./deploy

./gpu n  # monitor the GPU (refresh every n seconds)
./run  # start

Baseline Performance (1 epoch, macro average)

Model	Precision	Recall	F1
BERT-CRF	0.71	0.68	0.69
BERT-softmax	-	-	-
BERT-BiLSTM-CRF	-	-	-
BERT-BiLSTM-softmax	-	-	-
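
For reference, macro averaging computes precision, recall and F1 per class and then takes the unweighted mean, so rare tags count as much as frequent ones. The sketch below shows this at token level. The README does not say whether the scores above are token-level or entity-level, so treat that choice, the function name `macro_prf` and the sample tags as assumptions.

```python
from typing import List, Tuple


def macro_prf(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Token-level macro-averaged precision, recall and F1."""
    labels = sorted(set(gold) | set(pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        # A class with no predictions (or no gold tokens) scores 0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Hypothetical gold and predicted tags for six tokens.
gold = ["B-PER", "I-PER", "O", "B-LOC", "O", "O"]
pred = ["B-PER", "O", "O", "B-LOC", "O", "B-LOC"]

p, r, f1 = macro_prf(gold, pred)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Here macro F1 is the mean of the per-class F1 values, which is not the same as the harmonic mean of macro precision and macro recall.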

Optimization

Cost-sensitive learning, or dropping the rare classes
Dropout to improve generalization
Different backbone architectures
DDP training --> more total GPU memory, which allows a larger batch_size
More epochs --> schedule the learning rate dynamically during training
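
As one possible starting point for the cost-sensitive learning item, the sketch below derives per-tag loss weights from tag frequencies. Inverse-frequency weighting is an assumption here, not something the project already does, and the function name and tag counts are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def inverse_frequency_weights(tags: List[str]) -> Dict[str, float]:
    """Weight each tag by total / (num_classes * count), so rare tags weigh more."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: total / (len(counts) * n) for tag, n in counts.items()}


# Hypothetical tag distribution: "O" dominates, entity tags are rare.
train_tags = ["O"] * 6 + ["B-PER"] * 2 + ["I-PER"] * 2

weights = inverse_frequency_weights(train_tags)
for tag, weight in sorted(weights.items()):
    print(f"{tag}: {weight:.3f}")
```

For the softmax taggers, weights like these could be ordered by tag index and passed as the `weight` argument of `torch.nn.CrossEntropyLoss`. A CRF tagger would need a different mechanism.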

Owner

Kevin
