Nested Named Entity Recognition for Chinese Biomedical Text

Last update: Dec 25, 2022

Related tags

Overview

CBio-NAMER

CBioNAMER (Nested nAMed Entity Recognition for Chinese Biomedical Text) is our method used in CBLUE (Chinese Biomedical Language Understanding Evaluation), a benchmark of Nested Named Entity Recognition. We got the 2nd price of the benchmark by 2021/12/07. Single model CBioNAMER also achieves top20 in CBLUE. The score of CBioNAMER has surpassed human(67.0 in F1-score).

Result

Results of our method:

Results of our single model CBioNAMER:

Approach

CBioNAMER is a sub-model in our result, which is based on GlobalPointer (a powerful open-source model, thanks for author, we rewrite it with Pytorch) and MacBert.

Usage

First, install PyTorch>=1.7.0. There's no restriction on GPU or CUDA.

Then, install this repo as a Python package:

$ pip install CBioNAMER

Python package transformers==4.6.1 would be automatically installed as well.

API

The CBioNAMER package provides the following methods:

CBioNAMER.load_NER(model_save_path='./checkpoint/macbert-large_dict.pth', maxlen=512, c_size=9, id2c=_id2c, c2c=_c2c)

Returns the pretrained model. It will download the model as necessary. The model would use the first CUDA device if there's any, otherwise using CPU instead.

The model_save_path argument specifies the path of the pretrained model weight.

The maxlen argument specifies the max length of input sentences. The sentences longer than maxlen would be cut off.

The c_size argument specifies the number of entity class. Here is 9 for CBLUE.

The id2c argument specifies the mapping between id and entity class. By default, the id2c argument for CBLUE is:

_id2c = {0: 'dis', 1: 'sym', 2: 'pro', 3: 'equ', 4: 'dru', 5: 'ite', 6: 'bod', 7: 'dep', 8: 'mic'}

The c2c argument specifies the mapping between entity class and its Chinese meaning. By default, the c2c argument for CBLUE is:

_c2c = {'dis': "疾病", 'sym': "临床表现", 'pro': "医疗程序", 'equ': "医疗设备", 'dru': "药物", 'ite': "医学检验项目", 'bod': "身体", 'dep': "科室", 'mic': "微生物类"}

The model returned by CBioNAMER.load_NER() supports the following methods:

model.recognize(text: str, threshold=0)

Given a sentence, returns a list of dictionaries with recognized entity, the format of the dictionary is {'start_idx': entity's starting index, 'end_idx': entity's ending index, 'type': entity class, 'Chinese_type': Chinese meaning of entity class, 'entity': recognized entity}. The threshold argument specifies that the returned list only contains the recognized entity with confidence score higher than threshold.

model.predict_to_file(in_file: str, out_file: str)

Given input and output .json file path, the model would do inference according in_file, and the recognized entity would be saved in out_file. The output file can be submitted to CBLUE. The format of input file is like:

[
  {
    "text": "该技术的应用使某些遗传病的诊治水平得到显著提高。"
  },
    ...
  {
    "text": "There is a sentence."
  }
]

Examples

import CBioNAMER

NER = CBioNAMER.load_NER()
in_file = './CMeEE_test.json'
out_file = './CMeEE_test_answer.json'
NER.predict_to_file(in_file, out_file)

import CBioNAMER

NER = CBioNAMER.load_NER()
text = "该技术的应用使某些遗传病的诊治水平得到显著提高。"
recognized_entity = NER.recognize(text)
print(recognized_entity)
# output:[{'start_idx': 9, 'end_idx': 11, 'type': 'dis', 'Chinese_type': '疾病', 'entity': '遗传病'}]

You might also like...

Bidirectional LSTM-CRF and ELMo for Named-Entity Recognition, Part-of-Speech Tagging and so on.

anaGo anaGo is a Python library for sequence labeling(NER, PoS Tagging,...), implemented in Keras. anaGo can solve sequence labeling tasks such as nam

1.4k Feb 17, 2021

Named-entity recognition using neural networks. Easy-to-use and state-of-the-art results.

NeuroNER NeuroNER is a program that performs named-entity recognition (NER). Website: neuroner.com. This page gives step-by-step instructions to insta

1.5k Feb 17, 2021

Pytorch-Named-Entity-Recognition-with-BERT

BERT NER Use google BERT to do CoNLL-2003 NER ! Train model using Python and Inference using C++ ALBERT-TF2.0 BERT-NER-TENSORFLOW-2.0 BERT-SQuAD Requi

1.1k Dec 25, 2022

Tool to add main subject to items on Wikidata using a WMFs CirrusSearch for named entity recognition or a manually supplied list of QIDs

ItemSubjector Tool made to add main subject statements to items based on the title using a home-brewed CirrusSearch-based Named Entity Recognition alg

9 Nov 17, 2022

Implemented shortest-circuit disambiguation, maximum probability disambiguation, HMM-based lexical annotation and BiLSTM+CRF-based named entity recognition

0 Feb 13, 2022

Releases(v0.0.1)

v0.0.1(Nov 24, 2021)

Please use macbert-large_dict.pth as the pretrained model's weight and load_state_dict().

The source code of the release is out of date.
Source code(tar.gz)
Source code(zip)
macbert-large_dict.pth(1246.43 MB)

Nested Named Entity Recognition for Chinese Biomedical Text

Related tags

Overview

CBio-NAMER

Result

Approach

Usage

API

Examples

You might also like...

Bidirectional LSTM-CRF and ELMo for Named-Entity Recognition, Part-of-Speech Tagging and so on.

Named-entity recognition using neural networks. Easy-to-use and state-of-the-art results.

Pytorch-Named-Entity-Recognition-with-BERT

Tool to add main subject to items on Wikidata using a WMFs CirrusSearch for named entity recognition or a manually supplied list of QIDs

Implemented shortest-circuit disambiguation, maximum probability disambiguation, HMM-based lexical annotation and BiLSTM+CRF-based named entity recognition

Use Google's BERT for named entity recognition （CoNLL-2003 as the dataset）.

Named Entity Recognition API used by TEI Publisher

RoNER is a Named Entity Recognition model based on a pre-trained BERT transformer model trained on RONECv2

Spacy-ginza-ner-webapi - Named Entity Recognition API with spaCy and GiNZA

Releases(v0.0.1)

v0.0.1(Nov 24, 2021)

Owner

Azure Text-to-speech service for Home Assistant

Speech to text streamlit app

Code release for "COTR: Correspondence Transformer for Matching Across Images"

Trankit is a Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing

Explore different way to mix speech model(wav2vec2, hubert) and nlp model(BART,T5,GPT) together

This is the library for the Unbounded Interleaved-State Recurrent Neural Network (UIS-RNN) algorithm, corresponding to the paper Fully Supervised Speaker Diarization.

Py65 65816 - Add support for the 65C816 to py65

SimBERT升级版（SimBERTv2）！

A simple recipe for training and inferencing Transformer architecture for Multi-Task Learning on custom datasets. You can find two approaches for achieving this in this repo.

text to speech toolkit. 好用的中文语音合成工具箱，包含语音编码器、语音合成器、声码器和可视化模块。

A highly sophisticated sequence-to-sequence model for code generation

Topic Modelling for Humans

PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers

pkuseg多领域中文分词工具; The pkuseg toolkit for multi-domain Chinese word segmentation

🍊 PAUSE (Positive and Annealed Unlabeled Sentence Embedding), accepted by EMNLP'2021 🌴

Transformers and related deep network architectures are summarized and implemented here.

This repository contains the code for EMNLP-2021 paper "Word-Level Coreference Resolution"

Trained T5 and T5-large model for creating keywords from text

LightSeq: A High-Performance Inference Library for Sequence Processing and Generation

基于百度的语音识别，用python实现，pyaudio+pyqt