RoNER is a Named Entity Recognition model based on a pre-trained BERT transformer model trained on RONECv2

Overview

version bert

RoNER

RoNER is a Named Entity Recognition model based on a pre-trained BERT transformer model trained on RONECv2. It is meant to be an easy to use, high-accuracy Python package providing Romanian NER.

RoNER handles text splitting, word-to-subword alignment, and it works with arbitrarily long text sequences on CPU or GPU.

Instalation & usage

Install with: pip install roner

Run with:

20} = {word['tag']}")">
import roner
ner = roner.NER()

input_texts = ["George merge cu trenul Cluj - Timișoara de ora 6:20.", 
               "Grecia are capitala la Atena."]

output_texts = ner(input_texts)

for output_text in output_texts:
  print(f"Original text: {output_text['text']}")
  for word in output_text['words']:
    print(f"{word['text']:>20} = {word['tag']}")

RoNEC input

RoNER accepts either strings or lists of strings as input. If you pass a single string, it will convert it to a list containing this string.

RoNEC output

RoNER outputs a list of dictionary objects corresponding to the given input list of strings. A dictionary entry consists of:

>, "input_ids": < >, "words": [{ "text": < >, "tag": < > "pos": < >, "multi_word_entity": < >, "span_after": < >, "start_char": < >, "end_char": < >, "token_ids": < >, "tag_ids": < > }] }">
{
  "text": <
             
              >,
             
  "input_ids": <
             
              >,
             
  "words": [{
      "text": <
             
              >,
             
      "tag": <
             
              >
             
      "pos": <
             
              >,
             
      "multi_word_entity": <
             
              >,
             
      "span_after": <>,
      "start_char": <
              
               >,
              
      "end_char": <
              
               >,
              
      "token_ids": <
              
               >,
              
      "tag_ids": <
              
               >
              
    }]
}

This information is sufficient to save word-to-subtoken alignment, to have access to the original text as well as having other usable info such as the start and end positions for each word.

To list entities, simply iterate over all the words in the dict, printing the word itself word['text'] and its label word['tag'].

RoNER properties and considerations

Constructor options

The NER constructor has the following properties:

  • model:str Override this if you want to use your own pretrained model. Specify either a HuggingFace model or a folder location. If you use a different tag set than RONECv2, you need to also override the bio2tag_list option. The default model is dumitrescustefan/bert-base-romanian-ner
  • use_gpu:bool Set to True if you want to use the GPU (much faster!). Default is enabled; if there is no GPU found, it falls back to CPU.
  • batch_size:int How many sequences to process in parallel. On an 11GB GPU you can use batch_size = 8. Default is 4. Larger values mean faster processing - increase until you get OOM errors.
  • window_size:int Model size. BERT uses by default 512. Change if you know what you're doing. RoNER uses this value to compute overlapping windows (will overlap last quarter of the window).
  • num_workers:int How many workers to use for feeding data to GPU/CPU. Default is 0, meaning use the main process for data loading. Safest option is to leave at 0 to avoid possible errors at forking on different OSes.
  • named_persons_only:bool Set to True to output only named persons labeled with the class PERSON. This parameter is further explained below.
  • verbose:bool Set to True to get processing info. Leave it at its default False value for peace and quiet.
  • bio2tag_list:list Default None, change only if you trained your own model with different ordering of the BIO2 tags.

Implicit tokenization of texts

Please note that RoNER uses Stanza to handle Romanian tokenization into words and part-of-speech tagging. On first run, it will download not only the NER transformer model, but also Stanza's Romanian data package.

'PERSON' class handling

An important aspect that requires clarification is the handling of the PERSON label. In RONECv2, persons are not only names of persons (proper nouns, aka George Mihailescu), but also any common noun that refers to a person, such as ea, fratele or doctorul. For applications that do not need to handle this scenario, please set the named_persons_only value to True in RoNER's constructor.

What this does is use the part of speech tagging provided by Stanza and only set as PERSONs proper nouns.

Multi-word entities

Sometimes, entities span multiple words. To handle this, RoNER has a special property named multi_word_entity, which, when True, means that the current entity is linked to the previous one. Single-word entities will have this property set to False, as will the first word of multi-word entities. This is necessary to distinguish between sequential multi-word entities.

Detokenization

One particular use-case for a NER is to perform text anonymization, which means to replace entities with their label. With this in mind, RoNER has a detokenization function, that, applied to the outputs, will recreate the original strings.

To perform the anonymization, iterate through all the words, and replace the word's text with its label as in word['text'] = word['tag']. Then, simply run anonymized_texts = ner.detokenize(outputs). This will preserve spaces, new-lines and other characters.

NER accuracy metrics

Finally, because we trained the model on a modified version of RONECv2 (we performed data augumentation on the sentences, used a different training scheme and other train/validation/test splits) we are unable to compare to the standard baseline of RONECv2 as part of the original test set is now included in our training data, but we have obtained, to our knowledge, SOTA results on Romanian. This repo is meant to be used in production, and not for comparisons to other models.

BibTeX entry and citation info

Please consider citing the following paper as a thank you to the authors of the RONEC, even if it describes v1 of the corpus and you are using a model trained on v2 by the same authors:

Dumitrescu, Stefan Daniel, and Andrei-Marius Avram. "Introducing RONEC--the Romanian Named Entity Corpus." arXiv preprint arXiv:1909.01247 (2019).

or in .bibtex format:

@article{dumitrescu2019introducing,
  title={Introducing RONEC--the Romanian Named Entity Corpus},
  author={Dumitrescu, Stefan Daniel and Avram, Andrei-Marius},
  journal={arXiv preprint arXiv:1909.01247},
  year={2019}
}
Owner
Stefan Dumitrescu
Machine Learning, NLP
Stefan Dumitrescu
Exploration of BERT-based models on twitter sentiment classifications

twitter-sentiment-analysis Explore the relationship between twitter sentiment of Tesla and its stock price/return. Explore the effect of different BER

Sammy Cui 2 Oct 02, 2022
CrossNER: Evaluating Cross-Domain Named Entity Recognition (AAAI-2021)

CrossNER is a fully-labeled collected of named entity recognition (NER) data spanning over five diverse domains (Politics, Natural Science, Music, Literature, and Artificial Intelligence) with specia

Zihan Liu 89 Nov 10, 2022
This is a MD5 password/passphrase brute force tool

CROWES-PASS-CRACK-TOOl This is a MD5 password/passphrase brute force tool How to install: Do 'git clone https://github.com/CROW31/CROWES-PASS-CRACK-TO

9 Mar 02, 2022
Outreachy TFX custom component project

Schema Curation Custom Component Outreachy TFX custom component project This repo contains the code for Schema Curation Custom Component made as a par

Robert Crowe 5 Jul 16, 2021
The model is designed to train a single and large neural network in order to predict correct translation by reading the given sentence.

Neural Machine Translation communication system The model is basically direct to convert one source language to another targeted language using encode

Nishant Banjade 7 Sep 22, 2022
DELTA is a deep learning based natural language and speech processing platform.

DELTA - A DEep learning Language Technology plAtform What is DELTA? DELTA is a deep learning based end-to-end natural language and speech processing p

DELTA 1.5k Dec 26, 2022
Extracting Summary Knowledge Graphs from Long Documents

GraphSum This repo contains the data and code for the G2G model in the paper: Extracting Summary Knowledge Graphs from Long Documents. The other basel

Zeqiu (Ellen) Wu 10 Oct 21, 2022
UniSpeech - Large Scale Self-Supervised Learning for Speech

UniSpeech The family of UniSpeech: WavLM (arXiv): WavLM: Large-Scale Self-Supervised Pre-training for Full Stack Speech Processing UniSpeech (ICML 202

Microsoft 281 Dec 15, 2022
Python functions for summarizing and improving voice dictation input.

Helpmespeak Help me speak uses Python functions for summarizing and improving voice dictation input. Get started with OpenAI gpt-3 OpenAI is a amazing

Margarita Humanitarian Foundation 6 Dec 17, 2022
Auto translate textbox from Japanese to English or Indonesia

priconne-auto-translate Auto translate textbox from Japanese to English or Indonesia How to use Install python first, Anaconda is recommended Install

Aji Priyo Wibowo 5 Aug 25, 2022
The official implementation of VAENAR-TTS, a VAE based non-autoregressive TTS model.

VAENAR-TTS This repo contains code accompanying the paper "VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis". Sa

THUHCSI 138 Oct 28, 2022
nlp基础任务

NLP算法 说明 此算法仓库包括文本分类、序列标注、关系抽取、文本匹配、文本相似度匹配这五个主流NLP任务,涉及到22个相关的模型算法。 框架结构 文件结构 all_models ├── Base_line │   ├── __init__.py │   ├── base_data_process.

zuxinqi 23 Sep 22, 2022
A Japanese tokenizer based on recurrent neural networks

Nagisa is a python module for Japanese word segmentation/POS-tagging. It is designed to be a simple and easy-to-use tool. This tool has the following

325 Jan 05, 2023
Py65 65816 - Add support for the 65C816 to py65

Add support for the 65C816 to py65 Py65 (https://github.com/mnaberez/py65) is a

4 Jan 04, 2023
COVID-19 Chatbot with Rasa 2.0: open source conversational AI

COVID-19 chatbot implementation with Rasa open source 2.0, conversational AI framework.

Aazim Parwaz 1 Dec 23, 2022
translate using your voice

speech-to-text-translator Usage translate using your voice description this project makes translating a word easy, all you have to do is speak and...

1 Oct 18, 2021
Codename generator using WordNet parts of speech database

codenames Codename generator using WordNet parts of speech database References: https://possiblywrong.wordpress.com/2021/09/13/code-name-generator/ ht

possiblywrong 27 Oct 30, 2022
✨Rubrix is a production-ready Python framework for exploring, annotating, and managing data in NLP projects.

✨A Python framework to explore, label, and monitor data for NLP projects

Recognai 1.5k Jan 02, 2023
Pattern Matching in Python

Pattern Matching finalmente chega no Python 3.10. E daí? "Pattern matching", ou "correspondência de padrões" como é conhecido no Brasil. Algumas pesso

Fabricio Werneck 6 Feb 16, 2022
Predicting the usefulness of reviews given the review text and metadata surrounding the reviews.

Predicting Yelp Review Quality Table of Contents Introduction Motivation Goal and Central Questions The Data Data Storage and ETL EDA Data Pipeline Da

Jeff Johannsen 3 Nov 27, 2022