TweebankNLP - Pre-trained Tweet NLP Pipeline (NER, tokenization, lemmatization, POS tagging, dependency parsing) + Models + Tweebank-NER

Last update: Dec 20, 2022

Overview

TweebankNLP

This repo contains the new Tweebank-NER dataset and Twitter-Stanza pipeline for state-of-the-art Tweet NLP. Tweebank-NER V1.0 is the annotated NER dataset based on Tweebank V2, the main UD treebank for English Twitter NLP tasks. The Twitter-Stanza pipeline provides pre-trained Tweet NLP models (NER, tokenization, lemmatization, POS tagging, dependency parsing) with state-of-the-art or competitive performance. The models are fully compatible with Stanza and provide both Python and command-line interfaces for users.

Installation

# please install from the source
pip install -e .

# download glove and pre-trained models
sh download_twitter_resources.sh
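
After the download script finishes, it can help to confirm that the model files the pipeline expects are on disk before building a pipeline. The sketch below is an illustration only: the helper name `missing_model_files` is hypothetical, and the paths are the ones used in the Python configuration in this README, assumed here to be relative to the repository root.

```python
from pathlib import Path

# Model paths as they appear in the `en_tweet` config in this README.
# Assumption: they are relative to the repository root.
EXPECTED_MODELS = [
    "saved_models/tokenize/en_tweet_tokenizer.pt",
    "saved_models/lemma/en_tweet_lemmatizer.pt",
    "saved_models/pos/en_tweet_tagger.pt",
    "saved_models/depparse/en_tweet_parser.pt",
    "saved_models/ner/en_tweet_nertagger.pt",
]


def missing_model_files(root=".", expected=EXPECTED_MODELS):
    """Return the expected model paths that do not exist under `root`."""
    root = Path(root)
    return [path for path in expected if not (root / path).is_file()]


if __name__ == "__main__":
    missing = missing_model_files()
    if missing:
        print("Missing model files:")
        for path in missing:
            print("  " + path)
    else:
        print("All expected model files were found.")
```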

Python Interface for Twitter-Stanza

import stanza

# config for the `en_tweet` pipeline (trained only on Tweebank)
config = {
    'processors': 'tokenize,lemma,pos,depparse,ner',
    'lang': 'en',
    # treat the input as already tokenized: split on whitespace, no tokenizer model is run
    'tokenize_pretokenized': True,
    'tokenize_model_path': './saved_models/tokenize/en_tweet_tokenizer.pt',
    'lemma_model_path': './saved_models/lemma/en_tweet_lemmatizer.pt',
    'pos_model_path': './saved_models/pos/en_tweet_tagger.pt',
    'depparse_model_path': './saved_models/depparse/en_tweet_parser.pt',
    'ner_model_path': './saved_models/ner/en_tweet_nertagger.pt',
}

# Initialize the pipeline using a configuration dict
nlp = stanza.Pipeline(**config)
doc = nlp("Oh ikr like Messi better than Ronaldo but we all like Ronaldo more")
print(doc) # Look at the result
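
Printing the whole `doc` is verbose. A more compact view can be built by walking the Stanza document objects (`doc.sentences`, `sentence.words`, `doc.ents`). The sketch below is one way to do that; the helper name `summarize` is hypothetical and not part of Stanza or this repository.

```python
def summarize(doc):
    """Collect per-word annotations and entity mentions from a Stanza document.

    Returns a pair:
      words    -> list of (text, lemma, upos, head, deprel), one per word
      entities -> list of (text, type), one per named-entity span
    """
    words = [
        (word.text, word.lemma, word.upos, word.head, word.deprel)
        for sentence in doc.sentences
        for word in sentence.words
    ]
    entities = [(ent.text, ent.type) for ent in doc.ents]
    return words, entities


# Usage with the pipeline built above (requires the downloaded models):
# words, entities = summarize(nlp("Oh ikr like Messi better than Ronaldo"))
# for row in words:
#     print(*row, sep="\t")
```

Note that `head` is the 1-based index of the governing word within the same sentence, with 0 marking the root.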

Running Twitter-Stanza (Command Line Interface)

NER

We provide two pre-trained Stanza NER models:

en_tweenut17: trained on TB2+WNUT17
en_tweet: trained on TB2

source twitter-stanza/scripts/config.sh

python stanza/utils/training/run_ner.py en_tweenut17 \
--mode predict \
--score_test \
--wordvec_file ../data/wordvec/English/en.twitter100d.xz \
--eval_file data/ner/en_tweet.test.json
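
NER output is usually inspected as entity spans rather than as per-token tags. The sketch below groups a tagged token sequence into spans, assuming the tags follow the BIO scheme (`B-X`, `I-X`, `O`). Both the scheme and the `PER` label in the usage lines are assumptions made for illustration, and the helper name `bio_to_spans` is hypothetical; this is not the scorer that `run_ner.py` uses.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, label) spans.

    A stray `I-X` tag that does not continue a span of type X is treated
    as the start of a new span, which is a lenient reading of BIO.
    """
    spans = []
    start, label = None, None
    # The trailing "O" is a sentinel that closes a span ending on the last token.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((" ".join(tokens[start:i]), label))
            start, label = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    return spans


# Illustrative usage with made-up tags:
# tokens = ["Messi", "better", "than", "Cristiano", "Ronaldo"]
# tags = ["B-PER", "O", "O", "B-PER", "I-PER"]
# bio_to_spans(tokens, tags)
```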

Syntactic NLP Models

We provide two sets of pre-trained models, each covering the four tasks below:

tweet_ewt: trained on TB2+UD-English-EWT
en_tweet: trained on TB2

1. Tokenization

python stanza/utils/training/run_tokenizer.py tweet_ewt \
--mode predict \
--score_test \
--txt_file data/tokenize/en_tweet.test.txt \
--label_file  data/tokenize/en_tweet-ud-test.toklabels \
--no_use_mwt

2. Lemmatization

python stanza/utils/training/run_lemma.py tweet_ewt \
--mode predict \
--score_test \
--gold_file data/depparse/en_tweet.test.gold.conllu \
--eval_file data/depparse/en_tweet.test.in.conllu

3. POS Tagging

python stanza/utils/training/run_pos.py tweet_ewt \
--mode predict \
--score_test \
--eval_file data/pos/en_tweet.test.in.conllu \
--gold_file data/depparse/en_tweet.test.gold.conllu

4. Dependency Parsing

python stanza/utils/training/run_depparse.py tweet_ewt \
--mode predict \
--score_test \
--wordvec_file ../data/wordvec/English/en.twitter100d.txt \
--eval_file data/depparse/en_tweet.test.in.conllu \
--gold_file data/depparse/en_tweet.test.gold.conllu
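
Dependency parsers are commonly compared with unlabeled and labeled attachment scores (UAS and LAS). The sketch below computes both from a gold and a predicted CoNLL-U file. It is a simplified illustration, not the scorer that Stanza runs: it assumes both files contain the same sentences with identical tokenization, and the function names `read_conllu` and `attachment_scores` are hypothetical.

```python
def read_conllu(text):
    """Parse CoNLL-U text into sentences, each a list of (head, deprel) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:  # a blank line ends the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):  # sentence-level comment
            continue
        cols = line.split("\t")
        if not cols[0].isdigit():  # skip multiword ranges (1-2) and empty nodes (1.1)
            continue
        current.append((int(cols[6]), cols[7]))  # HEAD and DEPREL columns
    if current:
        sentences.append(current)
    return sentences


def attachment_scores(gold, pred):
    """Return (UAS, LAS) for two parses of the same tokens."""
    total = correct_heads = correct_labeled = 0
    for gold_sent, pred_sent in zip(gold, pred):
        if len(gold_sent) != len(pred_sent):
            raise ValueError("gold and predicted sentences differ in length")
        for (g_head, g_rel), (p_head, p_rel) in zip(gold_sent, pred_sent):
            total += 1
            if g_head == p_head:
                correct_heads += 1
                if g_rel == p_rel:
                    correct_labeled += 1
    if total == 0:
        raise ValueError("no tokens to score")
    return correct_heads / total, correct_labeled / total


# Usage, assuming `gold_path` and `pred_path` point to CoNLL-U files:
# gold = read_conllu(open(gold_path, encoding="utf-8").read())
# pred = read_conllu(open(pred_path, encoding="utf-8").read())
# uas, las = attachment_scores(gold, pred)
```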

Training Twitter-Stanza

Please refer to TRAIN_README.md for instructions on training the Twitter-Stanza neural pipeline.

References

If you use this repository in your research, please cite our paper as well as the Stanza papers.

@article{jiang2022tweebank,
    title={Annotating the Tweebank Corpus on Named Entity Recognition and Building NLP Models for Social Media Analysis},
    author={Jiang, Hang and Hua, Yining and Beeferman, Doug and Roy, Deb},
    publisher={arXiv},
    year={2022}
}

Acknowledgement

The Twitter-Stanza pipeline is a friendly fork of the Stanza library with a few modifications to adapt it to tweets. The repository is fully compatible with Stanza. This research project is funded by the MIT Center for Constructive Communication (CCC).

Owner

Laboratory for Social Machines
