Pytorch-NLU

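Pytorch-NLU is a minimal natural language processing toolkit focused on text classification and sequence labeling. It depends only on pytorch, transformers, numpy and tensorboardX. It supports pretrained models such as BERT, ERNIE, ROBERTA, NEZHA, ALBERT, XLNET, ELECTRA, GPT-2, TinyBERT, XLM and T5, and loss functions such as BCE-Loss, Focal-Loss, Circle-Loss, Prior-Loss, Dice-Loss and LabelSmoothing. Its design goals are lightweight dependencies, concise code, detailed comments, clear debugging, flexible configuration, easy extension and a good fit for NLP tasks.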

Installation

pip install Pytorch-NLU

# Tsinghua mirror
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple Pytorch-NLU

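Data

Data sources

Disclaimer: the datasets below were collected from public channels and are listed for illustration only. For scientific research or commercial use, please contact the original authors. If anything here infringes your rights, please get in touch promptly so it can be removed.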

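Text classification

baidu_event_extract_2020: uses the data of the event extraction task from the 2020 Language and Intelligence Challenge as sample data for multi-label classification, solved with a multi-label classification model; 13456 samples, 65 classes;
AAPD-dataset: introduced in the paper "SGM: Sequence Generation Model for Multi-label Classification"; English multi-label classification corpus; 55840 samples, 54 classes;
toutiao-news: Toutiao (今日头条) news titles; multi-label classification corpus; about 3 million samples, 1000+ classes;
unknow-data: source unknown; multi-label classification corpus; about 22339 samples, 7 classes;
SMP2018 Evaluation of Chinese Human-Computer Dialogue Technology (ECDT): competition corpus of SMP2018-ECDT; short-text intent recognition; multi-class classification; 3069 samples, 31 classes;
Text classification corpus (Fudan): news corpus provided by the natural language processing group of the International Database Center, Department of Computer Information and Technology, Fudan University; multi-class classification corpus; 9804 documents in 20 classes;
MiningZhiDaoQACorpus: question-answering corpus from Baidu Zhidao, compiled by Liu Huanyong (Institute of Software, Chinese Academy of Sciences); the domain can be treated as the class; multi-class classification corpus; 1 million+ samples, 17 classes;
THUCNEWS: compiled by the natural language processing lab of Tsinghua University, filtered from historical data of the Sina News RSS feeds from 2005 to 2011; multi-class classification corpus; 740k news documents, 14 classes;
IFLYTEK: long-text classification corpus open-sourced by iFLYTEK; annotated app descriptions covering a range of everyday-life topics; available via CLUE; 17333 examples, 119 classes;
TNEWS: Chinese news title classification corpus provided by Toutiao, drawn from Toutiao's news section; available via CLUE; 73360 examples, 15 classes;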

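Sequence labeling

Corpus_China_People_Daily: the People's Daily annotated corpus PFR released by the Institute of Computational Linguistics, Peking University; drawn from People's Daily content of the first half of 1998, 2014, and the first half of 2015, 2016.1, 2017.1 and 2018.1 (the New Era People's Daily segmentation corpus, NEPD); includes Chinese word segmentation (cws), part-of-speech tagging (pos), named entity recognition (ner) and other annotations;
Corpus_CTBX: the Chinese Treebank developed by the University of Pennsylvania (UPenn) and released through the Linguistic Data Consortium (LDC); sources include newswire, news magazines, broadcast news, broadcast conversation, microblogs, forums, chat dialogue and telephone data; includes Chinese word segmentation (cws), part-of-speech tagging (pos), named entity recognition (ner) and other annotations;
NER-Weibo: Chinese social media (Weibo) named entity recognition dataset (Weibo-NER-2015); contains 1890 messages collected from Weibo between November 2013 and December 2014; two versions (weiboNER.conll and weiboNER_2nd_conll); 1890 examples, 3 labels;
NER-CLUE: fine-grained Chinese named entity recognition (CLUE-NER-2020); annotated by CLUE on a selection of the THUCTC dataset (the news text classification dataset open-sourced by Tsinghua University); 12091 examples, 10 labels;
NER-Literature: discourse-level entity recognition dataset for Chinese literature (Literature-NER-2017); 726 articles filtered from more than 1000 Chinese literary articles found on websites; 29096 samples, 7 labels;
NER-Resume: Chinese resume entity recognition dataset (Resume-NER-2018); resume summaries of senior executives of listed companies from Sina Finance; 1027 examples, 8 labels;
NER-BosonN: Chinese news entity recognition dataset (Boson-NER-2012); dataset BosonNLP_NER_6C; adds labels such as time, company name and product name; 2000 examples, 6 labels;
NER-MSRA: Chinese news entity recognition dataset (MSRA-NER-2005) released by Microsoft Research Asia (MSRA); 55289 examples; the common version has 3 labels, the full version has 26;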

Data format

1. Text classification (txt format, one JSON object per line):

Multi-class format:
{"text": "人站在地球上为什么没有头朝下的感觉", "label": "教育"}
{"text": "我的小baby", "label": "娱乐"}
{"text": "请问这起交通事故是谁的责任居多小车和摩托车发生事故在无红绿灯", "label": "娱乐"}

Multi-label format (labels joined with the |myz| separator):
{"label": "3|myz|5", "text": "课堂搞东西，没认真听"}
{"label": "3|myz|2", "text": "测验90-94.A-"}
{"label": "3|myz|2", "text": "长江作业未交"}
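As a minimal sketch of how such lines could be read (this is not the toolkit's own loader, and `parse_tc_line` is a hypothetical helper), the snippet below parses one JSON record and splits the label on the `|myz|` separator. It assumes a record counts as multi-label whenever the separator appears in its label:

```python
import json

LABEL_SEP = "|myz|"  # separator seen in the multi-label samples above


def parse_tc_line(line):
    """Parse one JSON line into (text, labels, is_multi_label).

    Hypothetical helper for illustration; the toolkit's loader may differ.
    """
    record = json.loads(line)
    label = record["label"]
    labels = label.split(LABEL_SEP)
    return record["text"], labels, LABEL_SEP in label


single = parse_tc_line('{"text": "我的小baby", "label": "娱乐"}')
multi = parse_tc_line('{"label": "3|myz|5", "text": "课堂搞东西，没认真听"}')
print(single)
print(multi)
```

Because the key order differs between the two sample formats, the sketch looks fields up by name instead of by position.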

2. Sequence labeling (txt format; in the SPAN format each line is one JSON object):

SPAN format:
{"label": [{"type": "ORG", "ent": "市委", "pos": [10, 11]}, {"type": "PER", "ent": "张敬涛", "pos": [14, 16]}], "text": "去年十二月二十四日，市委书记张敬涛召集县市主要负责同志研究信访工作时，提出三问：『假如上访群众是我们的父母姐妹，你会用什么样的感情对待他们？"}
{"label": [{"type": "PER", "ent": "金大中", "pos": [5, 7]}], "text": "今年2月，金大中新政府成立后，社会舆论要求惩治对金融危机负有重大责任者。"}
{"label": [], "text": "与此同时，作者同一题材的长篇侦破小说《鱼孽》也出版发行。"}
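The samples suggest that `pos` holds inclusive `[start, end]` character offsets into `text` (for example, `[5, 7]` covers the three characters of 金大中). Under that assumption, a SPAN record could be turned into per-character BIO tags roughly as follows. `span_to_bio` is a hypothetical helper, and the text is a shortened version of the second sample:

```python
import json


def span_to_bio(record):
    """Convert one SPAN record to (char, tag) pairs with BIO tags.

    Assumes "pos" is an inclusive [start, end] pair of character offsets.
    """
    text = record["text"]
    tags = ["O"] * len(text)
    for ent in record["label"]:
        start, end = ent["pos"]
        tags[start] = "B-" + ent["type"]
        for i in range(start + 1, end + 1):
            tags[i] = "I-" + ent["type"]
    return list(zip(text, tags))


line = '{"label": [{"type": "PER", "ent": "金大中", "pos": [5, 7]}], "text": "今年2月，金大中新政府成立后"}'
pairs = span_to_bio(json.loads(line))
print(pairs[4:9])
```

Overlapping or nested entities would overwrite each other in this sketch, so it only fits flat annotations like the ones shown.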

CONLL format (one character and its tag per line):
青 B-ORG
岛 I-ORG
海 I-ORG
牛 I-ORG
队 I-ORG
和 O
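A small sketch of reading this layout and recovering entities from the BIO tags is shown below. It assumes whitespace-separated `char tag` lines with a blank line between sentences, which is the usual CoNLL convention but is not stated in the text above. Both function names are hypothetical:

```python
def read_conll(lines):
    """Group "char tag" lines into sentences; a blank line ends a sentence."""
    sentences, current = [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        char, tag = line.split()
        current.append((char, tag))
    if current:
        sentences.append(current)
    return sentences


def extract_entities(sentence):
    """Collect (type, text) pairs from a BIO-tagged sentence."""
    entities, chars, ent_type = [], [], None
    for char, tag in sentence:
        if tag.startswith("B-"):
            if chars:
                entities.append((ent_type, "".join(chars)))
            chars, ent_type = [char], tag[2:]
        elif tag.startswith("I-") and chars and tag[2:] == ent_type:
            chars.append(char)
        else:
            if chars:
                entities.append((ent_type, "".join(chars)))
            chars, ent_type = [], None
    if chars:
        entities.append((ent_type, "".join(chars)))
    return entities


sample = ["青 B-ORG", "岛 I-ORG", "海 I-ORG", "牛 I-ORG", "队 I-ORG", "和 O", ""]
sentences = read_conll(sample)
print(extract_entities(sentences[0]))
```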

Usage

More samples can be found in the /test directory.

Text classification (TC), text-classification

# !/usr/bin/python
# -*- coding: utf-8 -*-
# @time    : 2021/2/23 21:34
# @author  : Mo
# @function: multi-label classification; the task is treated as multi-class or
#            multi-label depending on whether the label contains the |myz| separator


# Linux compatibility: put the package directory on sys.path
import platform
import json
import sys
import os
path_root = os.path.abspath(os.path.join(os.path.dirname(__file__), "../.."))
sys.path.append(os.path.join(path_root, "pytorch_textclassification"))
print(path_root)
# imports from the classification package, pytorch_textclassification
from tcTools import get_current_time
from tcRun import TextClassification
from tcConfig import model_config

evaluate_steps = 320  # evaluate every N steps
save_steps = 320  # save every N steps
# pretrained model name or directory (PyTorch), required
pretrained_model_name_or_path = "bert-base-chinese"
# model type, matching the pretrained model above
model_type = "BERT"
# training / validation corpus paths; the training path alone is enough
path_corpus = os.path.join(path_root, "corpus", "text_classification", "school")
path_train = os.path.join(path_corpus, "train.json")
path_dev = os.path.join(path_corpus, "dev.json")


if __name__ == "__main__":

    model_config["evaluate_steps"] = evaluate_steps  # evaluation interval in steps
    model_config["save_steps"] = save_steps  # checkpoint interval in steps
    model_config["path_train"] = path_train  # training corpus, required
    model_config["path_dev"] = path_dev      # validation corpus, may be None
    model_config["path_tet"] = None          # test corpus, may be None
    # loss function type
    # multi-class:  None(BCE), BCE, BCE_LOGITS, MSE, FOCAL_LOSS, DICE_LOSS, LABEL_SMOOTH
    # multi-label:  SOFT_MARGIN_LOSS, PRIOR_MARGIN_LOSS, FOCAL_LOSS, CIRCLE_LOSS, DICE_LOSS, etc.
    # assumption: the config key for the loss is "loss_type"; verify against tcConfig.py
    model_config["loss_type"] = "FOCAL_LOSS"
    os.environ["CUDA_VISIBLE_DEVICES"] = str(model_config["CUDA_VISIBLE_DEVICES"])

    model_config["pretrained_model_name_or_path"] = pretrained_model_name_or_path
    model_config["model_save_path"] = "../output/text_classification/model_{}".format(model_type)
    model_config["model_type"] = model_type
    # main
    lc = TextClassification(model_config)
    lc.process()
    lc.train()
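For readers unfamiliar with the FOCAL_LOSS option listed in the comments above, the sketch below shows the standard binary focal loss from Lin et al., "Focal Loss for Dense Object Detection", in plain Python. It illustrates the general formula only and is not the toolkit's implementation. The defaults `gamma=2.0` and `alpha=0.25` are the values commonly quoted from that paper, used here as assumptions:

```python
import math


def binary_focal_loss(prob, target, gamma=2.0, alpha=0.25):
    """Focal loss for one predicted probability and a 0/1 target.

    FL = -alpha_t * (1 - p_t) ** gamma * log(p_t), where p_t is the
    probability the model assigns to the true class.
    """
    p_t = prob if target == 1 else 1.0 - prob
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = binary_focal_loss(0.95, 1)  # confident and correct
hard = binary_focal_loss(0.30, 1)  # poorly classified positive
print(easy, hard)
```

The `(1 - p_t) ** gamma` factor shrinks the contribution of well-classified examples, so training concentrates on hard ones. With `gamma=0` and `alpha=1` the expression reduces to ordinary cross-entropy for positive targets.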

Sequence labeling (SL), sequence-labeling

# Linux compatibility: put the package directory on sys.path
import platform
import json
import sys
import os
path_root = os.path.abspath(os.path.join(os.path.dirname(__file__), "../.."))
path_sys = os.path.join(path_root, "pytorch_sequencelabeling")
sys.path.append(path_sys)
print(path_root)
print(path_sys)
# imports from the sequence labeling package, pytorch_sequencelabeling
from slTools import get_current_time
from slRun import SequenceLabeling
from slConfig import model_config

evaluate_steps = 320  # evaluate every N steps
save_steps = 320  # save every N steps
# pretrained model name or directory (PyTorch), required
pretrained_model_name_or_path = "bert-base-chinese"
# model type; assumed to be "BERT" here to match the pretrained model above
model_type = "BERT"
# training / validation corpus paths; the training path alone is enough
path_corpus = os.path.join(path_root, "corpus", "sequence_labeling", "ner_china_people_daily_1998_conll")
path_train = os.path.join(path_corpus, "train.conll")
path_dev = os.path.join(path_corpus, "dev.conll")


if __name__ == "__main__":

    model_config["evaluate_steps"] = evaluate_steps  # evaluation interval in steps
    model_config["save_steps"] = save_steps  # checkpoint interval in steps
    model_config["path_train"] = path_train  # training corpus, required
    model_config["path_dev"] = path_dev      # validation corpus, may be None
    model_config["path_tet"] = None          # test corpus, may be None
    # one format: file name ends with .conll, or corpus_type == "DATA-CONLL"
    # the other format: file name ends with .span, or corpus_type == "DATA-SPAN"
    model_config["corpus_type"] = "DATA-CONLL"  # corpus format, "DATA-CONLL" or "DATA-SPAN"
    model_config["task_type"] = "SL-CRF"     # task type, "SL-SOFTMAX", "SL-CRF" or "SL-SPAN"

    model_config["dense_lr"] = 1e-3  # learning rate of the last layer (CRF / dense layer), e.g. 1e-5, 1e-4, 1e-3
    model_config["lr"] = 1e-5        # learning rate, e.g. 1e-5, 2e-5, 5e-5, 8e-5, 1e-4, 4e-4
    # max text length: None or -1 picks the length covering 0.95 of the data automatically,
    # 0 uses the longest text in the training corpus, any other value forces padding to max_len
    model_config["max_len"] = 156

    model_config["pretrained_model_name_or_path"] = pretrained_model_name_or_path
    model_config["model_save_path"] = "../output/sequence_labeling/model_{}".format(model_type)
    model_config["model_type"] = model_type
    # main
    lc = SequenceLabeling(model_config)
    lc.process()
    lc.train()
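The `max_len` comment above mentions an automatic mode that picks the length covering 0.95 of the data. One way such a length could be computed is sketched below. `auto_max_len` is a hypothetical helper, and the toolkit's own calculation may differ, for instance by accounting for special tokens:

```python
import math


def auto_max_len(texts, coverage=0.95):
    """Smallest length L such that at least `coverage` of texts have len <= L."""
    lengths = sorted(len(t) for t in texts)
    index = math.ceil(coverage * len(lengths)) - 1
    return lengths[max(index, 0)]


texts = ["a" * n for n in range(1, 101)]  # toy corpus with lengths 1..100
print(auto_max_len(texts))                # length covering 95% of the texts
print(max(len(t) for t in texts))         # longest text, the max_len = 0 case
```

Choosing a coverage below 1.0 keeps a few very long outliers from inflating the padded length of every batch, at the cost of truncating those outliers.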

Papers

Text classification (TC, text-classification)

FastText: Bag of Tricks for Efficient Text Classification
TextCNN: Convolutional Neural Networks for Sentence Classification
charCNN-kim: Character-Aware Neural Language Models
charCNN-zhang: Character-level Convolutional Networks for Text Classification
TextRNN: Recurrent Neural Network for Text Classification with Multi-Task Learning
RCNN: Recurrent Convolutional Neural Networks for Text Classification
DCNN: A Convolutional Neural Network for Modelling Sentences
DPCNN: Deep Pyramid Convolutional Neural Networks for Text Categorization
VDCNN: Very Deep Convolutional Networks for Text Classification
CRNN: A C-LSTM Neural Network for Text Classification
DeepMoji: Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm
SelfAttention: Attention Is All You Need
HAN: Hierarchical Attention Networks for Document Classification
CapsuleNet: Dynamic Routing Between Capsules
TextGCN: Graph Convolutional Networks for Text Classification
Transformer(encode or decode): Attention Is All You Need
Bert: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
ERNIE: ERNIE: Enhanced Representation through Knowledge Integration
Xlnet: XLNet: Generalized Autoregressive Pretraining for Language Understanding
Albert: ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
RoBERTa: RoBERTa: A Robustly Optimized BERT Pretraining Approach
ELECTRA: ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
GPT-2: Language Models are Unsupervised Multitask Learners
T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Sequence labeling (SL, sequence-labeling)

Bi-LSTM-CRF: Bidirectional LSTM-CRF Models for Sequence Tagging
Bi-LSTM-LAN: Hierarchically-Refined Label Attention Network for Sequence Labeling
CNN-LSTM: End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF
DGCNN: Multi-Scale Context Aggregation by Dilated Convolutions
CRF: Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
Biaffine-NER: Named Entity Recognition as Dependency Parsing
Lattice-LSTM: Chinese NER Using Lattice LSTM
WC-LSTM: An Encoding Strategy Based Word-Character LSTM for Chinese NER
Lexicon: Simple-Lexicon: Simplify the Usage of Lexicon in Chinese NER
FLAT: FLAT: Chinese NER Using Flat-Lattice Transformer
MRC: A Unified MRC Framework for Named Entity Recognition

References

Keras and TensorFlow version compatibility: https://docs.floydhub.com/guides/environments/
BERT-NER-Pytorch: https://github.com/lonePatient/BERT-NER-Pytorch
bert4keras: https://github.com/bojone/bert4keras
Kashgari: https://github.com/BrikerMan/Kashgari
fastNLP: https://github.com/fastnlp/fastNLP
HanLP: https://github.com/hankcs/HanLP
scikit-learn: https://github.com/scikit-learn/scikit-learn
tqdm: https://github.com/tqdm/tqdm

Reference

For citing this work, you can refer to the present GitHub project. For example, with BibTeX:

@software{Pytorch-NLU,
    url = {https://github.com/yongzhuo/Pytorch-NLU},
    author = {Yongzhuo Mo},
    title = {Pytorch-NLU},
    year = {2021}
}

*Hope this helps!

v0.0.1 (Sep 27, 2021)
The initial release of Pytorch-NLU, v0.0.1.

Pytorch-NLU, a Chinese text classification and sequence labeling toolkit, supports multi-class and multi-label classification of long and short Chinese texts, as well as sequence labeling tasks such as Chinese named entity recognition, part-of-speech tagging and word segmentation.
