A collection of natural language processing (NLP) models for Classical Chinese, gathering related models and resources available on the Internet.

Overview





For more information, see:

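Pretrained language models for Classical Chinese

Pretrained language models are the base models for all kinds of Classical Chinese tasks. They deliver their full value only after being fine-tuned on data for a specific downstream task. This section lists all publicly available pretrained language models for Classical Chinese found on the Internet: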

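| Name | Simplified/Traditional | Download | Notes |
|---|---|---|---|
| guwenbert-base | | Hugging Face | Trained on the Daizhige corpus, starting from a Chinese model |
| guwenbert-large | | Hugging Face | |
| guwenbert-fs-base | | OneDrive | Trained from scratch on the Daizhige corpus |
| roberta-classical-chinese-base-char | Simplified and Traditional | Hugging Face | Trained from guwenbert, with the vocabulary extended to Traditional characters |
| roberta-classical-chinese-large-char | Simplified and Traditional | Hugging Face | |
| sikubert | | Hugging Face | Trained on the Siku Quanshu corpus, starting from a Chinese model |
| sikuroberta | | Hugging Face | |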
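As a rough sketch of where fine-tuning on a downstream task starts, the helper below loads a pretrained checkpoint together with a freshly initialized token-classification head, using the Hugging Face Transformers `Auto*` classes. The hub ID `ethanyt/guwenbert-base` and the default label count are assumptions that do not come from this page; substitute the ID shown on the model's download page.

```python
# Assumption: this hub ID is illustrative; use the ID shown on the model's page.
DEFAULT_MODEL_ID = "ethanyt/guwenbert-base"


def load_for_token_classification(model_id: str = DEFAULT_MODEL_ID, num_labels: int = 2):
    """Load a tokenizer and a model with an untrained token-classification head.

    The encoder weights come from the pretrained checkpoint. The classification
    head is randomly initialized and still has to be fine-tuned on task data.
    """
    # Imported lazily so that defining this helper does not require the package.
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForTokenClassification.from_pretrained(
        model_id, num_labels=num_labels
    )
    return tokenizer, model
```

Calling `load_for_token_classification()` downloads the checkpoint on first use, so it needs network access and the `transformers` package with a backend such as PyTorch.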

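Application models for Classical Chinese

Application models are obtained by fine-tuning a pretrained Classical Chinese model on domain-specific data, and they cover a range of practical Classical Chinese applications. The training data for the guwen-X models can be downloaded from CCLUE. If the input contains Traditional characters, first convert it with the tools listed at the bottom of this page.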

Sentence segmentation

guwen-seg: a sentence segmentation model based on guwenbert-fs-base.
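A segmentation model of this kind is usually applied as a per-character tagger, so its predictions still have to be turned back into segmented text. The sketch below shows that post-processing step under an assumed convention of one boolean flag per character, where `True` means "a break follows this character". The label set that guwen-seg actually uses is not stated here, so treat the flag format as hypothetical.

```python
from typing import List, Sequence


def apply_breaks(text: str, break_after: Sequence[bool], sep: str = "/") -> str:
    """Insert `sep` after every character whose flag is True."""
    if len(text) != len(break_after):
        raise ValueError("need exactly one flag per character")
    pieces: List[str] = []
    for char, flag in zip(text, break_after):
        pieces.append(char)
        if flag:
            pieces.append(sep)
    return "".join(pieces)


# Hypothetical flags (not real model output) for an unpunctuated Analects line.
text = "学而时习之不亦说乎"
flags = [False, False, False, False, True, False, False, False, True]
print(apply_breaks(text, flags))  # 学而时习之/不亦说乎/
```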

Punctuation

guwen-punc: a punctuation model based on guwenbert-fs-base.

Quotation mark detection

guwen-quote: a quotation mark detection model based on guwenbert-fs-base.

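Note: as the figure below shows, the sequence-labeling model that ships with Transformers makes some errors, so decode with a CRF model in real-world use. For related code, see crf_example.ipynb.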
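To illustrate why CRF decoding helps, here is a minimal Viterbi decoder in plain Python. It is a sketch of the general technique, not the code in crf_example.ipynb. Picking the best label at each position independently can produce sequences that are invalid as a whole, whereas Viterbi decoding scores whole sequences using label-to-label transition scores. The label indices and all scores below are made up for the example.

```python
from typing import List, Sequence


def viterbi_decode(
    emissions: Sequence[Sequence[float]],
    transitions: Sequence[Sequence[float]],
) -> List[int]:
    """Return the highest-scoring label sequence.

    emissions[t][j]   : score of label j at position t
    transitions[i][j] : score of moving from label i to label j
    """
    if not emissions:
        return []
    num_labels = len(emissions[0])
    score = list(emissions[0])  # best score of any path ending in each label
    backpointers: List[List[int]] = []
    for emission in emissions[1:]:
        new_score, pointers = [], []
        for j in range(num_labels):
            best_prev = max(
                range(num_labels), key=lambda i: score[i] + transitions[i][j]
            )
            new_score.append(score[best_prev] + transitions[best_prev][j] + emission[j])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    # Trace the best path backwards from the best final label.
    best = max(range(num_labels), key=lambda j: score[j])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    path.reverse()
    return path


# Hypothetical label indices: 0 = O, 1 = B, 2 = I. Moving from O to I is penalized.
emissions = [[1.0, 0.8, 0.0], [0.2, 0.0, 1.0]]
transitions = [[0.0, 0.0, -10.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

greedy = [max(range(3), key=lambda j: row[j]) for row in emissions]
print(greedy)                                  # [0, 2]  (O then I: invalid)
print(viterbi_decode(emissions, transitions))  # [1, 2]  (B then I)
```

In a trained CRF the transition scores are learned with the model. Here they are hand-set only to show the effect.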

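Named entity recognition

guwen-ner: a named entity recognition model based on guwenbert-base.

Note: for best performance, CRF decoding is recommended in real-world use. For related code, see crf_example.ipynb.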
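Once a tag sequence has been decoded, it usually needs to be grouped into entity spans. The sketch below does this for per-character BIO tags. The tag names (`B-PER`, `I-LOC`, and so on) are hypothetical, since the label set of guwen-ner is not listed here.

```python
from typing import List, Sequence, Tuple


def bio_to_spans(text: str, tags: Sequence[str]) -> List[Tuple[str, str]]:
    """Group per-character BIO tags into (entity_text, entity_type) pairs."""
    if len(text) != len(tags):
        raise ValueError("need exactly one tag per character")
    spans: List[Tuple[str, str]] = []
    start = None  # index where the currently open span begins
    # A trailing "O" sentinel closes a span that runs to the end of the text.
    for i, tag in enumerate(list(tags) + ["O"]):
        if start is not None and tag == "I-" + tags[start][2:]:
            continue  # the open span continues
        if start is not None:
            spans.append((text[start:i], tags[start][2:]))
            start = None
        if tag.startswith("B-"):
            start = i
        # An "I-" tag with no matching open span is ignored.
    return spans


# Hypothetical tags (not real model output).
text = "项籍者下相人也"
tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O", "O"]
print(bio_to_spans(text, tags))  # [('项籍', 'PER'), ('下相', 'LOC')]
```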

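Text classification

guwen-cls: a Classical Chinese text classification model based on guwenbert-fs-base.

Sentiment classification of classical poetry

guwen-sent: a sentiment classification model for classical poetry based on guwenbert-base.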

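Other Classical Chinese resources

  • OpenCC: Simplified/Traditional conversion tool
  • zhconv: Simplified/Traditional conversion tool (use the zh-hans option so that only single characters are converted and region-specific words are left alone)
  • Jiayan (甲言): an NLP toolkit for Classical Chinese, with tools for word segmentation, part-of-speech tagging, sentence segmentation, and punctuation
  • UD-Kanbun: word segmentation, part-of-speech tagging, and syntactic parsing for Classical Chinese
  • daizhigev20: the Daizhige corpus of ancient texts, v2.0
  • chinese-poetry: the most comprehensive database of classical Chinese poetry and literary collections
  • Classical-Chinese: a parallel corpus of Classical Chinese and modern Chinese translations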
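The zhconv note in the list above recommends character-level conversion. As a minimal sketch of what that means, the snippet below maps Traditional characters to Simplified ones one character at a time with `str.translate`. The four-entry table is a tiny illustrative sample, not a usable mapping; the tools listed above cover the full character set.

```python
# Tiny illustrative sample of a Traditional -> Simplified character table.
# Assumption: a real table (as shipped with OpenCC or zhconv) is far larger.
T2S_SAMPLE = str.maketrans({"學": "学", "時": "时", "習": "习", "說": "说"})


def to_simplified(text: str) -> str:
    """Convert character by character; unmapped characters pass through unchanged."""
    return text.translate(T2S_SAMPLE)


print(to_simplified("學而時習之不亦說乎"))  # 学而时习之不亦说乎
```

With zhconv installed, the equivalent call is likely `zhconv.convert(text, "zh-hans")`, following the option named in the list.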

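About

This repository aims to collect open-source Classical Chinese NLP models found on the Internet. Copyright belongs to the original authors. Contributions of further resources are welcome. For questions, open a discussion in the Issues section or email ethanyt at qq.com.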
