SimpleChinese2 集成了许多基本的中文NLP功能,使基于 Python 的中文文字处理和信息提取变得简单方便。

Overview

SimpleChinese2

SimpleChinese2 集成了许多基本的中文NLP功能,使基于 Python 的中文文字处理和信息提取变得简单方便。

声明

本项目是为方便个人工作所创建的,仅有部分代码原创。包括分词、词云在内的诸多功能来自于其他项目,并非本人所写,如遇问题,请至原项目链接下提问,谢谢!

安装

pip install -U simplechinese==0.2.8

如从 git 上 clone,需要从以下地址下载词向量文件:

https://drive.google.com/file/d/1ltyiTHZk8kIBYQGbZS9GoO_DwDOEWnL9/view?usp=sharing

并拷贝至"./simplechinese/data/"文件夹下

使用方法

import simplechinese as sc

1. 文字预处理

>> print(sc.only_digits(x)) # 仅保留数字 01234 >>> print(sc.only_zh(x)) # 仅保留中文 测试测试测试测试 >>> print(sc.only_en(x)) # 仅保留英文 TestING >>> print(sc.remove_space(x)) # 去除空格 测试测试,TestING;¥%&01234测试测试 >>> print(sc.remove_digits(x)) # 去除数字 测试测试,TestING ;¥%& 测试测试 >>> print(sc.remove_zh(x)) # 去除中文 ,TestING ;¥%& 01234 >>> print(sc.remove_en(x)) # 去除英文 测试测试, ;¥%& 01234测试测试 >>> print(sc.remove_punctuations(x)) # 去除标点符号 测试测试TestING 01234测试测试 >>> print(sc.toLower(x)) # 修改为全小写字母 测试测试,testing ;¥%& 01234测试测试 >>> print(sc.toUpper(x)) # 修改为全大写字母 测试测试,TESTING ;¥%& 01234测试测试 >>> x = "测试,TestING:12345@#【】+=-()。." >>> print(sc.punc_norm(x)) # 将中文标点符号转换成英文标点符号 测试,TestING:12345@#[]+=-().. >>> # y = fillna(df) # 将pandas.DataFrame中的N/A单元格填充为长度为0的str ">
>>> x = "测试测试,TestING    ;¥%& 01234测试测试"

>>> print(sc.only_digits(x))         # 仅保留数字
01234

>>> print(sc.only_zh(x))             # 仅保留中文
测试测试测试测试

>>> print(sc.only_en(x))             # 仅保留英文
TestING

>>> print(sc.remove_space(x))        # 去除空格
测试测试,TestING;¥%&01234测试测试

>>> print(sc.remove_digits(x))       # 去除数字
测试测试,TestING    ;¥%& 测试测试

>>> print(sc.remove_zh(x))           # 去除中文
,TestING    ;¥%& 01234

>>> print(sc.remove_en(x))           # 去除英文
测试测试,    ;¥%& 01234测试测试

>>> print(sc.remove_punctuations(x)) # 去除标点符号
测试测试TestING     01234测试测试

>>> print(sc.toLower(x))             # 修改为全小写字母
测试测试,testing    ;¥%& 01234测试测试

>>> print(sc.toUpper(x))             # 修改为全大写字母
测试测试,TESTING    ;¥%& 01234测试测试

>>> x = "测试,TestING:12345@#【】+=-()。."
>>> print(sc.punc_norm(x))           # 将中文标点符号转换成英文标点符号
测试,TestING:12345@#[]+=-()..

>>> # y = fillna(df) # 将pandas.DataFrame中的N/A单元格填充为长度为0的str

2. 基础NLP信息提取功能

该部分中,分词功能使用 jieba 实现,源码请参考:https://github.com/fxsjy/jieba

同/近义词查找功能复用了 synonyms 中的词向量数据文件,源码请参考:https://github.com/chatopera/Synonyms 但有所改动,改动如下

  1. 由于 pip 上传文件限制,synonyms 需要用户在完成 pip 安装后再下载词向量文件,国内下载需要设置镜像地址或使用特殊手段,有所不便。因此此处将词向量用 float16 表示,并使用 pca 降维至 64 维。总体效果差别不大,如果在意,请直接安装 synonyms 处理同/近义词查找任务。

  2. 原项目通过构建 KDTree 实现快速查找,但比较相似度是使用 cosine similarity,而 KDTree (sklearn) 本身不支持通过 cosine similarity 构建。因此原项目使用欧式距离构建树,导致输出结果有部分顺序混乱。为修复该问题,本项目将词向量归一化后再构建 KDTree,使得向量间的 cosine similarity 与欧式距离(即割线距离)正相关。具体推导可参考下文:https://stackoverflow.com/questions/34144632/using-cosine-distance-with-scikit-learn-kneighborsclassifier

  3. 原项目中未设置缓存上限,本项目中仅保留最近10000次查找记录。

x = "今天是我参加工作的第1天,我花了23.33元买了写零食犒劳一下自己。"
print(sc.extract_nums(x))              # 提取数字信息
[1.0, 23.33]

# mode: 0: No single character words. The words may be overlapped.
#       1: Have single character words. The words may be overlapped.
#       2: No single character words. The words are not overlapped.
#       3: Have single character words. The words are not overlapped.
#       4: Only single characters.
print(sc.extract_words(x, mode=0))      # 分词
['今天', '参加', '工作', '我花', '23.33', '零食', '犒劳', '一下', '自己']

a = "做人真的好难"
b = "做人实在太难了"
print(sc.string_distance(a,b))  # 编辑距离
0.46153846153846156

x = "种族歧视"
print(sc.find_synonyms(x, n=3))  # 同/近义词
[('种族歧视', 1.0), ('种族主义', 0.84619140625), ('歧视', 0.76416015625)]

3. 繁体简体转换

该部分使用 chinese_converter 实现,源码请参考:https://github.com/zachary822/chinese-converter

>> print(sc.to_traditional(x)) # 转换为繁体 烏龜測試123 >>> x = "烏龜測試123" >>> print(sc.to_simplified(x)) # 转换为简体 乌龟测试123 ">
>>> x = "乌龟测试123"
>>> print(sc.to_traditional(x))  # 转换为繁体
烏龜測試123

>>> x = "烏龜測試123"
>>> print(sc.to_simplified(x))   # 转换为简体
乌龟测试123

4. 特征提取和向量化

5. 词云和可视化

TODO:

  1. 句子向量化及句子相似度
  2. 其他特征提取相关工具
Owner
Ming
惊了
Ming
Dust model dichotomous performance analysis

Dust-model-dichotomous-performance-analysis Using a collated dataset of 90,000 dust point source observations from 9 drylands studies from around the

1 Dec 17, 2021
CYGNUS, the Cynical AI, combines snarky responses with uncanny aggression.

New & (hopefully) Improved CYGNUS with several API updates, user updates, and online/offline operations added!!!

Simran Farrukh 0 Mar 28, 2022
Code for the Findings of NAACL 2022(Long Paper): AdapterBias: Parameter-efficient Token-dependent Representation Shift for Adapters in NLP Tasks

AdapterBias: Parameter-efficient Token-dependent Representation Shift for Adapters in NLP Tasks arXiv link: upcoming To be published in Findings of NA

Allen 16 Nov 12, 2022
A Python script that compares files in directories

compare-files A Python script that compares files in different directories, this is similar to the command filecmp.cmp(f1, f2). I made this script in

Colvin 1 Oct 15, 2021
jel - Japanese Entity Linker - is Bi-encoder based entity linker for japanese.

jel: Japanese Entity Linker jel - Japanese Entity Linker - is Bi-encoder based entity linker for japanese. Usage Currently, link and question methods

izuna385 10 Jan 06, 2023
A deep learning-based translation library built on Huggingface transformers

DL Translate A deep learning-based translation library built on Huggingface transformers and Facebook's mBART-Large 💻 GitHub Repository 📚 Documentat

Xing Han Lu 244 Dec 30, 2022
Smart discord chatbot integrated with Dialogflow

academic-NLP-chatbot Smart discord chatbot integrated with Dialogflow to interact with students naturally and manage different classes in a school. De

Tom Huynh 5 Oct 24, 2022
Prompt-learning is the latest paradigm to adapt pre-trained language models (PLMs) to downstream NLP tasks

Prompt-learning is the latest paradigm to adapt pre-trained language models (PLMs) to downstream NLP tasks, which modifies the input text with a textual template and directly uses PLMs to conduct pre

THUNLP 2.3k Jan 08, 2023
This script just scrapes the most recent Nepali news from Kathmandu Post and notifies the user about current events at regular intervals.It sends out the most recent news at random!

Nepali-news-notifier This script just scrapes the most recent Nepali news from Kathmandu Post and notifies the user about current events at regular in

Sachit Yadav 1 Feb 11, 2022
Script to download some free japanese lessons in portuguse from NHK

Nihongo_nhk This is a script to download some free japanese lessons in portuguese from NHK. It can be executed by installing the packages with: pip in

Matheus Alves 2 Jan 06, 2022
Implementation of paper Does syntax matter? A strong baseline for Aspect-based Sentiment Analysis with RoBERTa.

RoBERTaABSA This repo contains the code for NAACL 2021 paper titled Does syntax matter? A strong baseline for Aspect-based Sentiment Analysis with RoB

106 Nov 28, 2022
This is a Prototype of an Ai ChatBot "Tea and Coffee Supplier" using python.

Ai-ChatBot-Python A chatbot is an intelligent system which can hold a conversation with a human using natural language in real time. Due to the rise o

1 Oct 30, 2021
Code associated with the Don't Stop Pretraining ACL 2020 paper

dont-stop-pretraining Code associated with the Don't Stop Pretraining ACL 2020 paper Citation @inproceedings{dontstoppretraining2020, author = {Suchi

AI2 449 Jan 04, 2023
An official repository for tutorials of Probabilistic Modelling and Reasoning (2021/2022) - a University of Edinburgh master's course.

PMR computer tutorials on HMMs (2021-2022) This is a repository for computer tutorials of Probabilistic Modelling and Reasoning (2021/2022) - a Univer

Vaidotas Šimkus 10 Dec 06, 2022
中文医疗信息处理基准CBLUE: A Chinese Biomedical LanguageUnderstanding Evaluation Benchmark

English | 中文说明 CBLUE AI (Artificial Intelligence) is playing an indispensabe role in the biomedical field, helping improve medical technology. For fur

452 Dec 30, 2022
2021语言与智能技术竞赛:机器阅读理解任务

LICS2021 MRC 1. 项目&任务介绍 本项目基于官方给定的baseline(DuReader-Checklist-BASELINE)进行二次改造,对整个代码框架做了简单的重构,对核心网络结构添加了注释,解耦了数据读取的模块,并添加了阈值确认的功能,一些小的细节也做了改进。 本次任务为202

roar 29 Dec 05, 2022
BERT has a Mouth, and It Must Speak: BERT as a Markov Random Field Language Model

BERT has a Mouth, and It Must Speak: BERT as a Markov Random Field Language Model

303 Dec 17, 2022
Code of paper: A Recurrent Vision-and-Language BERT for Navigation

Recurrent VLN-BERT Code of the Recurrent-VLN-BERT paper: A Recurrent Vision-and-Language BERT for Navigation Yicong Hong, Qi Wu, Yuankai Qi, Cristian

YicongHong 109 Dec 21, 2022
Unsupervised text tokenizer for Neural Network-based text generation.

SentencePiece SentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabu

Google 6.4k Jan 01, 2023
Final Project for the Intel AI Readiness Boot Camp NLP (Jan)

NLP Boot Camp (Jan) Synopsis Full Name: Prameya Mohanty Name of your School: Delhi Public School, Rourkela Class: VIII Title of the Project: iTransect

TheCodingHub 1 Feb 01, 2022