Python library for processing Chinese text

Overview

SnowNLP: Simplified Chinese Text Processing

SnowNLP是一个python写的类库,可以方便的处理中文文本内容,是受到了TextBlob的启发而写的,由于现在大部分的自然语言处理库基本都是针对英文的,于是写了一个方便处理中文的类库,并且和TextBlob不同的是,这里没有用NLTK,所有的算法都是自己实现的,并且自带了一些训练好的字典。注意本程序都是处理的unicode编码,所以使用时请自行decode成unicode。

from snownlp import SnowNLP

s = SnowNLP(u'这个东西真心很赞')

s.words         # [u'这个', u'东西', u'真心',
                #  u'很', u'赞']

s.tags          # [(u'这个', u'r'), (u'东西', u'n'),
                #  (u'真心', u'd'), (u'很', u'd'),
                #  (u'赞', u'Vg')]

s.sentiments    # 0.9769663402895832 positive的概率

s.pinyin        # [u'zhe', u'ge', u'dong', u'xi',
                #  u'zhen', u'xin', u'hen', u'zan']

s = SnowNLP(u'「繁體字」「繁體中文」的叫法在臺灣亦很常見。')

s.han           # u'「繁体字」「繁体中文」的叫法
                # 在台湾亦很常见。'

text = u'''
自然语言处理是计算机科学领域与人工智能领域中的一个重要方向。
它研究能实现人与计算机之间用自然语言进行有效通信的各种理论和方法。
自然语言处理是一门融语言学、计算机科学、数学于一体的科学。
因此,这一领域的研究将涉及自然语言,即人们日常使用的语言,
所以它与语言学的研究有着密切的联系,但又有重要的区别。
自然语言处理并不是一般地研究自然语言,
而在于研制能有效地实现自然语言通信的计算机系统,
特别是其中的软件系统。因而它是计算机科学的一部分。
'''

s = SnowNLP(text)

s.keywords(3)	# [u'语言', u'自然', u'计算机']

s.summary(3)	# [u'因而它是计算机科学的一部分',
                #  u'自然语言处理是一门融语言学、计算机科学、
				#	 数学于一体的科学',
				#  u'自然语言处理是计算机科学领域与人工智能
				#	 领域中的一个重要方向']
s.sentences

s = SnowNLP([[u'这篇', u'文章'],
             [u'那篇', u'论文'],
             [u'这个']])
s.tf
s.idf
s.sim([u'文章'])# [0.3756070762985226, 0, 0]

Features

  • 中文分词(Character-Based Generative Model
  • 词性标注(TnT 3-gram 隐马)
  • 情感分析(现在训练数据主要是买卖东西时的评价,所以对其他的一些可能效果不是很好,待解决)
  • 文本分类(Naive Bayes)
  • 转换成拼音(Trie树实现的最大匹配)
  • 繁体转简体(Trie树实现的最大匹配)
  • 提取文本关键词(TextRank算法)
  • 提取文本摘要(TextRank算法)
  • tf,idf
  • Tokenization(分割成句子)
  • 文本相似(BM25
  • 支持python3(感谢erning

Get It now

$ pip install snownlp

关于训练

现在提供训练的包括分词,词性标注,情感分析,而且都提供了我用来训练的原始文件 以分词为例 分词在snownlp/seg目录下

from snownlp import seg
seg.train('data.txt')
seg.save('seg.marshal')
# from snownlp import tag
# tag.train('199801.txt')
# tag.save('tag.marshal')
# from snownlp import sentiment
# sentiment.train('neg.txt', 'pos.txt')
# sentiment.save('sentiment.marshal')

这样训练好的文件就存储为seg.marshal了,之后修改snownlp/seg/__init__.py里的data_path指向刚训练好的文件即可

License

MIT licensed.

Owner
Rui Wang
Rui Wang
This repository contains the code for "Generating Datasets with Pretrained Language Models".

Datasets from Instructions (DINO 🦕 ) This repository contains the code for Generating Datasets with Pretrained Language Models. The paper introduces

Timo Schick 154 Jan 01, 2023
History Aware Multimodal Transformer for Vision-and-Language Navigation

History Aware Multimodal Transformer for Vision-and-Language Navigation This repository is the official implementation of History Aware Multimodal Tra

Shizhe Chen 46 Nov 23, 2022
초성 해석기 based on ko-BART

초성 해석기 개요 한국어 초성만으로 이루어진 문장을 입력하면, 완성된 문장을 예측하는 초성 해석기입니다. 초성: ㄴㄴ ㄴㄹ ㅈㅇㅎ 예측 문장: 나는 너를 좋아해 모델 모델은 SKT-AI에서 공개한 Ko-BART를 이용합니다. 데이터 문장 단위로 이루어진 아무 코퍼스나

Dawoon Jung 29 Oct 28, 2022
Just a basic Telegram AI chat bot written in Python using Pyrogram.

Nikko ChatBot Just a basic Telegram AI chat bot written in Python using Pyrogram. Requirements Python 3.7 or higher. A bot token. Installation $ https

ʀᴇxɪɴᴀᴢᴏʀ 2 Oct 21, 2022
A benchmark for evaluation and comparison of various NLP tasks in Persian language.

Persian NLP Benchmark The repository aims to track existing natural language processing models and evaluate their performance on well-known datasets.

Mofid AI 68 Dec 19, 2022
Deep Learning Topics with Computer Vision & NLP

Deep learning Udacity Course Deep Learning Topics with Computer Vision & NLP for the AWS Machine Learning Engineer Nanodegree Program Tasks are mostly

Simona Mircheva 1 Jan 20, 2022
✨Rubrix is a production-ready Python framework for exploring, annotating, and managing data in NLP projects.

✨A Python framework to explore, label, and monitor data for NLP projects

Recognai 1.5k Jan 02, 2023
📜 GPT-2 Rhyming Limerick and Haiku models using data augmentation

Well-formed Limericks and Haikus with GPT2 📜 GPT-2 Rhyming Limerick and Haiku models using data augmentation In collaboration with Matthew Korahais &

Bardia Shahrestani 2 May 26, 2022
A Multilingual Latent Dirichlet Allocation (LDA) Pipeline with Stop Words Removal, n-gram features, and Inverse Stemming, in Python.

Multilingual Latent Dirichlet Allocation (LDA) Pipeline This project is for text clustering using the Latent Dirichlet Allocation (LDA) algorithm. It

Artifici Online Services inc. 74 Oct 07, 2022
The SVO-Probes Dataset for Verb Understanding

The SVO-Probes Dataset for Verb Understanding This repository contains the SVO-Probes benchmark designed to probe for Subject, Verb, and Object unders

DeepMind 20 Nov 30, 2022
Script to generate VAD dataset used in Asteroid recipe

About the dataset LibriVAD is an open source dataset for voice activity detection in noisy environments. It is derived from LibriSpeech signals (clean

11 Sep 15, 2022
Chinese real time voice cloning (VC) and Chinese text to speech (TTS).

Chinese real time voice cloning (VC) and Chinese text to speech (TTS). 好用的中文语音克隆兼中文语音合成系统,包含语音编码器、语音合成器、声码器和可视化模块。

Kuang Dada 6 Nov 08, 2022
Azure Text-to-speech service for Home Assistant

Azure Text-to-speech service for Home Assistant The Azure text-to-speech platform uses online Azure Text-to-Speech cognitive service to read a text wi

Yassine Selmi 2 Aug 06, 2022
Malware-Related Sentence Classification

Malware-Related Sentence Classification This repo contains the code for the ICTAI 2021 paper "Enrichment of Features for Malware-Related Sentence Clas

Chau Nguyen 1 Mar 26, 2022
Jupyter Notebook tutorials on solving real-world problems with Machine Learning & Deep Learning using PyTorch

Jupyter Notebook tutorials on solving real-world problems with Machine Learning & Deep Learning using PyTorch. Topics: Face detection with Detectron 2, Time Series anomaly detection with LSTM Autoenc

Venelin Valkov 1.8k Dec 31, 2022
Malaya-Speech is a Speech-Toolkit library for bahasa Malaysia, powered by Deep Learning Tensorflow.

Malaya-Speech is a Speech-Toolkit library for bahasa Malaysia, powered by Deep Learning Tensorflow. Documentation Proper documentation is available at

HUSEIN ZOLKEPLI 151 Jan 05, 2023
test

Lidar-data-decode In this project, you can decode your lidar data frame(pcap file) and make your own datasets(test dataset) in Windows without any hug

46 Dec 05, 2022
Synthetic data for the people.

zpy: Synthetic data in Blender. Website • Install • Docs • Examples • CLI • Contribute • Licence Abstract Collecting, labeling, and cleaning data for

Zumo Labs 253 Dec 21, 2022
Unifying Cross-Lingual Semantic Role Labeling with Heterogeneous Linguistic Resources (NAACL-2021).

Unifying Cross-Lingual Semantic Role Labeling with Heterogeneous Linguistic Resources Description This is the repository for the paper Unifying Cross-

Sapienza NLP group 16 Sep 09, 2022
Twitter-Sentiment-Analysis - Analysis of twitter posts' positive and negative score.

Twitter-Sentiment-Analysis The hands-on project is in Python 3 Programming class offered by University of Michigan via Coursera. The task is to build

Eszter Pai 1 Jan 03, 2022