Chinese segmentation library

Last update: Jun 28, 2022

What is loso?

loso is a Chinese segmentation system written in Python. It was developed by Victor Lin for Plurk Inc.

Copyright & License

Setup loso

To install loso, clone the repo and run the following commands:

cd loso
python setup.py develop

You also need a running Redis database to store the lexicon, and you need to copy the configuration template and modify it:

cp default.yaml myconf.yaml
vim myconf.yaml

To use your configuration, set the LOSO_CONFIG_FILE environment variable to the path of your file. For example:

LOSO_CONFIG_FILE=myconf.yaml python setup.py serve
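
The effect of the variable can be pictured with a short sketch. How loso itself reads LOSO_CONFIG_FILE is not described here, so the helper name `resolve_config_path` and the fallback to `default.yaml` are assumptions made for illustration only.

```python
import os


def resolve_config_path(environ=None, fallback="default.yaml"):
    """Return the path named by LOSO_CONFIG_FILE, else the fallback.

    Hypothetical helper: loso's real loading code may differ.
    """
    env = os.environ if environ is None else environ
    # An unset or empty variable falls back to the template file name.
    return env.get("LOSO_CONFIG_FILE") or fallback


print(resolve_config_path({"LOSO_CONFIG_FILE": "myconf.yaml"}))
print(resolve_config_path({}))
```

Passing a plain dict instead of `os.environ` keeps the sketch easy to try without changing the real environment.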

Use loso

Loso determines segmentation from a lexicon database, using an algorithm based on the Hidden Markov Model. The service therefore cannot be used until a lexicon database has been built.
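
To see why a lexicon is needed first, here is a toy sketch. It is not loso's Hidden Markov Model: it swaps in a much simpler unigram dynamic-programming segmenter, and the function name `segment`, the `toy_lexicon` counts, and the smoothing value are all made up for illustration.

```python
import math


def segment(text, lexicon, max_len=4):
    """Split text into the term sequence with the highest total log-probability.

    lexicon maps term -> count. Unseen single characters get a small
    smoothing count so that some segmentation always exists.
    """
    total = float(sum(lexicon.values())) or 1.0
    n = len(text)
    # best[i] = (score, start) for the best segmentation of text[:i]
    best = [(0.0, 0)] + [(-math.inf, 0)] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            term = text[start:end]
            count = lexicon.get(term)
            if count is None:
                if len(term) > 1:
                    continue  # multi-character terms must be in the lexicon
                count = 0.5  # assumed smoothing for unseen single characters
            score = best[start][0] + math.log(count / total)
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk the back-pointers to recover the terms.
    terms, end = [], n
    while end > 0:
        start = best[end][1]
        terms.append(text[start:end])
        end = start
    return terms[::-1]


# Hypothetical counts, standing in for what a fed lexicon might hold.
toy_lexicon = {"太空梭": 5, "太空": 8, "發射": 6, "影片": 7, "梭": 1}
print(" ".join(segment("太空梭發射影片", toy_lexicon)))  # 太空梭 發射 影片
print(" ".join(segment("太空梭", {})))  # 太 空 梭
```

With an empty lexicon the sketch can only fall back to single characters, which mirrors why the database has to be fed before segmentation is useful.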

To feed a text file to the database, run:

python setup.py feed -f /home/victorlin/plurk_src/realtime_search/word_segment/sample_data/sample_tr_ch

To clear the database, run:

python setup.py reset

To test term splitting interactively, run:

python setup.py interact

For example

Text: 留下鉅細靡遺的太空梭發射影片，供世人回味
....
留下 鉅細靡遺 的 太空梭 發射 影片 供 世人 回味

To expose segmentation as an XML-RPC service, run:

python setup.py serve

The following simple Python program shows how to call it:

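# -*- coding: utf-8 -*-
# Python 2 client; the encoding declaration is needed for the non-ASCII literal.
import xmlrpclib

# Connect to the service started by "python setup.py serve".
proxy = xmlrpclib.ServerProxy("http://localhost:5566/")

# splitTerms returns the segmented terms as a list of strings.
terms = proxy.splitTerms(u'留下鉅細靡遺的太空梭發射影片，供世人回味')
print ' '.join(terms)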

The output should be:

留下 鉅細靡遺 的 太空梭 發射 影片 供 世人 回味
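
For readers on Python 3, the same call can be sketched with the standard `xmlrpc.client` module. This assumes the service behaves as in the example above (same port, same `splitTerms` method); the wrapper name `segment_via_service` is hypothetical and not part of loso.

```python
import xmlrpc.client


def segment_via_service(text, proxy):
    """Ask the service to split text and return the terms joined by spaces."""
    return " ".join(proxy.splitTerms(text))


if __name__ == "__main__":
    # Assumes a loso server is listening on the port used in the example above.
    proxy = xmlrpc.client.ServerProxy("http://localhost:5566/")
    try:
        print(segment_via_service("留下鉅細靡遺的太空梭發射影片，供世人回味", proxy))
    except (OSError, xmlrpc.client.Error) as exc:
        # Raised when no server is running or the server reports a fault.
        print("loso service not reachable or returned a fault:", exc)
```

Taking the proxy as a parameter keeps the wrapper easy to exercise with a stand-in object when no server is running.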

Owner

Fang-Pen Lin
