Japanese NLP Library

Contents

1 Requirements
- 1.1 Links
- 1.2 Install
- 1.3 History
2 Libraries and Modules
3 Edict Japanese Dictionary Search with Example sentences
4 Sentiment Analysis Japanese Text
5 Contacts

1 Requirements

Third Party Dependencies
- Cabocha Japanese Morphological parser http://sourceforge.net/projects/cabocha/
Python Dependencies
- Python 2.6.* or above

1.1 `Links`

All code at jProcessing Repo GitHub

Documentation and HomePage and Sphinx

PyPi Python Package

git clone git@github.com:kevincobain2000/jProcessing.git

1.2 `Install`

In Terminal

bash$ python setup.py install

1.3 History

0.2
- Sentiment Analysis of Japanese Text
0.1
- Morphologically Tokenize Japanese Sentence
- Kanji / Hiragana / Katakana to Romaji Converter
- Edict Dictionary Search - borrowed
- Edict Examples Search - incomplete
- Sentence Similarity between two JP Sentences
- Run Cabocha (ISO-8859-1 configured) in Python.
- Longest Common String between Sentences
- Kanji to Katakana Pronunciation
- Hiragana, Katakana Chart Parser

2 Libraries and Modules

2.1 Tokenize `jTokenize.py`

In Python

>>> from jNlp.jTokenize import jTokenize, jReads
>>> input_sentence = u'私は彼を５日前、つまりこの前の金曜日に駅で見かけた'
>>> list_of_tokens = jTokenize(input_sentence)
>>> print list_of_tokens
>>> print '--'.join(list_of_tokens).encode('utf-8')

Returns:

... [u'\u79c1', u'\u306f', u'\u5f7c', u'\u3092', u'\uff15'...]
... 私--は--彼--を--５--日--前--、--つまり--この--前--の--金曜日--に--駅--で--見かけ--た

Katakana Pronunciation:

>>> print '--'.join(jReads(input_sentence)).encode('utf-8')
... ワタシ--ハ--カレ--ヲ--ゴ--ニチ--マエ--、--ツマリ--コノ--マエ--ノ--キンヨウビ--ニ--エキ--デ--ミカケ--タ

2.2 Cabocha `jCabocha.py`

Run Cabocha configured with its original EUC-JP or ISO-8859-1 encoding from UTF-8 Python.

If Cabocha is configured for UTF-8, see http://nltk.googlecode.com/svn/trunk/doc/book-jp/ch12.html#cabocha instead.

>>> from jNlp.jCabocha import cabocha
>>> print cabocha(input_sentence).encode('utf-8')

Output:

<sentence>
 <chunk id="0" link="8" rel="D" score="0.971639" head="0" func="1">
  <tok id="0" read="ワタシ" base="私" pos="名詞-代名詞-一般" ctype="" cform="" ne="O">私</tok>
  <tok id="1" read="ハ" base="は" pos="助詞-係助詞" ctype="" cform="" ne="O">は</tok>
 </chunk>
 <chunk id="1" link="2" rel="D" score="0.488672" head="2" func="3">
  <tok id="2" read="カレ" base="彼" pos="名詞-代名詞-一般" ctype="" cform="" ne="O">彼</tok>
  <tok id="3" read="ヲ" base="を" pos="助詞-格助詞-一般" ctype="" cform="" ne="O">を</tok>
 </chunk>
 <chunk id="2" link="8" rel="D" score="2.25834" head="6" func="6">
  <tok id="4" read="ゴ" base="５" pos="名詞-数" ctype="" cform="" ne="B-DATE">５</tok>
  <tok id="5" read="ニチ" base="日" pos="名詞-接尾-助数詞" ctype="" cform="" ne="I-DATE">日</tok>
  <tok id="6" read="マエ" base="前" pos="名詞-副詞可能" ctype="" cform="" ne="I-DATE">前</tok>
  <tok id="7" read="、" base="、" pos="記号-読点" ctype="" cform="" ne="O">、</tok>
 </chunk>
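
XML in this shape can be read with the standard library. The following is a minimal sketch, written for Python 3. The sample string is a shortened, well-formed excerpt, and `chunk_tokens` is a hypothetical helper, not part of jNlp.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed excerpt in the same shape as the Cabocha output
# (illustrative only; real output contains more chunks).
SAMPLE = """<sentence>
 <chunk id="0" link="8" rel="D" score="0.971639" head="0" func="1">
  <tok id="0" read="ワタシ" base="私" pos="名詞-代名詞-一般" ctype="" cform="" ne="O">私</tok>
  <tok id="1" read="ハ" base="は" pos="助詞-係助詞" ctype="" cform="" ne="O">は</tok>
 </chunk>
</sentence>"""


def chunk_tokens(xml_text):
    """Return one list of (surface, reading, pos) tuples per chunk."""
    root = ET.fromstring(xml_text)
    return [
        [(tok.text, tok.get("read"), tok.get("pos")) for tok in chunk.iter("tok")]
        for chunk in root.iter("chunk")
    ]


if __name__ == "__main__":
    for chunk in chunk_tokens(SAMPLE):
        print(" ".join(surface for surface, _, _ in chunk))
```

Grouping tokens by chunk keeps the dependency units intact, which is usually what the `link` attribute is later used with.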

2.3 Kanji / Katakana / Hiragana to Tokenized Romaji `jConvert.py`

Uses data/katakanaChart.txt and parses the chart. See katakanaChart.

>>> from jNlp.jConvert import *
>>> input_sentence = u'気象庁が２１日午前４時４８分、発表した天気概況によると、'
>>> print ' '.join(tokenizedRomaji(input_sentence))
>>> print tokenizedRomaji(input_sentence)

...kisyoutyou ga ni ichi nichi gozen yon ji yon hachi hun  hapyou si ta tenki gaikyou ni yoru to
...[u'kisyoutyou', u'ga', u'ni', u'ichi', u'nichi', u'gozen',...]
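
A chart-driven conversion can be sketched roughly as below. This is written for Python 3 and is not the code in `jConvert.py`: the chart layout (one kana sequence and its romaji per line) and the names `parse_chart` and `to_romaji` are assumptions, and the real data/katakanaChart.txt may be formatted differently.

```python
# Hypothetical mini chart: "<katakana> <romaji>" per line.
CHART_TEXT = """\
キ ki
シ si
ショ syo
チ ti
チョ tyo
ウ u
ガ ga
"""


def parse_chart(text):
    """Build a kana -> romaji dictionary from the chart text."""
    chart = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            chart[parts[0]] = parts[1]
    return chart


def to_romaji(kana, chart):
    """Convert kana to romaji, preferring the longest chart entry at each position."""
    longest = max(len(key) for key in chart)
    out, i = [], 0
    while i < len(kana):
        for size in range(longest, 0, -1):
            piece = kana[i:i + size]
            if piece in chart:
                out.append(chart[piece])
                i += len(piece)
                break
        else:
            out.append(kana[i])  # character not in the chart: pass through
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(to_romaji("キショウチョウ", parse_chart(CHART_TEXT)))
```

Trying the longest entry first lets two-character sequences such as ショ win over the single character シ.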

katakanaChart.txt

katakanaChartFile and hiraganaChartFile

2.4 Longest Common String Japanese `jProcessing.py`

On English Strings

>>> from jNlp.jProcessing import long_substr
>>> a = 'Once upon a time in Italy'
>>> b = 'There was a time in America'
>>> print long_substr(a, b)

Output

...a time in

On Japanese Strings

>>> a = u'これでアナタも冷え知らず'
>>> b = u'これでア冷え知らずナタも'
>>> print long_substr(a, b).encode('utf-8')

Output

...冷え知らず
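
For reference, the longest common substring of two strings can be found with dynamic programming. The sketch below is written for Python 3 and is not necessarily how `long_substr` is implemented; `longest_common_substring` is a hypothetical name.

```python
def longest_common_substring(a, b):
    """Return the longest contiguous run of characters shared by a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)  # match lengths for the previous row of a
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the match that ended at the previous characters.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


if __name__ == "__main__":
    print(longest_common_substring("これでアナタも冷え知らず", "これでア冷え知らずナタも"))
```

Because Python 3 strings are sequences of code points, the same function works on English and Japanese input without any encoding step.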

2.5 Similarity between two sentences `jProcessing.py`

Uses MinHash to estimate the overlap between the two sentences: http://en.wikipedia.org/wiki/MinHash

English Strings:

>>> from jNlp.jProcessing import Similarities
>>> s = Similarities()
>>> a = 'There was'
>>> b = 'There is'
>>> print s.minhash(a,b)
...0.444444444444

Japanese Strings:

>>> from jNlp.jProcessing import *
>>> a = u'これは何ですか？'
>>> b = u'これはわからないです'
>>> print s.minhash(' '.join(jTokenize(a)), ' '.join(jTokenize(b)))
...0.210526315789
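
The idea behind MinHash can be sketched as follows. This is a generic illustration written for Python 3, not the code behind `Similarities.minhash`, so it will not reproduce the scores shown here. The seeded MD5 hashing, the `num_hashes` parameter, and the function names are all assumptions.

```python
import hashlib


def _seeded_hash(token, seed):
    """Deterministic hash of a token under a given seed."""
    data = ("%d:%s" % (seed, token)).encode("utf-8")
    return int(hashlib.md5(data).hexdigest(), 16)


def minhash_similarity(tokens_a, tokens_b, num_hashes=128):
    """Estimate the Jaccard overlap of two token collections.

    For each seeded hash function, the two sets share the same minimum
    hash value with probability equal to their Jaccard similarity.
    """
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a or not set_b:
        return 0.0
    matches = 0
    for seed in range(num_hashes):
        min_a = min(_seeded_hash(t, seed) for t in set_a)
        min_b = min(_seeded_hash(t, seed) for t in set_b)
        if min_a == min_b:
            matches += 1
    return matches / float(num_hashes)


if __name__ == "__main__":
    a = "これ は 何 です か".split()
    b = "これ は わから ない です".split()
    print(minhash_similarity(a, b))
```

More hash functions give a steadier estimate at the cost of more hashing work.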

3 Edict Japanese Dictionary Search with Example sentences

3.2 Edict dictionary and example sentences parser.

This package uses the EDICT and KANJIDIC dictionary files. These files are the property of the Electronic Dictionary Research and Development Group, and are used in conformance with the Group's licence.

- Edict parser by Paul Goins: see edict_search.py
- Edict example sentences, parsed by query, by Pulkit Kathuria: see edict_examples.py
- Edict examples pickle files are provided, but the latest example files can be downloaded from the links provided.

3.3 Charset

Two files are used:

- A UTF-8 example sentences file, needed only if you are not using src/jNlp/data/edict_examples
- An ISO-8859-1 edict_dictionary file

To convert an EUCJP or ISO-8859-1 file to UTF-8:
```
iconv -f EUCJP -t UTF-8 path/to/edict_examples > path/to/save_with_utf-8
```

Example sentences are output for a Japanese query, and only for ambiguous words.
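
If `iconv` is not available, the same conversion can be done from Python. This is a minimal sketch assuming the source file is EUC-JP; `convert_to_utf8` is a hypothetical helper, not part of jNlp.

```python
import io


def convert_to_utf8(src_path, dst_path, src_encoding="euc_jp"):
    """Re-encode a text file to UTF-8, reading it line by line."""
    with io.open(src_path, "r", encoding=src_encoding) as src, \
            io.open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(line)
```

For example, `convert_to_utf8('path/to/edict_examples', 'path/to/save_with_utf-8')` mirrors the `iconv` command. Pass `src_encoding='iso-8859-1'` for a file in that encoding.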

3.4 Links

Latest Dictionary files can be downloaded here

3.5 `edict_search.py`

author:	Paul Goins (license included)
linkToOriginal:

For all entries of sense definitions

>>> from jNlp.edict_search import *
>>> query = u'認める'
>>> edict_path = 'src/jNlp/data/edict-yy-mm-dd'
>>> kp = Parser(edict_path)
>>> for i, entry in enumerate(kp.search(query)):
...     print entry.to_string().encode('utf-8')
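
For orientation, an EDICT entry is a single line of the form `WORD [READING] /gloss/gloss/.../`. The sketch below, written for Python 3, splits such a line into its parts. It is not Paul Goins's parser: the regular expression, the `parse_edict_line` name, and the sample line are illustrative assumptions.

```python
import re

# WORD, optional [READING], then slash-delimited glosses.
EDICT_LINE = re.compile(
    r"^(?P<word>\S+)(?:\s+\[(?P<reading>[^\]]+)\])?\s+/(?P<glosses>.*)/\s*$"
)


def parse_edict_line(line):
    """Split one EDICT-style line into word, reading, and glosses."""
    match = EDICT_LINE.match(line)
    if match is None:
        return None
    glosses = [g for g in match.group("glosses").split("/") if g]
    return {
        "word": match.group("word"),
        "reading": match.group("reading"),  # None for kana-only entries
        "glosses": glosses,
    }


if __name__ == "__main__":
    # Illustrative line, not copied from the dictionary file.
    sample = "認める [みとめる] /(v1,vt) (1) to recognize/(2) to observe/(3) to admit/"
    print(parse_edict_line(sample))
```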

3.6 `edict_examples.py`

Note:	Only outputs example sentences for ambiguous words (if the word has one or more senses)
author:	Pulkit Kathuria

>>> from jNlp.edict_examples import *
>>> query = u'認める'
>>> edict_path = 'src/jNlp/data/edict-yy-mm-dd'
>>> edict_examples_path = 'src/jNlp/data/edict_examples'
>>> search_with_example(edict_path, edict_examples_path, query)

Output

認める

Sense (1) to recognize;
  EX:01 我々は彼の才能を*認*めている。We appreciate his talent.

Sense (2) to observe;
  EX:01 ｘ線写真で異状が*認*められます。We have detected an abnormality on your x-ray.

Sense (3) to admit;
  EX:01 母は私の計画をよいと*認*めた。Mother approved my plan.
  EX:02 母は決して私の結婚を*認*めないだろう。Mother will never approve of my marriage.
  EX:03 父は決して私の結婚を*認*めないだろう。Father will never approve of my marriage.
  EX:04 彼は女性の喫煙をいいものだと*認*めない。He doesn't approve of women smoking.
  ...

4 Sentiment Analysis Japanese Text

This section covers sentiment analysis of Japanese text using Word Sense Disambiguation, Wordnet-jp (Japanese WordNet, file name wnjpn-all.tab), and SentiWordnet (English SentiWordNet, file name SentiWordNet_3.*.txt).

4.1 Wordnet files download links

4.2 How to Use

The following classifier is a baseline. It maps Japanese words to English through Wordnet and classifies the text by polarity score using SentiWordnet.

- All parts of speech are included (adnouns, nouns, verbs, ...)
- No WSD module is applied to the Japanese sentence
- Each word is scored using its common sense

>>> from jNlp.jSentiments import *
>>> jp_wn = '../../../../data/wnjpn-all.tab'
>>> en_swn = '../../../../data/SentiWordNet_3.0.0_20100908.txt'
>>> classifier = Sentiment()
>>> classifier.train(en_swn, jp_wn)
>>> text = u'監督、俳優、ストーリー、演出、全部最高！'
>>> print classifier.baseline(text)
...Pos Score = 0.625 Neg Score = 0.125
...Text is Positive
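
The lookup chain behind such a baseline can be sketched with toy data, written here for Python 3. The two dictionaries stand in for wnjpn-all.tab and SentiWordNet. Every entry is hypothetical except the 0.625 / 0.0 score for 全部, which is the one reported in 4.3. Summing the scores over tokens is also an assumption, not necessarily what `Sentiment.baseline` does.

```python
# Toy stand-ins for the real resources (synset ids are made up).
JP_TO_SYNSET = {"最高": "syn-best", "全部": "syn-all", "退屈": "syn-boring"}
SYNSET_SCORES = {
    "syn-best": (0.75, 0.0),    # (positive, negative), hypothetical
    "syn-all": (0.625, 0.0),    # score reported for 全部 in section 4.3
    "syn-boring": (0.0, 0.5),   # hypothetical
}


def polarity(tokens):
    """Sum positive and negative scores over tokens and label the text."""
    pos = neg = 0.0
    for tok in tokens:
        synset = JP_TO_SYNSET.get(tok)
        if synset in SYNSET_SCORES:  # words without a mapping are skipped
            p, n = SYNSET_SCORES[synset]
            pos += p
            neg += n
    if pos > neg:
        label = "Positive"
    elif neg > pos:
        label = "Negative"
    else:
        label = "Neutral"
    return pos, neg, label


if __name__ == "__main__":
    print(polarity(["全部", "最高"]))
```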

4.3 Japanese Word Polarity Score

>>> from jNlp.jSentiments import *
>>> jp_wn = '_dicts/wnjpn-all.tab' #path to Japanese Word Net
>>> en_swn = '_dicts/SentiWordNet_3.0.0_20100908.txt' #Path to SentiWordNet
>>> classifier = Sentiment()
>>> sentiwordnet, jpwordnet  = classifier.train(en_swn, jp_wn)
>>> positive_score = sentiwordnet[jpwordnet[u'全部']][0]
>>> negative_score = sentiwordnet[jpwordnet[u'全部']][1]
>>> print 'pos score = {0}, neg score = {1}'.format(positive_score, negative_score)
...pos score = 0.625, neg score = 0.0

5 Contacts

Author: pulkit[at]jaist.ac.jp [change at with @]

Owner

Pulkit Kathuria
