Japanese NLP Library

Overview


1   Requirements

1.1   Links

  • All code at the jProcessing repository on GitHub
  • PyPI Python package

git clone git@github.com:kevincobain2000/jProcessing.git

1.2   Install

In Terminal

bash$ python setup.py install

1.3   History

  • 0.2

    • Sentiment Analysis of Japanese Text
  • 0.1
    • Morphologically Tokenize Japanese Sentence
    • Kanji / Hiragana / Katakana to Romaji Converter
    • Edict Dictionary Search - borrowed
    • Edict Examples Search - incomplete
    • Sentence Similarity between two JP Sentences
    • Run Cabocha (ISO-8859-1 configured) in Python
    • Longest Common String between Sentences
    • Kanji to Katakana Pronunciation
    • Hiragana, Katakana Chart Parser

2   Libraries and Modules

2.1   Tokenize jTokenize.py

In Python

>>> from jNlp.jTokenize import jTokenize
>>> input_sentence = u'私は彼を5日前、つまりこの前の金曜日に駅で見かけた'
>>> list_of_tokens = jTokenize(input_sentence)
>>> print list_of_tokens
>>> print '--'.join(list_of_tokens).encode('utf-8')

Returns:

... [u'\u79c1', u'\u306f', u'\u5f7c', u'\u3092', u'\uff15'...]
... 私--は--彼--を--5--日--前--、--つまり--この--前--の--金曜日--に--駅--で--見かけ--た

Katakana Pronunciation:

>>> from jNlp.jTokenize import jReads
>>> print '--'.join(jReads(input_sentence)).encode('utf-8')
... ワタシ--ハ--カレ--ヲ--ゴ--ニチ--マエ--、--ツマリ--コノ--マエ--ノ--キンヨウビ--ニ--エキ--デ--ミカケ--タ

2.2   Cabocha jCabocha.py

Run Cabocha, configured with its original EUC-JP or ISO-8859-1 encoding, from UTF-8 Python:

>>> from jNlp.jCabocha import cabocha
>>> print cabocha(input_sentence).encode('utf-8')

Output:

">
<sentence>
 <chunk id="0" link="8" rel="D" score="0.971639" head="0" func="1">
  <tok id="0" read="ワタシ" base="" pos="名詞-代名詞-一般" ctype="" cform="" ne="O">私tok>
  <tok id="1" read="" base="" pos="助詞-係助詞" ctype="" cform="" ne="O">はtok>
 chunk>
 <chunk id="1" link="2" rel="D" score="0.488672" head="2" func="3">
  <tok id="2" read="カレ" base="" pos="名詞-代名詞-一般" ctype="" cform="" ne="O">彼tok>
  <tok id="3" read="" base="" pos="助詞-格助詞-一般" ctype="" cform="" ne="O">をtok>
 chunk>
 <chunk id="2" link="8" rel="D" score="2.25834" head="6" func="6">
  <tok id="4" read="" base="" pos="名詞-数" ctype="" cform="" ne="B-DATE">5tok>
  <tok id="5" read="ニチ" base="" pos="名詞-接尾-助数詞" ctype="" cform="" ne="I-DATE">日tok>
  <tok id="6" read="マエ" base="" pos="名詞-副詞可能" ctype="" cform="" ne="I-DATE">前tok>
  <tok id="7" read="" base="" pos="記号-読点" ctype="" cform="" ne="O">、tok>
 chunk>
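
If you need the parse programmatically, the XML above can be walked with the standard library. The following is a minimal sketch (not part of jNlp), assuming the <sentence>/<chunk>/<tok> structure shown and input_sentence from section 2.1:

>>> from xml.etree import ElementTree
>>> from jNlp.jCabocha import cabocha
>>> # encode the unicode XML string to UTF-8 bytes before parsing
>>> root = ElementTree.fromstring(cabocha(input_sentence).encode('utf-8'))
>>> for chunk in root.findall('chunk'):
...     surface = ''.join(tok.text for tok in chunk.findall('tok'))
...     print chunk.get('id'), '->', chunk.get('link'), surface.encode('utf-8')

This prints each chunk's id, the id of the chunk it depends on (link), and its surface form.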

2.3   Kanji / Katakana / Hiragana to Tokenized Romaji jConvert.py

Uses data/katakanaChart.txt and parses the chart; see katakanaChart.txt below.

>>> from jNlp.jConvert import *
>>> input_sentence = u'気象庁が21日午前4時48分、発表した天気概況によると、'
>>> print ' '.join(tokenizedRomaji(input_sentence))
>>> print tokenizedRomaji(input_sentence)
...kisyoutyou ga ni ichi nichi gozen yon ji yon hachi hun  hapyou si ta tenki gaikyou ni yoru to
...[u'kisyoutyou', u'ga', u'ni', u'ichi', u'nichi', u'gozen',...]

katakanaChart.txt
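
For illustration only, a hypothetical loader for such a chart is sketched below. It assumes a simple whitespace-separated layout (one katakana character followed by its romaji per row), which may not match the actual format of data/katakanaChart.txt used by jConvert.

# Hypothetical sketch only; the real chart format used by jConvert may differ.
# Assumes each non-empty row holds a katakana character and its romaji.
def load_katakana_chart(path):
    chart = {}
    for line in open(path):
        parts = line.split()
        if len(parts) >= 2:
            kana, romaji = parts[0].decode('utf-8'), parts[1]
            chart[kana] = romaji
    return chart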

2.4   Longest Common String Japanese jProcessing.py

On English Strings

>>> from jNlp.jProcessing import long_substr
>>> a = 'Once upon a time in Italy'
>>> b = 'There was a time in America'
>>> print long_substr(a, b)

Output

...a time in

On Japanese Strings

>>> a = u'これでアナタも冷え知らず'
>>> b = u'これでア冷え知らずナタも'
>>> print long_substr(a, b).encode('utf-8')

Output

...冷え知らず
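
long_substr returns a longest common substring of its two arguments. For reference, a classic dynamic-programming version (a sketch, not jNlp's actual implementation) looks like this:

# Sketch: O(len(a) * len(b)) dynamic programming for a longest common
# substring; jNlp's long_substr may be implemented differently.
def lcs_substr(a, b):
    best, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best:
                    best, best_end = cur[j], i
        prev = cur
    return a[best_end - best:best_end]

Called on the two Japanese strings above, this likewise returns 冷え知らず.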

2.5   Similarity between two sentences jProcessing.py

Estimates similarity with MinHash by checking the overlap between the two strings; see http://en.wikipedia.org/wiki/MinHash

English Strings:

>>> from jNlp.jProcessing import Similarities
>>> s = Similarities()
>>> a = 'There was'
>>> b = 'There is'
>>> print s.minhash(a,b)
...0.444444444444

Japanese Strings:

>>> from jNlp.jProcessing import *
>>> from jNlp.jTokenize import jTokenize
>>> a = u'これは何ですか?'
>>> b = u'これはわからないです'
>>> print s.minhash(' '.join(jTokenize(a)), ' '.join(jTokenize(b)))
...0.210526315789
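
For intuition, the sketch below estimates Jaccard overlap with min-wise hashing over character shingles. The shingle size, number of hash functions, and hashing scheme are illustrative choices, not necessarily those used by Similarities.minhash.

# Sketch of the MinHash idea, not jNlp's exact implementation: the fraction of
# random hash functions on which both shingle sets attain the same minimum
# approximates the Jaccard similarity of the two sets.
import random

def shingles(text, n=3):
    return set(text[i:i + n] for i in range(len(text) - n + 1))

def minhash_similarity(a, b, num_hashes=200, seed=0):
    sa, sb = shingles(a), shingles(b)
    rng = random.Random(seed)
    same = 0
    for _ in range(num_hashes):
        salt = rng.getrandbits(32)
        min_a = min(hash((salt, x)) for x in sa)
        min_b = min(hash((salt, x)) for x in sb)
        if min_a == min_b:
            same += 1
    return float(same) / num_hashes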

3   Edict Japanese Dictionary Search with Example sentences

3.1   Sample Output Demo

3.2   Edict dictionary and example sentences parser.

This package uses the EDICT and KANJIDIC dictionary files. These files are the property of the Electronic Dictionary Research and Development Group, and are used in conformance with the Group's licence.

Edict parser by Paul Goins; see edict_search.py. Edict example-sentence search by query by Pulkit Kathuria; see edict_examples.py. Edict examples pickle files are provided, but the latest example files can be downloaded from the links provided.

3.3   Charset

Two files are required:

  • A UTF-8 charset example file (if not using src/jNlp/data/edict_examples).

    To convert EUC-JP/ISO-8859-1 to UTF-8:

    iconv -f EUCJP -t UTF-8 path/to/edict_examples > path/to/save_with_utf-8

  • An ISO-8859-1 edict dictionary file.

Example sentences are output for a Japanese query only for ambiguous words (words with more than one sense).

3.4   Links

The latest dictionary files can be downloaded here.

3.5   edict_search.py

Author: Paul Goins; license included; link to original:

To print all entries with their sense definitions:

>>> from jNlp.edict_search import *
>>> query = u'認める'
>>> edict_path = 'src/jNlp/data/edict-yy-mm-dd'
>>> kp = Parser(edict_path)
>>> for i, entry in enumerate(kp.search(query)):
...     print entry.to_string().encode('utf-8')

3.6   edict_examples.py

Note: Outputs example sentences only for ambiguous words (words with more than one sense).

Author: Pulkit Kathuria

>>> from jNlp.edict_examples import *
>>> query = u'認める'
>>> edict_path = 'src/jNlp/data/edict-yy-mm-dd'
>>> edict_examples_path = 'src/jNlp/data/edict_examples'
>>> search_with_example(edict_path, edict_examples_path, query)

Output

認める

Sense (1) to recognize;
  EX:01 我々は彼の才能を*認*めている。We appreciate his talent.

Sense (2) to observe;
  EX:01 x線写真で異状が*認*められます。We have detected an abnormality on your x-ray.

Sense (3) to admit;
  EX:01 母は私の計画をよいと*認*めた。Mother approved my plan.
  EX:02 母は決して私の結婚を*認*めないだろう。Mother will never approve of my marriage.
  EX:03 父は決して私の結婚を*認*めないだろう。Father will never approve of my marriage.
  EX:04 彼は女性の喫煙をいいものだと*認*めない。He doesn't approve of women smoking.
  ...

4   Sentiment Analysis Japanese Text

This section covers sentiment analysis of Japanese text using Word Sense Disambiguation, Wordnet-jp (Japanese WordNet, file name wnjpn-all.tab), and SentiWordNet (English SentiWordNet, file name SentiWordNet_3.*.txt).

4.1   Wordnet files download links

  1. http://nlpwww.nict.go.jp/wn-ja/eng/downloads.html
  2. http://sentiwordnet.isti.cnr.it/

4.2   How to Use

The following classifier is a baseline: it simply maps Japanese words to English via WordNet and classifies on the polarity scores from SentiWordNet.

  • All parts of speech are included (adnouns, nouns, verbs, ...)
  • No WSD module is applied to the Japanese sentence
  • Each word's most common sense is used for its polarity score

>>> from jNlp.jSentiments import *
>>> jp_wn = '../../../../data/wnjpn-all.tab'
>>> en_swn = '../../../../data/SentiWordNet_3.0.0_20100908.txt'
>>> classifier = Sentiment()
>>> classifier.train(en_swn, jp_wn)
>>> text = u'監督、俳優、ストーリー、演出、全部最高!'
>>> print classifier.baseline(text)
...Pos Score = 0.625 Neg Score = 0.125
...Text is Positive

4.3   Japanese Word Polarity Score

>>> from jNlp.jSentiments import *
>>> jp_wn = '_dicts/wnjpn-all.tab' #path to Japanese Word Net
>>> en_swn = '_dicts/SentiWordNet_3.0.0_20100908.txt' #Path to SentiWordNet
>>> classifier = Sentiment()
>>> sentiwordnet, jpwordnet  = classifier.train(en_swn, jp_wn)
>>> positive_score = sentiwordnet[jpwordnet[u'全部']][0]
>>> negative_score = sentiwordnet[jpwordnet[u'全部']][1]
>>> print 'pos score = {0}, neg score = {1}'.format(positive_score, negative_score)
...pos score = 0.625, neg score = 0.0
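
For a rough idea of what the two lookup tables hold, the sketch below builds word-to-synset and synset-to-score maps directly from the raw files. It assumes the standard layouts (wnjpn-all.tab: synset, lemma, ...; SentiWordNet 3.0: POS, offset, PosScore, NegScore, ...); jNlp's train() may differ in detail.

# Illustration only -- approximate reconstruction of the lookups that train()
# returns, assuming the standard wnjpn-all.tab and SentiWordNet 3.0 layouts.
def build_tables(en_swn, jp_wn):
    sentiwordnet, jpwordnet = {}, {}
    for line in open(en_swn):
        if line.startswith('#'):
            continue
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 4 or not fields[2]:
            continue
        pos, offset, pos_score, neg_score = fields[0], fields[1], fields[2], fields[3]
        # key synsets as "offset-pos" to match the wnjpn-all.tab synset field
        sentiwordnet[offset + '-' + pos] = (float(pos_score), float(neg_score))
    for line in open(jp_wn):
        fields = line.rstrip('\n').split('\t')
        if len(fields) >= 2:
            synset, lemma = fields[0], fields[1].decode('utf-8')
            jpwordnet[lemma] = synset
    return sentiwordnet, jpwordnet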

5   Contacts

Author: pulkit[at]jaist.ac.jp (replace [at] with @)