Simple multilingual lemmatizer for Python, especially useful for speed and efficiency

Last update: Dec 29, 2022

Overview

Simplemma: a simple multilingual lemmatizer for Python

Purpose

Lemmatization is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form. Unlike stemming, lemmatization outputs word units that are still valid linguistic forms.

In modern natural language processing (NLP), this task is often indirectly tackled by more complex systems encompassing a whole processing pipeline. However, it appears that there is no straightforward way to address lemmatization in Python although this task is useful in information retrieval and natural language processing.

Simplemma provides a simple and multilingual approach to look for base forms or lemmata. It may not be as powerful as full-fledged solutions but it is generic, easy to install and straightforward to use. In particular, it doesn't need morphosyntactic information and can process a raw series of tokens or even a text with its built-in (simple) tokenizer. By design it should be reasonably fast and work in a large majority of cases, without being perfect.

With its comparatively small footprint it is especially useful when speed and simplicity matter, for educational purposes or as a baseline system for lemmatization and morphological analysis.

Currently, 38 languages are partly or fully supported (see table below).

Installation

The current library is written in pure Python with no dependencies:

pip install simplemma

pip3 where applicable
pip install -U simplemma for updates

Usage

Word-by-word

Simplemma is used by selecting a language of interest and then applying the data on a list of words.

>>> import simplemma
# get a word
myword = 'masks'
# decide which language data to load
>>> langdata = simplemma.load_data('en')
# apply it on a word form
>>> simplemma.lemmatize(myword, langdata)
'mask'
# grab a list of tokens
>>> mytokens = ['Hier', 'sind', 'Vaccines']
>>> langdata = simplemma.load_data('de')
>>> for token in mytokens:
>>>     simplemma.lemmatize(token, langdata)
'hier'
'sein'
'Vaccines'
# list comprehensions can be faster
>>> [simplemma.lemmatize(t, langdata) for t in mytokens]
['hier', 'sein', 'Vaccines']

Chaining several languages can improve coverage:

>>> langdata = simplemma.load_data('de', 'en')
>>> simplemma.lemmatize('Vaccines', langdata)
'vaccine'
>>> langdata = simplemma.load_data('it')
>>> simplemma.lemmatize('spaghettis', langdata)
'spaghettis'
>>> langdata = simplemma.load_data('it', 'fr')
>>> simplemma.lemmatize('spaghettis', langdata)
'spaghetti'
>>> simplemma.lemmatize('spaghetti', langdata)
'spaghetto'

There are cases in which a greedier decomposition and lemmatization algorithm is better. It is deactivated by default:

# same example as before, comes to this result in one step
>>> simplemma.lemmatize('spaghettis', mydata, greedy=True)
'spaghetto'
# a German case
>>> langdata = simplemma.load_data('de')
>>> simplemma.lemmatize('angekündigten', langdata)
'ankündigen' # infinitive verb
>>> simplemma.lemmatize('angekündigten', langdata, greedy=False)
'angekündigt' # past participle

Tokenization

A simple tokenization function is included for convenience:

>>> from simplemma import simple_tokenizer
>>> simple_tokenizer('Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.')
['Lorem', 'ipsum', 'dolor', 'sit', 'amet', ',', 'consectetur', 'adipiscing', 'elit', ',', 'sed', 'do', 'eiusmod', 'tempor', 'incididunt', 'ut', 'labore', 'et', 'dolore', 'magna', 'aliqua', '.']

The function text_lemmatizer() chains tokenization and lemmatization. It can take greedy (affecting lemmatization) and silent (affecting errors and logging) as arguments:

>>> from simplemma import text_lemmatizer
>>> langdata = simplemma.load_data('pt')
>>> text_lemmatizer('Sou o intervalo entre o que desejo ser e os outros me fizeram.', langdata)
# caveat: desejo is also a noun, should be desejar here
['ser', 'o', 'intervalo', 'entre', 'o', 'que', 'desejo', 'ser', 'e', 'o', 'outro', 'me', 'fazer', '.']

Caveats

# don't expect too much though
>>> langdata = simplemma.load_data('it')
# this diminutive form isn't in the model data
>>> simplemma.lemmatize('spaghettini', langdata)
'spaghettini' # should read 'spaghettino'
# the algorithm cannot choose between valid alternatives yet
>>> langdata = simplemma.load_data('es')
>>> simplemma.lemmatize('son', langdata)
'son' # valid common name, but what about the verb form?

As the focus lies on overall coverage, some short frequent words (typically: pronouns) can need post-processing, this generally concerns 10-20 tokens per language.

Additionally, the current absence of morphosyntactic information is both an advantage in terms of simplicity and an impassable frontier with respect to lemmatization accuracy, e.g. to disambiguate between past participles and adjectives derived from verbs in Germanic and Romance languages. In most cases, simplemma often doesn't change the input then.

The greedy algorithm rarely produces forms that are not valid. It is designed to work best in the low-frequency range, notably for compound words and neologisms. Aggressive decomposition is only useful as a general approach in the case of morphologically-rich languages. It can also act as a linguistically motivated stemmer.

Bug reports over the issues page are welcome.

Supported languages

The following languages are available using their ISO 639-1 code:

Available languages (2021-10-19)
Code	Language	Word pairs	Acc.	Comments
`bg`	Bulgarian	73,847		low coverage
`ca`	Catalan	579,507
`cs`	Czech	34,674		low coverage
`cy`	Welsh	360,412
`da`	Danish	554,238		alternative: lemmy
`de`	German	683,207	0.95	on UD DE-GSD, see also German-NLP list
`el`	Greek	76,388		low coverage
`en`	English	136,162	0.94	on UD EN-GUM, alternative: LemmInflect
`es`	Spanish	720,623	0.94	on UD ES-GSD
`et`	Estonian	133,104		low coverage
`fa`	Persian	10,967		low coverage
`fi`	Finnish	2,106,359		alternatives: voikko or NLP list
`fr`	French	217,213	0.94	on UD FR-GSD
`ga`	Irish	383,448
`gd`	Gaelic	48,661
`gl`	Galician	384,183
`gv`	Manx	62,765
`hu`	Hungarian	458,847
`hy`	Armenian	323,820
`id`	Indonesian	17,419	0.91	on UD ID-CSUI
`it`	Italian	333,680	0.92	on UD IT-ISDT
`ka`	Georgian	65,936
`la`	Latin	850,283
`lb`	Luxembourgish	305,367
`lt`	Lithuanian	247,337
`lv`	Latvian	57,153
`mk`	Macedonian	57,063
`nb`	Norwegian (Bokmål)	617,940
`nl`	Dutch	254,073	0.91	on UD-NL-Alpino
`pl`	Polish	3,723,580
`pt`	Portuguese	933,730	0.92	on UD-PT-GSD
`ro`	Romanian	311,411
`ru`	Russian	607,416		alternative: pymorphy2
`sk`	Slovak	846,453	0.87	on UD SK-SNK
`sl`	Slovenian	97,050		low coverage
`sv`	Swedish	658,606		alternative: lemmy
`tr`	Turkish	1,333,137	0.88	on UD-TR-Boun
`uk`	Ukrainian	190,472		alternative: pymorphy2

Low coverage mentions means you'd probably be better off with a language-specific library, but simplemma will work to a limited extent. Open-source alternatives for Python are referenced if possible.

The scores are calculated on Universal Dependencies treebanks on single word tokens (including some contractions but not merged prepositions), they describe to what extent simplemma can accurately map tokens to their lemma form. They can be reproduced using the script udscore.py in the tests/ folder.

Roadmap

[-] Add further lemmatization lists
[ ] Grammatical categories as option
[ ] Function as a meta-package?
[ ] Integrate optional, more complex models?

Credits

Software under MIT license, for the linguistic information databases see licenses folder.

The surface lookups (non-greedy mode) use lemmatization lists taken from various sources:

Lemmatization lists by Michal Měchura (Open Database License)
FreeLing project
spaCy lookups data
Wiktionary entries parsed by the Kaikki project
Wikinflection corpus by Eleni Metheniti (CC BY 4.0 License)
Unimorph Project

This rule-based approach based on flexion and lemmatizations dictionaries is to this day an approach used in popular libraries such as spacy.

Contributions

Feel free to contribute, notably by filing issues for feedback, bug reports, or links to further lemmatization lists, rules and tests.

You can also contribute to this lemmatization list repository.

References

Barbaresi A. (2021). Simplemma: a simple multilingual lemmatizer for Python. Zenodo. http://doi.org/10.5281/zenodo.4673264

This work draws from lexical analysis algorithms used in:

Barbaresi, A., & Hein, K. (2017). Data-driven identification of German phrasal compounds. In International Conference on Text, Speech, and Dialogue Springer, pp. 192-200.
Barbaresi, A. (2016). Bootstrapped OCR error detection for a less-resourced language variant. In 13th Conference on Natural Language Processing (KONVENS 2016), pp. 21-26.

Comments

Huge memory usage for some languages, e.g. Finnish
While testing https://github.com/NatLibFi/Annif/pull/626, I noticed that Simplemma needs a lot of memory for Finnish language lemmatization and language detection. I did a little comparison to see how the choice of language affects memory consumption.

First the baseline, English:

#!/usr/bin/env python from simplemma import text_lemmatizer print(text_lemmatizer("she sells seashells on a seashore", lang='en'))

Results (partial) of running this with /usr/bin/time -v:

Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.14 Maximum resident set size (kbytes): 41652

So far so good. Then other languages I know.

Estonian:

print(text_lemmatizer("jääääre kuuuurija töööö", lang='et'))

Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.16 Maximum resident set size (kbytes): 41252

practically same as English

Swedish:

print(text_lemmatizer("sju sköna sjuksköterskor skötte sju sjösjuka sjömän", lang='sv'))

Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.70 Maximum resident set size (kbytes): 190920

somewhat slower, needs 150MB more memory

...

Finnish:

print(text_lemmatizer("mustan kissan paksut posket", lang='fi'))

Elapsed (wall clock) time (h:mm:ss or m:ss): 0:04.44 Maximum resident set size (kbytes): 1059328

oops, now takes at least an order of magnitude longer and 1GB of memory! Why?

I checked the sizes of the data files for these four languages:

14M fi.plzma 2.6M sv.plzma 604K et.plzma 479K en.plzma

There seems to be a correlation with memory usage here. The data file for Finnish is much bigger than the others, and Swedish is also big compared to English and Estonian. But I think something must be wrong if a data file that is ~10MB larger than the others leads to 1GB of extra memory usage.
enhancement
opened by osma 19
Support for Northern Sami language

I propose that Simplemma could support the Northern Sami language (ISO 639-1 code se). I understood from this discussion that adding a new language would require a corpus of word + lemma pairs. My colleague @nikopartanen found at least these two corpora that could perhaps be used as raw material:

SIKOR North Saami free corpus - this is a relatively large (9M tokens) corpus. From the description:

The corpus has been automatically processed and linguistically analyzed with the Giellatekno/Divvun tools. Therefore, it may contain wrong annotations.

Another obvious corpus is the Universal Dependency treebank for Northern Sami.

Thoughts on these two corpora? What needs to be done to make this happen?
enhancement

opened by osma 12
Add some simple suffix rules for Finnish

This PR adds some generated suffix rules for Finnish language, as discussed in #19.

All these rules have an accuracy above 90% as evaluated on the Finnish language dictionary included with Simplemma. Collectively they cover around 6% of the dictionary entries.

opened by osma 10
Porting to C++

This seems like a relatively simple library since I imagine most of the work is done on the data collection part. Would you consider porting it to C++ (and possibly other languages) so it can be more easily integrated to user applications?
question

opened by 1over137 3
Sourcery refactored main branch
Branch main refactored by Sourcery.

If you're happy with these changes, merge this Pull Request using the Squash and merge strategy.

See our documentation here.

Run Sourcery locally

Reduce the feedback loop during development by using the Sourcery editor plugin:

VS Code

PyCharm

Review changes via command line

To manually merge these changes, make sure you're on the main branch, then run:

git fetch origin sourcery/main git merge --ff-only FETCH_HEAD git reset HEAD^

Help us improve this pull request!
opened by sourcery-ai[bot] 3
German lemmatization: Definite articles
Hi @adbar,

thanks for providing this really cool (and capable) library.

Currently, all German definite articles ("der", "die", "das") are lemmatized to "der", which is wrong.

import simplemma simplemma.lemmatize("Das", lang=('de',)) # Output: 'der' simplemma.lemmatize("Die", lang=('de',)) # Output: 'der' simplemma.lemmatize("Der", lang=('de',)) # Output: 'der'

Normally, I would not be too picky about wrong lemmatization (this happens all the time). But since these are among the most common German words, this should be correct. Their respective lemma should be the same word each: "Das" -> "Das", "Die" -> "Die", "Der" -> "Der". The lemma should be lowercase, if the original token is lowerecased, too.
question
opened by GrazingScientist 2
Add some simple suffix rules for Finnish (Sourcery refactored)
Pull Request #23 refactored by Sourcery.

Since the original Pull Request was opened as a fork in a contributor's repository, we are unable to create a Pull Request branching from it.

To incorporate these changes, you can either:

Merge this Pull Request instead of the original, or

Ask your contributor to locally incorporate these commits and push them to the original Pull Request

Incorporate changes via command line

git fetch https://github.com/adbar/simplemma pull/23/head git merge --ff-only FETCH_HEAD git push

NOTE: As code is pushed to the original Pull Request, Sourcery will re-run and update (force-push) this Pull Request with new refactorings as necessary. If Sourcery finds no refactorings at any point, this Pull Request will be closed automatically.

See our documentation here.

Run Sourcery locally

Reduce the feedback loop during development by using the Sourcery editor plugin:

VS Code

PyCharm

Help us improve this pull request!
opened by sourcery-ai[bot] 2
Add some simple suffix rules for Finnish (Sourcery refactored)
Pull Request #23 refactored by Sourcery.

Since the original Pull Request was opened as a fork in a contributor's repository, we are unable to create a Pull Request branching from it.

To incorporate these changes, you can either:

Merge this Pull Request instead of the original, or

Ask your contributor to locally incorporate these commits and push them to the original Pull Request

Incorporate changes via command line

git fetch https://github.com/adbar/simplemma pull/23/head git merge --ff-only FETCH_HEAD git push

NOTE: As code is pushed to the original Pull Request, Sourcery will re-run and update (force-push) this Pull Request with new refactorings as necessary. If Sourcery finds no refactorings at any point, this Pull Request will be closed automatically.

See our documentation here.

Run Sourcery locally

Reduce the feedback loop during development by using the Sourcery editor plugin:

VS Code

PyCharm

Help us improve this pull request!
opened by sourcery-ai[bot] 2
English lemmatize returns funny nonsense lemma 'splitshine' for token 'e'

Probably not intended, but this comes up with the current version 0.8.0

import simplemma simplemma.lemmatize("e", "en") 'spitshine'

BR, Bart Depoortere
bug

opened by bartdpt 2
How can I speed up Simplemma for Polish?
First of all, thank you for creating this great lemmatizer!

I've used it for English and it's blazing fast (in my trials, it found the lemma in less than 10 ms). For other languages, such as Portuguese and Spanish it's still reasonably fast, with lemmatization working under 50ms.

For Polish, however, lemmatization is taking over 2 seconds. I know Polish is a more complicated language for lemmatizing because it's inflected, but is there a way I can speed it up? Ideally, I'd like to have it below 100ms, but even under 500ms I think it would be good enough.

In my trials, I'm passing a single word and doing it simply like this:

import simplemma import time start = time.time() langdata = simplemma.load_data('pl') print(simplemma.lemmatize('robi', langdata)) end = time.time() print(end-start)
opened by rafaelsaback 2
Support for Asturian

Hi, the README says that this project uses data from Lemmatization lists by Michal Měchura, but the Asturian language is not supported in simplemma.

I'm wondering if there there are any plans to add the support for Asturian?
enhancement

opened by BLKSerene 1
Documentation: Please elaborate on the `greedy` parameter

Hi @adbar ,

could you please in you documentation go into detail on what the greedy parameter actually does. In the documentation, it is mentioned as if it would be self-explanatory. However, I really cannot estimate its potential/dangers.

Thanks! :smiley:
documentation

opened by GrazingScientist 0
Czech tokenizer is very slow
Thanks for this great tool! I'm using it to align English lemmas with Czech lemmas for an MT use case. I'm finding that while the English lemmatizer is extremely fast (haven't benchmarked, but your ~250k/s estimate seems reasonably close), Czech is quite slow.

A tiny microbenchmark (admittedly misleading) shows disparity between the two:

In [1]: from simplemma import lemmatize In [2]: %timeit -n1 lemmatize("fanoušků", lang="cs") The slowest run took 768524.54 times longer than the fastest. This could mean that an intermediate result is being cached. 27.6 ms ± 67.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) In [3]: %timeit -n1 lemmatize("banks", lang="en") The slowest run took 464802.09 times longer than the fastest. This could mean that an intermediate result is being cached. 16.9 ms ± 41.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I see that lemmas are cached in the source code, with a single loop we see a near double penalty for Czech over English. I'm not exactly sure how to improve this immediately so I'm reporting it here for awareness and for others. 😄
question
opened by erip 3
Language detector can be tricked
I'm not sure if this is even worth reporting, but I was looking closer at how the language detection works and I realized that for languages that have rules (English and German currently), it can be quite easily tricked to accept nonsensical input and respond with high confidence that it is English (or probably German):

>>> from simplemma.langdetect import in_target_language, lang_detector >>> in_target_language('grxlries cnwated pltdoms', lang='en') 1.0 >>> lang_detector('grxlries cnwated pltdoms', lang=('en','sv','de','fr','es')) [('en', 1.0), ('unk', 0.0), ('sv', 0.0), ('de', 0.0), ('fr', 0.0), ('es', 0.0)]

This happens because the English language rules match the word suffixes (in this case, -ries, -ated and -doms) in the input words without looking too carefully at the preceding characters (the word length matters though).

This is unlikely to happen with real world text inputs, but it's possible for an attacker to deliberately trick the detector.
question
opened by osma 2
Additional inflection data for RU & UK

Hi, I'm the author of SSM which is a language learning utility for quickly making vocabulary flashcards. Thanks for this project! Without this it would have been difficult to provide multilingual lemmatization, which is an essential aspect of this tool.

However, I found that this is not particularly accurate for Russian. PyMorphy2 is a lemmatizer for Russian that I used in other projects. It's very fast and accurate in my experience, much more than spacy or anything else. Any chance you can include PyMorphy2's data in this library?
question

opened by 1over137 10
Use additional sources for better coverage
[x] http://unimorph.ethz.ch/languages

[x] https://github.com/lenakmeth/Wikinflection-Corpus

[x] https://github.com/TALP-UPC/FreeLing/tree/master/data/

[x] https://github.com/explosion/spacy-lookups-data/tree/master/spacy_lookups_data/data

[x] https://github.com/tatuylonen/wiktextract

[ ] http://pauillac.inria.fr/~sagot/index.html#udlexicons

enhancement
opened by adbar 0

Releases(v0.9.0)

v0.9.0(Oct 18, 2022)
smaller data files (especially for fi, la, pl, pt, sk & tr, #19)

added support for Asturian (ast, #20)

bug fixes (#18, #26)

Source code(tar.gz)
Source code(zip)
v0.8.2(Sep 5, 2022)
languages added: Albanian, Hindi, Icelandic, Malay, Middle English, Northern Sámi, Nynorsk, Serbo-Croatian, Swahili, Tagalog

fix for slow language detection introduced in 0.7.0

Full Changelog: https://github.com/adbar/simplemma/compare/v0.8.1...v0.8.2
Source code(tar.gz)
Source code(zip)
v0.8.1(Sep 1, 2022)
better rules for English and German

inconsistencies fixed for cy, de, en, ga, sv (#16)

docs: added language detection and citation info

Full Changelog: https://github.com/adbar/simplemma/compare/v0.8.0...v0.8.1
Source code(tar.gz)
Source code(zip)
v0.8.0(Aug 2, 2022)
code fully type checked, optional pre-compilation with mypyc

fixes: logging error (#11), input type (#12)

code style: black

Full Changelog: https://github.com/adbar/simplemma/compare/v0.7.0...v0.8.0
Source code(tar.gz)
Source code(zip)
v0.7.0(Jun 16, 2022)
breaking change: language data pre-loading now occurs internally, language codes are now directly provided in lemmatize() call, e.g. simplemma.lemmatize("test", lang="en")

faster lemmatization and result cache

sentence-aware text_lemmatizer()

optional iterators for tokenization and lemmatization

Full Changelog: https://github.com/adbar/simplemma/compare/v0.6.0...v0.7.0
Source code(tar.gz)
Source code(zip)
v0.6.0(Apr 6, 2022)
improved language models

improved tokenizer

maintenance and code efficiency

added basic language detection (undocumented)

Full Changelog: https://github.com/adbar/simplemma/compare/v0.5.0...v0.6.0
Source code(tar.gz)
Source code(zip)
v0.5.0(Nov 19, 2021)
faster, more efficient code

dropped support for Python 3.5

Source code(tar.gz)
Source code(zip)
v0.4.0(Oct 19, 2021)
new languages: Armenian, Greek, Macedonian, Norwegian (Bokmål), and Polish

language data reviewed for: Dutch, Finnish, German, Hungarian, Latin, Russian, and Swedish

Urdu removed of language list due to issues with the data

add support for Python 3.10 and drop support for Python 3.4

improved decomposition and tokenization algorithms

Source code(tar.gz)
Source code(zip)
v0.3.0(Apr 8, 2021)
improved models and disambiguation

improved tokenization

extended rules for German

Source code(tar.gz)
Source code(zip)
v0.2.2(Feb 24, 2021)
Work on decomposition rules

Reviewed language data

Cleaner code

Source code(tar.gz)
Source code(zip)
v0.2.1(Feb 2, 2021)
Better decomposition into subwords by greedy algorithm

First benchmarks and data-based corrections: German, French, English, Spanish

Source code(tar.gz)
Source code(zip)
v0.2.0(Jan 25, 2021)
Languages added: Danish, Dutch, Finnish, Georgian, Indonesian, Latin, Latvian, Lithuanian, Luxembourgish, Turkish, Urdu

Improved word pair coverage

Tokenization functions added

Limit greediness and range of potential candidates

Source code(tar.gz)
Source code(zip)
v0.1.0(Jan 18, 2021)

Source code(tar.gz)
Source code(zip)

Owner

Adrien Barbaresi

Research scientist – natural language processing, web scraping and text analytics. Mostly with Python.

GitHub Repository https://adrien.barbaresi.eu/blog/simple-multilingual-lemmatizer-python.html

一个基于Nonebot2和go-cqhttp的娱乐性qq机器人

Takker - 一个普通的QQ机器人此项目为基于 Nonebot2 和 go-cqhttp 开发，以 Sqlite 作为数据库的QQ群娱乐机器人关于纯兴趣开发，部分功能借鉴了大佬们的代码，作为Q群的娱乐+功能性Bot 声明此项目仅用于学习交流，请勿用于非法用途这是开发者的第一个Pytho

79 Dec 29, 2022

Code for the paper in Findings of EMNLP 2021: "EfficientBERT: Progressively Searching Multilayer Perceptron via Warm-up Knowledge Distillation".

This repository contains the code for the paper in Findings of EMNLP 2021: "EfficientBERT: Progressively Searching Multilayer Perceptron via Warm-up Knowledge Distillation".

28 Nov 10, 2022

Bu Chatbot, Konya Bilim Merkezi Yen için tasarlanmış olan bir projedir.

chatbot Bu Chatbot, Konya Bilim Merkezi Yeni Ufuklar Sergisi için 2021 Yılında tasarlanmış olan bir projedir. Chatbot Python ortamında yazılmıştır. Sö

1 Feb 23, 2022

This repository contains (not all) code from my project on Named Entity Recognition in philosophical text

NERphilosophy 👋 Welcome to the github repository of my BsC thesis. This repository contains (not all) code from my project on Named Entity Recognitio

1 Jan 27, 2022

Multilingual word vectors in 78 languages

Aligning the fastText vectors of 78 languages Facebook recently open-sourced word vectors in 89 languages. However these vectors are monolingual; mean

1.2k Dec 17, 2022

This repository contains the code, data, and models of the paper titled "XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages" published in Findings of the Association for Computational Linguistics: ACL 2021.

XL-Sum This repository contains the code, data, and models of the paper titled "XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Lang

189 Jan 02, 2023

API for the GPT-J language model 🦜. Including a FastAPI backend and a streamlit frontend

gpt-j-api 🦜 An API to interact with the GPT-J language model. You can use and test the model in two different ways: Streamlit web app at http://api.v

276 Dec 31, 2022

A modular Karton Framework service that unpacks common packers like UPX and others using the Qiling Framework.

Unpacker Karton Service A modular Karton Framework service that unpacks common packers like UPX and others using the Qiling Framework. This project is

45 Jan 05, 2023

T‘rex Park is a Youzan sponsored project. Offering Chinese NLP and image models pretrained from E-commerce datasets

T‘rex Park is a Youzan sponsored project. Offering Chinese NLP and image models pretrained from E-commerce datasets (product titles, images, comments, etc.).

55 Nov 22, 2022

Russian GPT3 models.

Russian GPT-3 models (ruGPT3XL, ruGPT3Large, ruGPT3Medium, ruGPT3Small) trained with 2048 sequence length with sparse and dense attention blocks. We also provide Russian GPT-2 large model (ruGPT2Larg

1.6k Jan 05, 2023

Shared code for training sentence embeddings with Flax / JAX

flax-sentence-embeddings This repository will be used to share code for the Flax / JAX community event to train sentence embeddings on 1B+ training pa

23 Dec 30, 2022

Python implementation of TextRank for phrase extraction and summarization of text documents

PyTextRank PyTextRank is a Python implementation of TextRank as a spaCy pipeline extension, used to: extract the top-ranked phrases from text document

1.9k Jan 06, 2023

Sinkhorn Transformer - Practical implementation of Sparse Sinkhorn Attention

Sinkhorn Transformer This is a reproduction of the work outlined in Sparse Sinkhorn Attention, with additional enhancements. It includes a parameteriz

217 Nov 25, 2022

문장단위로 분절된 나무위키 데이터셋. Releases에서 다운로드 받거나, tfds-korean을 통해 다운로드 받으세요.

Namuwiki corpus 문장단위로 미리 분절된 나무위키 코퍼스. 목적이 LM등에서 사용하기 위한 데이터셋이라, 링크/이미지/테이블 등등이 잘려있습니다. 문장 단위 분절은 kss를 활용하였습니다. 라이선스는 나무위키에 명시된 바와 같이 CC BY-NC-SA 2.0

16 Apr 02, 2022

Natural Language Processing Specialization

Natural Language Processing Specialization In this folder, Natural Language Processing Specialization projects and notes can be found. WHAT I LEARNED

3 Oct 06, 2022

NLP Overview

NLP-Overview Introduction The field of NPL encompasses a variety of topics which involve the computational processing and understanding of human langu

1 Jan 13, 2022

Application to help find best train itinerary, uses speech to text, has a spam filter to segregate invalid inputs, NLP and Pathfinding algos.

T-IAI-901-MSC2022 - GROUP 18 Gestion de projet Notre travail a été organisé et réparti dans un Trello. https://trello.com/b/X3s2fpPJ/ia-projet Install

1 Feb 05, 2022

Anomaly Detection 이상치 탐지 전처리 모듈

Anomaly Detection 시계열 데이터에 대한 이상치 탐지 1. Kernel Density Estimation을 활용한 이상치 탐지 train_data_path와 test_data_path에 존재하는 시점 정보를 포함하고 있는 csv 형태의 train data와

43 Nov 28, 2022

We have built a Voice based Personal Assistant for people to access files hands free in their device using natural language processing.

Voice Based Personal Assistant We have built a Voice based Personal Assistant for people to access files hands free in their device using natural lang

2 Nov 13, 2021

Trex is a tool to match semantically similar functions based on transfer learning.

62 Dec 28, 2022