Persian Lexicon

This repo uses the Uppsala Persian Corpus (UPC) to construct a lexicon of 70,664 unique words. With all the excitement around the game Wordle, we also extracted words of different lengths (2, 3, 4, ..., 10) and stored them in separate files for easier access. Please note that these files might contain offensive words; I have not checked them manually.

GetWords.py can read these files and return words as a list of strings.
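For readers who want to load a word file without GetWords.py, here is a minimal sketch. It assumes the files are UTF-8 text with one word per line, which is not stated above. The function name `read_words` is hypothetical and is not taken from GetWords.py.

```python
from pathlib import Path


def read_words(path):
    """Return the non-empty lines of a word file as a list of strings.

    Assumption: the file is UTF-8 encoded with one word per line.
    """
    text = Path(path).read_text(encoding="utf-8")
    # Strip surrounding whitespace and skip blank lines.
    return [line.strip() for line in text.splitlines() if line.strip()]
```

Under these assumptions, `read_words("data/persian-words.txt")` would return the main lexicon as a list of strings.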

Cleanup details

Main Lexicon

The main lexicon (data/persian-words.txt) is built very liberally; we only filter out words that contain ASCII characters or Arabic numerals.
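The filter described above could look roughly like the sketch below. This is not the repo's actual cleanup code. It reads "Arabic numerals" broadly as any Unicode digit (ASCII 0-9, Arabic-Indic, and Persian digits), which is an assumption, and the name `is_clean` is hypothetical.

```python
def is_clean(word):
    """Return True if the word has no ASCII characters and no digits.

    Assumption: "Arabic numerals" is read as any Unicode digit, so
    ASCII, Arabic-Indic, and Persian digits are all rejected.
    """
    return not any(ch.isascii() or ch.isdigit() for ch in word)


def filter_words(words):
    """Keep only the words that pass is_clean, preserving order."""
    return [w for w in words if is_clean(w)]
```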

Fixed length Lexicons

More conservative filtering has been applied to files with fixed word length. We drop all words that contain any of the following characters:

After applying these filters, we ended up with the following number of words per file:

2 letter words: 310 unique words
3 letter words: 2378 unique words
4 letter words: 7059 unique words
5 letter words: 10043 unique words
6 letter words: 9541 unique words
7 letter words: 7350 unique words
8 letter words: 4681 unique words
9 letter words: 2529 unique words
10 letter words: 1250 unique words
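As an illustration of how such per-length files could be produced, the sketch below groups unique words by length. It assumes that word length means the number of Unicode code points, so a zero-width non-joiner inside a word would count as one character; the repo may count differently. The name `group_by_length` is hypothetical.

```python
from collections import defaultdict


def group_by_length(words, min_len=2, max_len=10):
    """Group unique words by length, keeping lengths in [min_len, max_len].

    Assumption: length is measured in Unicode code points via len().
    """
    groups = defaultdict(set)
    for word in words:
        if min_len <= len(word) <= max_len:
            groups[len(word)].add(word)
    # Sort lengths and words so the output is deterministic.
    return {n: sorted(ws) for n, ws in sorted(groups.items())}
```

Each value in the returned dictionary could then be written to its own file, one word per line.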

Persian-lexicon - A lexicon of 70K unique Persian (Farsi) words


Owner

Saman Vaisipour
