Maha is a text processing library specially developed to deal with Arabic text.

An Arabic text processing library intended for use in NLP applications


Maha is a text processing library specially developed to deal with Arabic text. The beta version can be used to clean and parse text, files, and folders with or without streaming capability.

If you need help or want to discuss topics related to Maha, feel free to join our Discord server. To submit a bug report or feature request, please open an issue.

Installation

Simply run the following to install Maha:

pip install mahad # pronounced maha d

For source installation, check the documentation.

Overview

Check out the overview section in the documentation to get started with Maha.

Documentation

Documentation is hosted at ReadTheDocs.

Contributing

Maha welcomes contributions from everyone. Feel free to take a look at our contribution guidelines in the documentation.

License

Maha is BSD-licensed.

Comments
  • Time: Add the ability to parse Hijri dates

    What does this pull request change?

    Closes #27.

    Status (please check what you already did):

    • [x] added some tests for the functionality
    • [ ] updated the documentation
    • [x] tox passes
    new feature highlight 
    opened by TRoboto 6
  • Added distance to dimension parsing

    What does this pull request change?

    Resolves #15.

    Status (please check what you already did):

    • [x] added some tests for the functionality
    • [x] updated the documentation
    • [x] tox passes
    parsing highlight 
    opened by TRoboto 5
  • Introduce :mod:`~.datasets` module and the first dataset, `names`, with over 40,000 unique names

    What does this pull request change?

    This PR introduces a new datasets module that offers an interface for all upcoming datasets. A new dataset, names, is released along with the module. It comprises 44,161 unique names with descriptions and name origin included for most names.

    Link to updated docs: https://maha--40.org.readthedocs.build/en/40/overview.html#datasets

    Status (please check what you already did):

    • [x] added some tests for the functionality
    • [x] updated the documentation
    • [x] tox passes
    new feature highlight 
    opened by TRoboto 4
  • Add pyupgrade to pre-commit and upgrade to future-style type annotations

    What does this pull request change?

    Upgrades to the new type annotation style.

    Status (please check what you already did):

    • [ ] added some tests for the functionality
    • [ ] updated the documentation
    • [x] tox passes
    maintenance 
    opened by TRoboto 3
  • Deprecate and remove `datasets` module and host datasets on Hugging Face instead

    What does this pull request change?

    • Removes datasets module.
    • Datasets are now hosted here

    Status (please check what you already did):

    • [ ] added some tests for the functionality
    • [ ] updated the documentation
    • [x] tox passes
    breaking changes deprecation 
    opened by TRoboto 3
  • Add the ability to parse names from text

    What does this pull request change?

    Adds #24. Depends on #40

    Status (please check what you already did):

    • [x] added some tests for the functionality
    • [x] updated the documentation
    • [x] tox passes
    new feature highlight 
    opened by TRoboto 3
  • Add a deprecation system

    What does this pull request change?

    • Closes #23
    • Adds three deprecation decorators: one for functions, one for parameters, and one for default parameters.
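
A function-level deprecation decorator of the kind described above can be sketched with the standard `warnings` module. This is a minimal illustration, not Maha's implementation: the names `deprecated_fn`, `old_clean`, and `new_clean`, and the version string `X.Y.Z`, are hypothetical.

```python
import functools
import warnings


def deprecated_fn(since: str, message: str = ""):
    """Mark a function as deprecated (hypothetical helper, not Maha's API)."""

    def decorator(func):
        @functools.wraps(func)  # keep the wrapped function's name and docstring
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"{func.__name__} is deprecated since {since}. {message}".strip(),
                category=DeprecationWarning,
                stacklevel=2,  # point the warning at the caller, not the wrapper
            )
            return func(*args, **kwargs)

        return wrapper

    return decorator


# Hypothetical usage: the function still works, but warns on every call.
@deprecated_fn(since="X.Y.Z", message="Use new_clean instead.")
def old_clean(text: str) -> str:
    return text.strip()
```

Decorators for deprecated parameters would follow the same pattern, inspecting `kwargs` inside the wrapper before warning.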

    Status (please check what you already did):

    • [x] added some tests for the functionality
    • [ ] updated the documentation
    • [x] tox passes
    development 
    opened by saedx1 3
  • Prepare for the next release of Maha (v0.3.0)

    This is an auto-generated PR to prepare for the next release of Maha. The following changes were automatically made:

    • Generated changelogs for release v0.3.0.
    • Bumped pypi version to v0.3.0.
    • Updated the citation information.
    opened by github-actions[bot] 2
  • Ordinal: Add support to `بعد` in ordinal parsing

    What does this pull request change?

    Closes #48.

    Status (please check what you already did):

    • [x] added some tests for the functionality
    • [ ] updated the documentation
    • [x] tox passes
    new feature 
    opened by TRoboto 2
  • Numeral: Add support for hierarchical parsing

    What does this pull request change?

    Closes #25

    Status (please check what you already did):

    • [x] added some tests for the functionality
    • [ ] updated the documentation
    • [x] tox passes
    new feature 
    opened by TRoboto 2
  • Prepare for the next release of Maha (v0.2.0)

    This is an auto-generated PR to prepare for the next release of Maha. The following changes were automatically made:

    • Generated changelogs for release v0.2.0.
    • Bumped pypi version to v0.2.0.
    • Updated the citation information.
    opened by github-actions[bot] 2
  • Update ci.yml

    Check support for Python 3.10.

    What does this pull request change? It checks whether the library supports Python 3.10.

    • ...

    Status (please check what you already did):

    • [ ] added some tests for the functionality
    • [ ] updated the documentation
    • [ ] tox passes
    opened by PAIN-BARHAM 1
  • Add the option to ignore Harakat when removing or replacing

    What problem are you trying to solve?

    Currently, the cleaner functions do not consider two strings similar if they have different Harakat/diacritics, which is the correct behavior. However, it would be great if the user had the option to ignore Harakat when comparing strings.

    Examples (if relevant)

    Current:

    >>> from maha.cleaners.functions import remove
    >>> output = remove("يُدَرِّسُ اللُّغَةَ العَرَبِيَّةَ الفُصْحَى", custom_expressions=r"اللغة")
    >>> output
    يُدَرِّسُ اللُّغَةَ العَرَبِيَّةَ الفُصْحَى
    

    Suggested:

    >>> from maha.cleaners.functions import remove
    >>> output = remove("يُدَرِّسُ اللُّغَةَ العَرَبِيَّةَ الفُصْحَى", custom_expressions=r"اللغة", ignore_harakat=True)
    >>> output
    يُدَرِّسُ العَرَبِيَّةَ الفُصْحَى
    

    Definition of Done

    • It must adhere to the coding style used in the defined cleaner functions.
    • The implementation should cover most use cases.
    • Adding tests
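
One possible way to approach this request is to build a regular expression that tolerates any Harakat between the letters of the search expression. The sketch below uses only the standard `re` module and does not touch Maha's cleaners. The helper names are hypothetical, and the Harakat range (U+064B to U+0652) is an assumption that covers the common diacritics only.

```python
import re

# Common Arabic diacritics (Harakat): fathatan through sukun, U+064B to U+0652.
HARAKAT = "\u064B-\u0652"


def harakat_insensitive_pattern(expression: str) -> str:
    """Build a regex matching `expression` with any Harakat between its letters."""
    optional_harakat = f"[{HARAKAT}]*"
    # Drop Harakat from the expression itself, then escape each remaining letter.
    letters = [re.escape(ch) for ch in expression if not re.match(f"[{HARAKAT}]", ch)]
    return optional_harakat.join(letters) + optional_harakat


def remove_ignoring_harakat(text: str, expression: str) -> str:
    """Remove every occurrence of `expression` from `text`, ignoring Harakat."""
    cleaned = re.sub(harakat_insensitive_pattern(expression), "", text)
    # Collapse the extra spaces left behind by the removed word.
    return re.sub(r" {2,}", " ", cleaned).strip()
```

With this approach the undiacritized expression `اللغة` would match its fully diacritized form in the example sentence, which a plain substring or regex match does not.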
    feature request 
    opened by xaleel 1
  • Wrong parsed name using name dimension

    What happened?

    The name parser extracts wrong names, such as بي and شكرا.

    Example: text: أريد البحث في سجل الإنفاق الخاص بي [Dimension(body=بي, value=بي, start=32, end=34, dimension_type=DimensionType.NAME)]

    I expect only names that appear in the names dataset to be extracted.

    Python version

    3.8

    What operating system are you using?

    Linux

    Code to reproduce the issue

    >>> from maha.parsers.functions import parse_dimension
    >>> text = "أريد البحث في سجل الإنفاق الخاص بي"
    >>> extracted = parse_dimension(text, names=True)
    >>> extracted
    [Dimension(body=بي, value=بي, start=32, end=34, dimension_type=DimensionType.NAME)]
    

    Relevant log output

    No response

    bug parsing 
    opened by PAIN-BARHAM 0
  • Add feature to parse duration period

    What problem are you trying to solve?

    Parsing a duration from text that expresses it as the difference between two dates.

    Examples (if relevant)

    >>> from maha.parsers.functions import parse_dimension
    >>> output = parse_dimension('عن ربع نمو سكان العالم القديم والتحضر بين 1700 و 1900 ميلادي', duration=True)[0].value
    >>> output
    DurationValue(values=[ValueUnit(value=200, unit=<DurationUnit.YEARS: 7>)], normalized_unit=<DurationUnit.SECONDS: 1>)
    
    

    Definition of Done

    • It must adhere to the coding style used in the defined dimensions, in particular the duration dimension.
    • The implementation should cover most use cases.
    • Adding tests
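
The core of the request can be illustrated with a small standalone sketch that handles only the "between year and year" case from the example. It uses the standard `re` module rather than Maha's parsing rules, and the names `BETWEEN_YEARS` and `years_between` are hypothetical. A real implementation would need to cover full dates and other phrasings.

```python
import re
from typing import Optional

# "بين <year> و <year>" means "between <year> and <year>".
BETWEEN_YEARS = re.compile(r"بين\s+(\d{4})\s+و\s*(\d{4})")


def years_between(text: str) -> Optional[int]:
    """Return the number of years between two 4-digit years, or None if absent."""
    match = BETWEEN_YEARS.search(text)
    if match is None:
        return None
    start, end = (int(group) for group in match.groups())
    return abs(end - start)
```

For the sentence in the example, the two captured years are 1700 and 1900, so the function returns 200, which corresponds to the `value=200` in the requested `DurationValue`.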
    feature request 
    opened by PAIN-BARHAM 1
  • Adding the parser functionality to Processors

    What problem are you trying to solve?

    Adding the parser functionality to Processors to parse different dimensions.

    Examples (if relevant)

    >>> from pathlib import Path
    >>> import maha
    >>> resource_path = Path(maha.__file__).parents[1] / "sample_data/tweets.txt"
    >>> data = resource_path.read_text()
    >>> print(data)
    
    الساعة الآن 12:00 في اسبانيا 🇪🇸, انتهى بشكل رسمي عقد الأسطورة ليو ميسي مع برشلونة . .
    طبعا بكونو حاطين المكيف ع٣ مئوية وخود تقلبات وبرد وحر وCNS وزعيق المراقب وألف نيلة وقر فتحت اشوف درجة الحرارة هتبقي كام يو الامتحان لقيتها ٤٢ والامتحان الساعه ١ فعايز انورماليز اننا ننزل بالفالنه الحمالات Hot fac
    يسعدلي مساكم ❤🌹 شرح كلمة zwa هالمنشور رح تلاقو (zwar) سهل و لذيذ (aber) ناقصو شوية ملح وكزبر #منقو
    مـعلش استحملوني ب الاصفر هالفتره 💛 #ريشـه هههههههه
    لما حد يسالني بتختفي كتير لية =..
    زيِّنوا ليلة الجمع بالصلاة على النَّبِيِّ ﷺ" ❤
    #Windows11 is on the horizon. What feature are you looking forward to
    Get vaccinate #savethesaviour
    Today I am beginning project on 10 days duratio #30daysofcod #DEVCommunit
    
    >>> from maha.processors import FileProcessor
    >>> proc = FileProcessor(resource_path)
    >>> parsed = proc.parse_dimension(time=True)
    >>> parsed
    [Dimension(body=الساعة الآن 12:00, value=TimeValue(years=0, months=0, days=0, hours=0, minutes=0, seconds=0, hour=12, minute=0, second=0, microsecond=0), start=0, end=17, dimension_type=DimensionType.TIME),
     Dimension(body=الساعه ١, value=TimeValue(hour=1, minute=0, second=0, microsecond=0), start=238, end=246, dimension_type=DimensionType.TIME),
     Dimension(body=ليلة, value=TimeValue(am_pm='PM'), start=491, end=495, dimension_type=DimensionType.TIME)]
    
    

    Definition of Done

    • It must adhere to the coding style.
    • The implementation should cover most use cases.
    • Adding tests.
    good first issue feature request parsing 
    opened by PAIN-BARHAM 0
Releases (v0.3.0)
Owner
Mohammad Al-Fetyani
Machine Learning Engineer