NLPretext packages in a unique library all the text preprocessing functions you need to ease your NLP project.

Overview

NLPretext

Working on an NLP project and tired of always looking for the same silly preprocessing functions on the web? 😫

Need to efficiently extract email adresses from a document? Hashtags from tweets? Remove accents from a French post? 😥

NLPretext got you covered! 🚀

NLPretext packages in a unique library all the text preprocessing functions you need to ease your NLP project.

🔍 Quickly explore below our preprocessing pipelines and individual functions referential.

Cannot find what you were looking for? Feel free to open an issue.

Installation

This package has been tested on Python 3.6, 3.7 and 3.8.

We strongly advise you to do the remaining steps in a virtual environnement.

To install this library you just have to run the following command:

pip install nlpretext

This library uses Spacy as tokenizer. Current models supported are en_core_web_sm and fr_core_news_sm. If not installed, run the following commands:

pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.3.1/en_core_web_sm-2.3.1.tar.gz
pip install https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-2.3.0/fr_core_news_sm-2.3.0.tar.gz

Preprocessing pipeline

Default pipeline

Need to preprocess your text data but no clue about what function to use and in which order? The default preprocessing pipeline got you covered:

from nlpretext import Preprocessor
text = "I just got the best dinner in my life @latourdargent !!! I  recommend 😀 #food #paris \n"
preprocessor = Preprocessor()
text = preprocessor.run(text)
print(text)
# "I just got the best dinner in my life !!! I recommend"

Create your custom pipeline

Another possibility is to create your custom pipeline if you know exactly what function to apply on your data, here's an example:

from nlpretext import Preprocessor
from nlpretext.basic.preprocess import (normalize_whitespace, remove_punct, remove_eol_characters,
remove_stopwords, lower_text)
from nlpretext.social.preprocess import remove_mentions, remove_hashtag, remove_emoji
text = "I just got the best dinner in my life @latourdargent !!! I  recommend 😀 #food #paris \n"
preprocessor = Preprocessor()
preprocessor.pipe(lower_text)
preprocessor.pipe(remove_mentions)
preprocessor.pipe(remove_hashtag)
preprocessor.pipe(remove_emoji)
preprocessor.pipe(remove_eol_characters)
preprocessor.pipe(remove_stopwords, args={'lang': 'en'})
preprocessor.pipe(remove_punct)
preprocessor.pipe(normalize_whitespace)
text = preprocessor.run(text)
print(text)
# "dinner life recommend"

Take a look at all the functions that are available here in the preprocess.py scripts in the different folders: basic, social, token.

Individual Functions

Replacing emails

from nlpretext.basic.preprocess import replace_emails
example = "I have forwarded this email to [email protected]"
example = replace_emails(example, replace_with="*EMAIL*")
print(example)
# "I have forwarded this email to *EMAIL*"

Replacing phone numbers

from nlpretext.basic.preprocess import replace_phone_numbers
example = "My phone number is 0606060606"
example = replace_phone_numbers(example, country_to_detect=["FR"], replace_with="*PHONE*")
print(example)
# "My phone number is *PHONE*"

Removing Hashtags

from nlpretext.social.preprocess import remove_hashtag
example = "This restaurant was amazing #food #foodie #foodstagram #dinner"
example = remove_hashtag(example)
print(example)
# "This restaurant was amazing"

Extracting emojis

from nlpretext.social.preprocess import extract_emojis
example = "I take care of my skin 😀"
example = extract_emojis(example)
print(example)
# [':grinning_face:']

Data augmentation

The augmentation module helps you to generate new texts based on your given examples by modifying some words in the initial ones and to keep associated entities unchanged, if any, in the case of NER tasks. If you want words other than entities to remain unchanged, you can specify it within the stopwords argument. Modifications depend on the chosen method, the ones currently supported by the module are substitutions with synonyms using Wordnet or BERT from the nlpaug library.

from nlpretext.augmentation.text_augmentation import augment_text
example = "I want to buy a small black handbag please."
entities = [{'entity': 'Color', 'word': 'black', 'startCharIndex': 22, 'endCharIndex': 27}]
example = augment_text(example, method=wordnet_synonym”, entities=entities)
print(example)
# "I need to buy a small black pocketbook please."

Make HTML documentation

In order to make the html Sphinx documentation, you need to run at the nlpretext root path: sphinx-apidoc -f nlpretext -o docs/ This will generate the .rst files. You can generate the doc with cd docs && make html

You can now open the file index.html located in the build folder.

Project Organization


├── LICENSE
├── VERSION
├── CONTRIBUTING.md     <- Contribution guidelines
├── README.md           <- The top-level README for developers using this project.
├── .github/workflows   <- Where the CI lives
├── datasets/external   <- Bash scripts to download external datasets
├── docs                <- Sphinx HTML documentation
├── nlpretext           <- Main Package. This is where the code lives
│   ├── preprocessor.py <- Main preprocessing script
│   ├── augmentation    <- Text augmentation script
│   ├── basic           <- Basic text preprocessing 
│   ├── social          <- Social text preprocessing
│   ├── token           <- Token text preprocessing
│   ├── _config         <- Where the configuration and constants live
│   └── _utils          <- Where preprocessing utils scripts lives
├── tests               <- Where the tests lives
├── setup.py            <- makes project pip installable (pip install -e .) so the package can be imported
├── requirements.txt    <- The requirements file for reproducing the analysis environment, e.g.
│                          generated with `pip freeze > requirements.txt`
└── pylintrc            <- The linting configuration file
Comments
  • Bump actions/cache from 2.1.6 to 3.2.1

    Bump actions/cache from 2.1.6 to 3.2.1

    Bumps actions/cache from 2.1.6 to 3.2.1.

    Release notes

    Sourced from actions/cache's releases.

    v3.2.1

    What's Changed

    Full Changelog: https://github.com/actions/cache/compare/v3.2.0...v3.2.1

    v3.2.0

    What's Changed

    New Contributors

    Full Changelog: https://github.com/actions/cache/compare/v3...v3.2.0

    v3.2.0-beta.1

    What's Changed

    v3.1.0-beta.3

    What's Changed

    • Bug fixes for bsdtar fallback, if gnutar not available, and gzip fallback, if cache saved using old cache action, on windows.

    Full Changelog: https://github.com/actions/cache/compare/v3.1.0-beta.2...v3.1.0-beta.3

    ... (truncated)

    Changelog

    Sourced from actions/cache's changelog.

    3.2.1

    • Update @actions/cache on windows to use gnu tar and zstd by default and fallback to bsdtar and zstd if gnu tar is not available. (issue)
    • Added support for fallback to gzip to restore old caches on windows.
    • Added logs for cache version in case of a cache miss.
    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    draft dependencies github_actions 
    opened by dependabot[bot] 0
  • Bump python from 3.9.7-slim-buster to 3.11.1-slim-buster in /docker

    Bump python from 3.9.7-slim-buster to 3.11.1-slim-buster in /docker

    Bumps python from 3.9.7-slim-buster to 3.11.1-slim-buster.

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    draft docker dependencies 
    opened by dependabot[bot] 0
  • The current release is not functional as emoji lib has changed

    The current release is not functional as emoji lib has changed

    🐛 Bug Report

    🔬 How To Reproduce

    Steps to reproduce the behavior:

    1. install nlpretext from pip (1.1.0)
    2. run from nlpretext._config import constants

    Code sample

    Environment

    • OS: macOS Silicon
    • Python version: 3.7, 3.8, 3.9

    📈 Expected behavior

    EMOJI_PATTERN = _emoji.get_emoji_regexp()

    AttributeError: module 'emoji' has no attribute 'get_emoji_regexp'

    bug 
    opened by Guillaume6606 1
  • Bump release-drafter/release-drafter from 5.15.0 to 5.21.1

    Bump release-drafter/release-drafter from 5.15.0 to 5.21.1

    Bumps release-drafter/release-drafter from 5.15.0 to 5.21.1.

    Release notes

    Sourced from release-drafter/release-drafter's releases.

    v5.21.1

    What's Changed

    Dependency Updates

    Full Changelog: https://github.com/release-drafter/release-drafter/compare/v5.21.0...v5.21.1

    v5.21.0

    What's Changed

    New

    Full Changelog: https://github.com/release-drafter/release-drafter/compare/v5.20.1...v5.21.0

    v5.20.1

    What's Changed

    Bug Fixes

    Documentation

    Dependency Updates

    ... (truncated)

    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    draft dependencies github_actions 
    opened by dependabot[bot] 0
  • Bump cloudpickle from 2.0.0 to 2.2.0

    Bump cloudpickle from 2.0.0 to 2.2.0

    Bumps cloudpickle from 2.0.0 to 2.2.0.

    Changelog

    Sourced from cloudpickle's changelog.

    2.2.0

    2.1.0

    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    draft dependencies python 
    opened by dependabot[bot] 0
Releases(1.1.0)
Easy to use, state-of-the-art Neural Machine Translation for 100+ languages

EasyNMT - Easy to use, state-of-the-art Neural Machine Translation This package provides easy to use, state-of-the-art machine translation for more th

Ubiquitous Knowledge Processing Lab 748 Jan 06, 2023
Generate custom detailed survey paper with topic clustered sections and proper citations, from just a single query in just under 30 mins !!

Auto-Research A no-code utility to generate a detailed well-cited survey with topic clustered sections (draft paper format) and other interesting arti

Sidharth Pal 20 Dec 14, 2022
An official repository for tutorials of Probabilistic Modelling and Reasoning (2021/2022) - a University of Edinburgh master's course.

PMR computer tutorials on HMMs (2021-2022) This is a repository for computer tutorials of Probabilistic Modelling and Reasoning (2021/2022) - a Univer

Vaidotas Šimkus 10 Dec 06, 2022
KR-FinBert And KR-FinBert-SC

KR-FinBert & KR-FinBert-SC Much progress has been made in the NLP (Natural Language Processing) field, with numerous studies showing that domain adapt

5 Jul 29, 2022
Twitter-NLP-Analysis - Twitter Natural Language Processing Analysis

Twitter-NLP-Analysis Business Problem I got last @turk_politika 3000 tweets with

Çağrı Karadeniz 7 Mar 12, 2022
CoNLL-English NER Task (NER in English)

CoNLL-English NER Task en | ch Motivation Course Project review the pytorch framework and sequence-labeling task practice using the transformers of Hu

Kevin 2 Jan 14, 2022
👑 spaCy building blocks and visualizers for Streamlit apps

spacy-streamlit: spaCy building blocks for Streamlit apps This package contains utilities for visualizing spaCy models and building interactive spaCy-

Explosion 620 Dec 29, 2022
Chinese NER(Named Entity Recognition) using BERT(Softmax, CRF, Span)

Chinese NER(Named Entity Recognition) using BERT(Softmax, CRF, Span)

Weitang Liu 1.6k Jan 03, 2023
Weakly-supervised Text Classification Based on Keyword Graph

Weakly-supervised Text Classification Based on Keyword Graph How to run? Download data Our dataset follows previous works. For long texts, we follow C

Hello_World 20 Dec 29, 2022
OCR을 이용하여 인원수를 인식 후 줌을 Kill 해줍니다

How To Use killtheZoom-2.0 Windows 0. https://joyhong.tistory.com/79 이 글을 보면서 tesseract를 C:\Program Files\Tesseract-OCR 경로로 설치해주세요(한국어 언어 추가 필요) 상단의 초

김정인 9 Sep 13, 2021
PyTorch Language Model for 1-Billion Word (LM1B / GBW) Dataset

PyTorch Large-Scale Language Model A Large-Scale PyTorch Language Model trained on the 1-Billion Word (LM1B) / (GBW) dataset Latest Results 39.98 Perp

Ryan Spring 114 Nov 04, 2022
Mysticbbs-rjam - rJAM splitscreen message reader for MysticBBS A46+

rJAM splitscreen message reader for MysticBBS A46+

Robbert Langezaal 4 Nov 22, 2022
Python module (C extension and plain python) implementing Aho-Corasick algorithm

pyahocorasick pyahocorasick is a fast and memory efficient library for exact or approximate multi-pattern string search meaning that you can find mult

Wojciech Muła 763 Dec 27, 2022
Clone a voice in 5 seconds to generate arbitrary speech in real-time

This repository is forked from Real-Time-Voice-Cloning which only support English. English | 中文 Features 🌍 Chinese supported mandarin and tested with

Weijia Chen 25.6k Jan 06, 2023
The projects lets you extract glossary words and their definitions from a given piece of text automatically using NLP techniques

Unsupervised technique to Glossary and Definition Extraction Code Files GPT2-DefinitionModel.ipynb - GPT-2 model for definition generation. Data_Gener

Prakhar Mishra 28 May 25, 2021
LightSpeech: Lightweight and Fast Text to Speech with Neural Architecture Search

LightSpeech UnOfficial PyTorch implementation of LightSpeech: Lightweight and Fast Text to Speech with Neural Architecture Search.

Rishikesh (ऋषिकेश) 54 Dec 03, 2022
Gold standard corpus annotated with verb-preverb connections for Hungarian.

Hungarian Preverb Corpus A gold standard corpus manually annotated with verb-preverb connections for Hungarian. corpus The corpus consist of the follo

RIL Lexical Knowledge Representation Research Group 3 Jan 27, 2022
[Preprint] Escaping the Big Data Paradigm with Compact Transformers, 2021

Compact Transformers Preprint Link: Escaping the Big Data Paradigm with Compact Transformers By Ali Hassani[1]*, Steven Walton[1]*, Nikhil Shah[1], Ab

SHI Lab 367 Dec 31, 2022
State-of-the-art NLP through transformer models in a modular design and consistent APIs.

Trapper (Transformers wRAPPER) Trapper is an NLP library that aims to make it easier to train transformer based models on downstream tasks. It wraps h

Open Business Software Solutions 42 Sep 21, 2022
Easy-to-use CPM for Chinese text generation

CPM 项目描述 CPM(Chinese Pretrained Models)模型是北京智源人工智能研究院和清华大学发布的中文大规模预训练模型。官方发布了三种规模的模型,参数量分别为109M、334M、2.6B,用户需申请与通过审核,方可下载。 由于原项目需要考虑大模型的训练和使用,需要安装较为复杂

382 Jan 07, 2023