Textpipe: clean and extract metadata from text

Overview

textpipe is a Python package for converting raw text into clean, readable text and extracting metadata from it. It removes HTML tags to produce readable text and extracts metadata such as the number of words and the named entities in the text.

Vision: the zen of textpipe

  • Designed for use in production pipelines without adult supervision.
  • Rechargeable batteries included: provide sane defaults and clear examples to adapt.
  • A uniform interface with thin wrappers around state-of-the-art NLP packages.
  • As language-agnostic as possible.
  • Bring your own models.

Features

  • Clean raw text by removing HTML and other unreadable constructs
  • Identify the language of text
  • Extract the number of words, the number of sentences, and the named entities from a text
  • Calculate the complexity of a text
  • Obtain text metadata by specifying a pipeline containing all desired elements
  • Obtain sentiment (polarity and subjectivity scores)
  • Generate word counts
  • Compute a MinHash for cheap similarity estimation of documents
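
To make the last feature concrete, here is a minimal standard-library sketch of how MinHash estimates document similarity. It is not textpipe's implementation; the function names (`shingles`, `minhash_signature`, `estimate_jaccard`), the 3-word shingle size, and the 64 hash functions are all assumptions chosen for illustration.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (hypothetical helper)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(items, num_perm=64):
    """Keep the minimum hash value of the set under each of num_perm salted hashes."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "big")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots approximates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "an entirely different sentence about text processing pipelines"

sig_a = minhash_signature(shingles(doc_a))
sig_b = minhash_signature(shingles(doc_b))
same = estimate_jaccard(sig_a, sig_a)
different = estimate_jaccard(sig_a, sig_b)
```

Comparing two fixed-length signatures is much cheaper than comparing full shingle sets, which is why the estimate is useful for large document collections.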

Installation

It is recommended that you install textpipe in a virtual environment. Create one with any of the following tools:

  • Using venv.
python3 -m venv .venv
  • Using virtualenv.
virtualenv venv -p python3.6
  • Using virtualenvwrapper.
mkvirtualenv textpipe -p python3.6

Then install the package:

  • Install textpipe using pip.
pip install textpipe
  • Install the required packages using requirements.txt.
pip install -r requirements.txt

A note on the spaCy model download requirement

The requirements.txt file that comes with the package calls for spaCy's en_core_web_sm model, but you can change this to whichever model and language your intended use requires. See the models page on spacy.io for more information.

Usage example

>>> from textpipe import doc, pipeline
>>> sample_text = 'Sample text! <!DOCTYPE>'
>>> document = doc.Doc(sample_text)
>>> print(document.clean)
Sample text!
>>> print(document.language)
en
>>> print(document.nwords)
2

>>> pipe = pipeline.Pipeline(['CleanText', 'NWords'])
>>> print(pipe(sample_text))
{'CleanText': 'Sample text!', 'NWords': 2}

To extend the existing textpipe operations with your own custom operations:

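from textpipe import pipeline

test_pipe = pipeline.Pipeline(['CleanText', 'NWords'])

# A custom operation receives the document plus optional context and settings.
def custom_op(doc, context=None, settings=None, **kwargs):
    return 1

# Register the operation under a name, then append it as a step with its arguments.
custom_argument = {'argument': 1}
test_pipe.register_operation('CUSTOM_STEP', custom_op)
test_pipe.steps.append(('CUSTOM_STEP', custom_argument))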
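
The registration pattern above can be illustrated with a small toy pipeline. This is a sketch only, not textpipe's code: `ToyPipeline`, its dict-based registry, and the `count_char` operation are hypothetical, and operations here receive a plain string rather than a textpipe document.

```python
class ToyPipeline:
    """Toy stand-in for a pipeline with a name-to-function operation registry."""

    def __init__(self, steps):
        # Built-in operations, keyed by name (assumed design, for illustration).
        self.registry = {
            "NWords": lambda text, context=None, settings=None, **kwargs: len(text.split()),
        }
        # Normalise each step to a (name, settings) pair.
        self.steps = [(s, {}) if isinstance(s, str) else s for s in steps]
        # Data shared by all operations during a run.
        self.context = {}

    def register_operation(self, name, func):
        """Make a custom operation available under the given name."""
        self.registry[name] = func

    def __call__(self, text):
        results = {}
        for name, settings in self.steps:
            if name not in self.registry:
                raise KeyError(f"Unknown operation: {name}")
            operation = self.registry[name]
            results[name] = operation(text, context=self.context, settings=settings)
        return results


def count_char(text, context=None, settings=None, **kwargs):
    """Hypothetical custom operation: count occurrences of a configured character."""
    return text.count(settings["char"])


pipe = ToyPipeline(["NWords"])
pipe.register_operation("CountChar", count_char)
pipe.steps.append(("CountChar", {"char": "t"}))
result = pipe("Sample text!")
print(result)  # {'NWords': 2, 'CountChar': 2}
```

Keeping the registry separate from the step list means an operation can be registered once and then reused in several steps with different settings.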

Contributing

See CONTRIBUTING for guidelines for contributors.

Changes

0.12.1

  • Bumps redis, tqdm, pylint

0.12.0

  • Bumps versions of many dependencies including textacy. Results for keyterm extraction changed.

0.11.9

  • Exposes arbitrary spaCy ents properties

0.11.8

  • Exposes spaCy's cats attribute

0.11.7

  • Bumps spaCy and redis versions

0.11.6

  • Fixes bug where gensim model is not cached in pipeline

0.11.5

  • Raise TextpipeMissingModelException instead of KeyError

0.11.4

  • Bumps spaCy and datasketch dependencies

0.11.1

  • Replaces codacy with pylint on CI
  • Fixes pylint issues

0.11.0

  • Adds wrapper around Gensim keyed vectors to construct document embeddings from Redis cache

0.9.0

  • Adds functionality to compute document embeddings using a Gensim word2vec model

0.8.6

  • Removes non-standard UTF characters before detecting language

0.8.5

  • Bumps spaCy to 2.1.3

0.8.4

  • Fix broken install command

0.8.3

  • Fix broken install command

0.8.2

  • Fix copy-paste error in word vector aggregation (#118)

0.8.1

  • Fixes bugs in several operations that didn't accept kwargs

0.8.0

  • Bumps spaCy to 2.1

0.7.2

  • Pins spaCy and Pattern versions (with pinned lxml)

0.7.0

  • Changes the operation registry from a list to a dict
  • Makes global pipeline data available across operations via the context kwarg
  • Loads custom operations using register_operation in pipeline
  • Supports custom steps (operations) with arguments