Pipeline for quickly building TF-IDF + LogReg text classification baselines.

Last update: Dec 07, 2022

Overview

Text Classification Baseline

Usage

Instead of writing custom code for a specific text classification task, you just need to:

install the pipeline:

pip install text-classification-baseline

run the pipeline:

either in terminal:

text-clf-train

or in python:

import text_clf

text_clf.train()

No data preparation is needed: the only input is a CSV file with two raw columns (with arbitrary names):

text
target

NOTE: the target can be in any format, including text; it does not have to be integers from 0 to n_classes-1.
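
As a minimal sketch of the expected input, the snippet below writes a two-column CSV using only the standard library and reads it back. The file name, the column names (text, target), and the sample rows are illustrative assumptions; whatever names you choose would need to match text_column and target_column in the config.

```python
import csv
import os
import tempfile

# Hypothetical sample rows: raw text plus a target in any format (here, strings).
rows = [
    ("the battery lasts all day", "positive"),
    ("screen cracked after a week", "negative"),
    ("arrived on time, works fine", "positive"),
]


def write_dataset(path, rows, text_column="text", target_column="target"):
    """Write (text, target) pairs to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow([text_column, target_column])
        writer.writerows(rows)


def read_dataset(path):
    """Read the CSV back as a list of dicts keyed by column name."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


# Write to a temporary folder so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    train_path = os.path.join(tmp_dir, "train.csv")
    write_dataset(train_path, rows)
    loaded = read_dataset(train_path)

print(loaded[0])
```

Using the csv module rather than manual string joining keeps texts that contain the separator (such as the third row) correctly quoted.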

Config

The user interface consists of a single file, config.yaml.

Edit config.yaml to create the desired configuration, then train a text classification model with the following command:

terminal:

text-clf-train --path_to_config config.yaml

python:

import text_clf

text_clf.train(path_to_config="config.yaml")

Default config.yaml:

seed: 42
verbose: true
path_to_save_folder: models

# data
data:
  train_data_path: data/train.csv
  valid_data_path: data/valid.csv
  sep: ','
  text_column: text
  target_column: target_name_short

# tf-idf
tf-idf:
  lowercase: true
  ngram_range: (1, 1)
  max_df: 1.0
  min_df: 0.0

# logreg
logreg:
  penalty: l2
  C: 1.0
  class_weight: balanced
  solver: saga
  multi_class: auto
  n_jobs: -1

NOTE: the tf-idf and logreg sections hold parameters for sklearn's TfidfVectorizer and LogisticRegression, respectively, so you can configure instances of these classes however you want.
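
One detail worth noting: YAML has no tuple type, so a value such as ngram_range: (1, 1) is read as the string "(1, 1)" by a typical YAML loader and has to be converted before it reaches TfidfVectorizer. The sketch below shows one way such a conversion could be done. The helper name prepare_tfidf_params and the use of ast.literal_eval are assumptions for illustration, not necessarily how text-classification-baseline handles it.

```python
import ast

# The tf-idf section as it would look after loading the default config.yaml
# (YAML has no tuple type, so ngram_range arrives as a string).
tfidf_config = {
    "lowercase": True,
    "ngram_range": "(1, 1)",
    "max_df": 1.0,
    "min_df": 0.0,
}


def prepare_tfidf_params(config):
    """Return a copy of the config with ngram_range parsed into a tuple of ints.

    Hypothetical helper, not part of the text_clf package.
    """
    params = dict(config)
    ngram_range = params.get("ngram_range")
    if isinstance(ngram_range, str):
        # literal_eval parses Python literals only; unlike eval it runs no code.
        ngram_range = ast.literal_eval(ngram_range)
    if ngram_range is not None:
        params["ngram_range"] = tuple(int(n) for n in ngram_range)
    return params


tfidf_params = prepare_tfidf_params(tfidf_config)
print(tfidf_params["ngram_range"])
```

The resulting dictionary could then be unpacked into the vectorizer, for example TfidfVectorizer(**tfidf_params), and the logreg section into LogisticRegression in the same way.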

Output

After training the model, the pipeline saves the following files:

model.joblib - sklearn pipeline with TF-IDF and LogReg steps
target_names.json - mapping from encoded target labels (0 to n_classes-1) to their names
config.yaml - config that was used to train the model
logging.txt - logging file
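
A sketch of how these artifacts might be used at inference time. It assumes target_names.json stores a JSON object whose keys are the encoded labels (JSON keys are always strings) and whose values are the original target names; that layout is an assumption and worth checking against a real file. The helper names and the stand-in file written below are hypothetical.

```python
import json
import os
import tempfile


def load_target_names(path):
    """Load the label mapping and convert JSON string keys back to ints."""
    with open(path, encoding="utf-8") as f:
        raw = json.load(f)
    return {int(label): name for label, name in raw.items()}


def decode_predictions(encoded, target_names):
    """Map encoded predictions (0..n_classes-1) to their original names."""
    return [target_names[int(label)] for label in encoded]


# Stand-in for a real target_names.json (assumed layout, made-up class names).
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "target_names.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"0": "negative", "1": "positive"}, f)
    target_names = load_target_names(path)

decoded = decode_predictions([1, 0, 1], target_names)
print(decoded)
```

With a trained model, the encoded labels would come from something like joblib.load(...).predict(texts) on the saved model.joblib, where the path depends on the save folder.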

Requirements

Python >= 3.6

Citation

If you use text-classification-baseline in a scientific publication, we would appreciate references to the following BibTeX entry:

@misc{dayyass2021textclf,
    author       = {El-Ayyass, Dani},
    title        = {Pipeline for training text classification baselines},
    howpublished = {\url{https://github.com/dayyass/text-classification-baseline}},
    year         = {2021}
}

Releases

v0.1.6 (Nov 6, 2021)

fixed token frequency support (#85)

fixed threshold selection for binary classification (#86)

v0.1.5 (Oct 21, 2021)

added pymorphy2 lemmatization (#81)

added token frequency support (#85)

added threshold selection for binary classification (#86)

added arbitrary save folder name (#83)

pymorphy2 lemmatization (config.yaml)

# preprocessing
# (included in resulting model pipeline, so preserved for inference)
preprocessing:
  lemmatization: pymorphy2

token frequency support

text_clf.token_frequency.get_token_frequency(path_to_config) - get token frequency of train dataset according to the config file parameters
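
For intuition, token frequency over a training set can be approximated with a simple counter. This sketch is not the library's implementation: the whitespace tokenizer, the lowercasing, and the sample texts are assumptions, whereas get_token_frequency works from the config file parameters.

```python
from collections import Counter


def token_frequency(texts, lowercase=True):
    """Count how often each whitespace-separated token occurs across texts."""
    counter = Counter()
    for text in texts:
        if lowercase:
            text = text.lower()
        counter.update(text.split())
    return counter


# Made-up training texts for illustration.
texts = ["Good phone", "good battery", "bad screen"]
freq = token_frequency(texts)
print(freq.most_common(1))
```

Frequencies like these can help when choosing min_df and max_df cut-offs for the tf-idf section of the config.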

threshold selection for binary classification

text_clf.pr_roc_curve.get_precision_recall_curve(path_to_model_folder) - get precision and recall metrics for precision-recall curve

text_clf.pr_roc_curve.get_roc_curve(path_to_model_folder) - get false positive rate (fpr) and true positive rate (tpr) metrics for roc curve

text_clf.pr_roc_curve.plot_precision_recall_curve(precision, recall) - plot precision-recall curve

text_clf.pr_roc_curve.plot_roc_curve(fpr, tpr) - plot roc curve

text_clf.pr_roc_curve.plot_precision_recall_f1_curves_for_thresholds(precision, recall, thresholds) - plot precision, recall, f1-score curves for probability thresholds
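
The idea behind threshold selection can be sketched in plain Python: for each candidate threshold, turn positive-class probabilities into labels, compute precision, recall, and F1, and keep the threshold with the best F1. The function names, the F1 criterion, and the toy data are assumptions for illustration; the library functions above expose the curves rather than this exact procedure.

```python
def precision_recall_f1(y_true, y_prob, threshold):
    """Compute precision, recall and F1 for labels obtained at a threshold."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


def select_threshold(y_true, y_prob, thresholds):
    """Return the (threshold, f1) pair with the highest F1; ties keep the first."""
    best_threshold, best_f1 = None, -1.0
    for threshold in thresholds:
        _, _, f1 = precision_recall_f1(y_true, y_prob, threshold)
        if f1 > best_f1:
            best_threshold, best_f1 = threshold, f1
    return best_threshold, best_f1


# Toy ground truth and predicted positive-class probabilities.
y_true = [0, 0, 1, 1, 1, 0]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.7, 0.6]
thresholds = [0.3, 0.5, 0.65]

best_threshold, best_f1 = select_threshold(y_true, y_prob, thresholds)
print(best_threshold, round(best_f1, 3))
```

On this toy data the highest threshold wins because it removes the false positives while keeping two of the three true positives; on real data the trade-off depends on which error type is more costly.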

arbitrary save folder name (config.yaml)

experiment_name: model

v0.1.4 (Oct 10, 2021)
fixed load_20newsgroups.py (#65 #71)

added Makefile (#71)

added logging confusion matrix (#72)

replaced all "valid" occurrences with "test" (#74)

updated docstrings (#77)

changed python interface - train function returns model and target_names_mapping (#78)

v0.1.3 (Sep 2, 2021)
added hyper-parameters tuning (#58)

v0.1.2 (Aug 19, 2021)
fixed bug with multiple logging (#55)

v0.1.1 (Aug 11, 2021)
added logging (#43)

added unittests (#49)

added CI with linter, tests, codecov (#46 #49)

added docker (#48)

v0.1.0 (Aug 7, 2021)

First release.

Owner

Dani El-Ayyass

NLP Tech Lead @ Sber AI, Master Student in Applied Mathematics and Computer Science @ CMC MSU

GitHub repository: https://github.com/dayyass/text-classification-baseline
PyPI: https://pypi.org/project/text-classification-baseline/
