Pipeline for training LSA models using Scikit-Learn.

Last update: Sep 05, 2022

Overview

Latent Semantic Analysis

Pipeline for training LSA models using Scikit-Learn.

Usage

Instead of writing custom code for latent semantic analysis, you only need to:

Install the pipeline:

pip install latent-semantic-analysis

Run the pipeline:

either in the terminal:

lsa-train --path_to_config config.yaml

or in Python:

import latent_semantic_analysis

latent_semantic_analysis.train(path_to_config="config.yaml")

NOTE: the config file is described in the Config section below.

No data preparation is needed: the only input is a CSV file with a raw-text column, which can have any name.
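As a minimal sketch of the expected input, the snippet below writes a small CSV with a single raw-text column using Python's standard csv module. The helper name write_corpus_csv and the sample sentences are illustrative assumptions and not part of the package. The default column name and separator mirror the values in the default config (text and ',').

```python
import csv
from pathlib import Path


def write_corpus_csv(path, texts, text_column="text", sep=","):
    """Write raw texts to a CSV file with one text column (hypothetical helper)."""
    path = Path(path)
    # Create the parent folder (e.g. "data/") if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter=sep)
        writer.writerow([text_column])  # header row holding the column name
        for text in texts:
            writer.writerow([text])  # one raw document per row
    return path
```

For example, write_corpus_csv("data/data.csv", ["The cat sat on the mat.", "Dogs and cats are pets."]) would create a file that matches the default data_path and text_column settings.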

Config

The user interface consists of only one file:

config.yaml - general configuration with sklearn TF-IDF and SVD parameters

Edit config.yaml to create the desired configuration, then train the LSA model with the following command:

terminal:

lsa-train --path_to_config config.yaml

python:

import latent_semantic_analysis

latent_semantic_analysis.train(path_to_config="config.yaml")

Default config.yaml:

seed: 42
path_to_save_folder: models

# data
data:
  data_path: data/data.csv
  sep: ','
  text_column: text

# tf-idf
tf-idf:
  lowercase: true
  ngram_range: (1, 1)
  max_df: 1.0
  min_df: 1

# svd
svd:
  n_components: 10
  algorithm: arpack

NOTE: the tf-idf and svd sections hold the parameters of sklearn's TfidfVectorizer and TruncatedSVD, respectively, so you can set any parameter that these classes accept.
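To illustrate how such a config could map onto scikit-learn objects, here is a minimal sketch. It is not the package's actual implementation: the helper build_lsa_pipeline, the step names, the toy corpus, and the use of seed as random_state are all assumptions made for this example. It also assumes that a value such as "(1, 1)", which YAML reads as a plain string, is converted to a tuple before it reaches TfidfVectorizer.

```python
import ast

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Values mirror the default config; "(1, 1)" is kept as a string,
# which is how a YAML parser would deliver it.
config = {
    "seed": 42,
    "tf-idf": {"lowercase": True, "ngram_range": "(1, 1)", "max_df": 1.0, "min_df": 1},
    # n_components is 2 here (not 10) because the toy corpus is tiny.
    "svd": {"n_components": 2, "algorithm": "arpack"},
}


def build_lsa_pipeline(config):
    """Build a TF-IDF + SVD pipeline from a config dict (illustrative helper)."""
    tfidf_params = dict(config["tf-idf"])
    # Assumption: string-encoded tuples are parsed before being passed to sklearn.
    if isinstance(tfidf_params.get("ngram_range"), str):
        tfidf_params["ngram_range"] = ast.literal_eval(tfidf_params["ngram_range"])
    return Pipeline([
        ("tf-idf", TfidfVectorizer(**tfidf_params)),
        ("svd", TruncatedSVD(random_state=config["seed"], **config["svd"])),
    ])


texts = [
    "the cat sat on the mat",
    "dogs and cats are popular pets",
    "stock markets fell sharply today",
    "investors sold shares as markets dropped",
]
pipeline = build_lsa_pipeline(config)
# Each row is one document projected onto the SVD components.
doc_embeddings = pipeline.fit_transform(texts)
```

Any other TfidfVectorizer or TruncatedSVD keyword (for example stop_words or n_iter) could be added to the corresponding dict in the same way.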

Output

After training, the pipeline produces the following files:

model.joblib - sklearn pipeline with LSA (TF-IDF and SVD steps)
config.yaml - config that was used to train the model
logging.txt - logging file
doc2topic.json - document embeddings
term2topic.json - term embeddings
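Since model.joblib is described as an sklearn pipeline, it can presumably be loaded with joblib and applied to new texts. The sketch below assumes exactly that; the helper name embed_texts is hypothetical, and the layout of the JSON outputs is not covered here because it is not documented in this README.

```python
import joblib


def embed_texts(model_path, texts):
    """Load a saved sklearn pipeline and project raw texts into topic space."""
    pipeline = joblib.load(model_path)
    # transform() applies the fitted TF-IDF and SVD steps in order.
    return pipeline.transform(texts)
```

A call such as embed_texts("models/model.joblib", ["some new document"]) would then return one embedding row per input text; the path shown assumes the default path_to_save_folder and may differ in practice.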

Requirements

Python >= 3.6

Citation

If you use latent-semantic-analysis in a scientific publication, we would appreciate a reference to the following BibTeX entry:

@misc{dayyass2021lsa,
    author       = {El-Ayyass, Dani},
    title        = {Pipeline for training LSA models},
    howpublished = {\url{https://github.com/dayyass/latent-semantic-analysis}},
    year         = {2021}
}

Releases (v0.1.0)

v0.1.0 (Oct 8, 2021)

First Release! 🥳🎉🍾

Owner

Dani El-Ayyass
