Biterm Topic Model (BTM): modeling topics in short texts

Last update: Dec 30, 2022

Overview

Biterm Topic Model

Bitermplus implements Biterm topic model for short texts introduced by Xiaohui Yan, Jiafeng Guo, Yanyan Lan, and Xueqi Cheng. Actually, it is a cythonized version of BTM. This package is also capable of computing perplexity and semantic coherence metrics.

Development

Please note that bitermplus is actively improved. Refer to documentation to stay up to date.

Requirements

cython
numpy
pandas
scipy
scikit-learn
tqdm

Setup

Linux and Windows

There should be no issues with installing bitermplus under these OSes. You can install the package directly from PyPi.

pip install bitermplus

Or from this repo:

pip install git+https://github.com/maximtrp/bitermplus.git

Mac OS

First, you need to install XCode CLT and Homebrew. Then, install libomp using brew:

xcode-select --install
brew install libomp
pip3 install bitermplus

Example

Model fitting

import bitermplus as btm
import numpy as np
import pandas as pd

# IMPORTING DATA
df = pd.read_csv(
    'dataset/SearchSnippets.txt.gz', header=None, names=['texts'])
texts = df['texts'].str.strip().tolist()

# PREPROCESSING
# Obtaining terms frequency in a sparse matrix and corpus vocabulary
X, vocabulary, vocab_dict = btm.get_words_freqs(texts)
tf = np.array(X.sum(axis=0)).ravel()
# Vectorizing documents
docs_vec = btm.get_vectorized_docs(texts, vocabulary)
docs_lens = list(map(len, docs_vec))
# Generating biterms
biterms = btm.get_biterms(docs_vec)

# INITIALIZING AND RUNNING MODEL
model = btm.BTM(
    X, vocabulary, seed=12321, T=8, M=20, alpha=50/8, beta=0.01)
model.fit(biterms, iterations=20)
p_zd = model.transform(docs_vec)

# METRICS
perplexity = btm.perplexity(model.matrix_topics_words_, p_zd, X, 8)
coherence = btm.coherence(model.matrix_topics_words_, X, M=20)
# or
perplexity = model.perplexity_
coherence = model.coherence_

Results visualization

You need to install tmplot first.

import tmplot as tmp
tmp.report(model=model, docs=texts)

Tutorial

There is a tutorial in documentation that covers the important steps of topic modeling (including stability measures and results visualization).

Comments

the topic distribution for all doc is similar

topic

[9.99998750e-01 3.12592152e-07 3.12592152e-07 3.12592152e-07 3.12592152e-07] [9.99999903e-01 2.43742411e-08 2.43742411e-08 2.43742411e-08 2.43742411e-08] [9.99999264e-01 1.83996702e-07 1.83996702e-07 1.83996702e-07 1.83996702e-07] [9.99998890e-01 2.77376339e-07 2.77376339e-07 2.77376339e-07 2.77376339e-07] [9.99999998e-01 3.94318712e-10 3.94318712e-10 3.94318712e-10 3.94318712e-10] [9.99998428e-01 3.92884503e-07 3.92884503e-07 3.92884503e-07 3.92884503e-07]
bug help wanted good first issue

opened by JennieGerhardt 11
ERROR: Failed building wheel for bitermplus

creating build/temp.macosx-10.9-universal2-cpython-310/src/bitermplus clang -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -arch arm64 -arch x86_64 -g -I/Library/Frameworks/Python.framework/Versions/3.10/include/python3.10 -c src/bitermplus/_btm.c -o build/temp.macosx-10.9-universal2-cpython-310/src/bitermplus/_btm.o -Xpreprocessor -fopenmp src/bitermplus/_btm.c:772:10: fatal error: 'omp.h' file not found #include <omp.h> ^~~~~~~ 1 error generated. error: command '/usr/bin/clang' failed with exit code 1 [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip. ERROR: Failed building wheel for bitermplus Failed to build bitermplus ERROR: Could not build wheels for bitermplus, which is required to install pyproject.toml-based projects
bug documentation

opened by QinrenK 9
Got an unexpected result in marked sample

Hi, @maximtrp, I am trying to use bitermplus for topic modeling. However, when i use the marked sample to train the model. i got the unexpeted result. Firstly, the marked samples contain 5 types, but trained model get a huge perlexity when the the number of topic is 5. Secondly, when i test the topic parameter from 1 to 20, the perplexity was reduced following the increase of topic number. my code is following: df = pd.read_csv('dataPretreatment/data/corpus.txt', header=None, names=['texts']) texts = df['texts'].str.strip().tolist() print(df) stop_words = segmentWord.stopwordslist() perplexitys = [] coherences = []

for T in range(1,21,1): print(T) X, vocabulary, vocab_dict = btm.get_words_freqs(texts, stop_words=stop_words) # Vectorizing documents docs_vec = btm.get_vectorized_docs(texts, vocabulary) # Generating biterms biterms = btm.get_biterms(docs_vec) # INITIALIZING AND RUNNING MODEL model = btm.BTM(X, vocabulary, seed=12321, T=T, M=50, alpha=50/T, beta=0.01) model.fit(biterms, iterations=2000) p_zd = model.transform(docs_vec) perplexity = btm.perplexity(model.matrix_topics_words_, p_zd, X, T) coherence = model.coherence_ perplexitys.append(perplexity) coherences.append(coherence)

``

opened by Chen-X666 7
Getting the error 'CountVectorizer' object has no attribute 'get_feature_names_out'

Hi @maximtrp, I am trying to use bitermplus for topic modeling. Running the code shows the error I mentioned in the title. Seems sth in get_words_freqs function goes wrong. I appreciate if you advise how I can fix that.

opened by Sajad7010 4

Cannot find Closest topics and Stable topics

Hello there, I am able to generate the model and visualize it. But when I tried to find the closest topics and stable topics, I get the error for code line:

closest_topics, dist = btm.get_closest_topics(*matrix_topic_words, top_words=139, verbose=True)

The error is:

IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed

This is despite me separately checking the array size and it is 2-D. I am pasting the code below. Pl. can you check if I am doing anything wrong.

Thank you.

X, vocabulary, vocab_dict = btm.get_words_freqs(clean_text, max_df=.85, min_df=15,ngram_range=(1,2))

# Vectorizing documents
docs_vec = btm.get_vectorized_docs(clean_text, vocabulary)

# Generating biterms
Y = X.todense()
biterms = btm.get_biterms(docs_vec, 15)

# INITIALIZING AND RUNNING MODEL
model = btm.BTM(X, vocabulary, T=8, M=10, alpha=500/1000, beta=0.01, win=15, has_background= True)
model.fit(biterms, iterations=500, verbose=True)
p_zd = model.transform(docs_vec,verbose=True)  
print(p_zd) 

# matrix of document-topics; topics vs. documents, topics vs. words probabilities 
matrix_docs_topics = model.matrix_docs_topics_    #Documents vs topics probabilities matrix.
topic_doc_matrix = model.matrix_topics_docs_      #Topics vs documents probabilities matrix.
matrix_topic_words = model.matrix_topics_words_   #Topics vs words probabilities matrix.

# Getting stable topics
print("Array Dimension = ",len(matrix_topic_words.shape))
closest_topics, dist = btm.get_closest_topics(*matrix_topic_words, top_words=100, verbose=True)
stable_topics, stable_kl = btm.get_stable_topics(closest_topics, thres=0.7)

# Stable topics indices list
print(stable_topics)

help wanted question

opened by RashmiBatra 4

Questions regarding Perplexity and Model Comparison with C++

I have two questions regarding this mode. First of all, I noticed that the evaluation metric perplexity was implemented. However, traditionally, the perplexity was mostly computed on the held-out dataset. Does that mean that when using this model, we should leave out certain proportion of the data and compute the perplexity on those samples that have not been used for training the model? My second question was that I was trying to compare this implementation with the C++ version from the original paper. The results (the top words in each topic) are quite different when the same parameters are used on the same corpus. Do you know what might be causing that and which part was implemented differently?
help wanted question

opened by orpheus92 3
How do I get the topic words?

Hi,

Firstly, thanks for sharing your code.

Not an issue, just a question. I'm able to see the relevant words for a topic in the tmplot report. How do I get those words? I need to get at least the most three relevant terms.

Thanks in advance.
question

opened by aguinaldoabbj 3

failed building wheels

Hi!

I've got an error when running pip3 install bitermplus on MacOS (intel-based, Ventura), using python 3.10.8 in a separate venv (not anaconda):

Building wheels for collected packages: bitermplus
  Building wheel for bitermplus (pyproject.toml) ... error
  error: subprocess-exited-with-error

  × Building wheel for bitermplus (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [34 lines of output]
      Error in sitecustomize; set PYTHONVERBOSE for traceback:
      AssertionError:
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build/lib.macosx-12-x86_64-cpython-310
      creating build/lib.macosx-12-x86_64-cpython-310/bitermplus
      copying src/bitermplus/__init__.py -> build/lib.macosx-12-x86_64-cpython-310/bitermplus
      copying src/bitermplus/_util.py -> build/lib.macosx-12-x86_64-cpython-310/bitermplus
      running egg_info
      writing src/bitermplus.egg-info/PKG-INFO
      writing dependency_links to src/bitermplus.egg-info/dependency_links.txt
      writing requirements to src/bitermplus.egg-info/requires.txt
      writing top-level names to src/bitermplus.egg-info/top_level.txt
      reading manifest file 'src/bitermplus.egg-info/SOURCES.txt'
      reading manifest template 'MANIFEST.in'
      adding license file 'LICENSE'
      writing manifest file 'src/bitermplus.egg-info/SOURCES.txt'
      copying src/bitermplus/_btm.c -> build/lib.macosx-12-x86_64-cpython-310/bitermplus
      copying src/bitermplus/_btm.pyx -> build/lib.macosx-12-x86_64-cpython-310/bitermplus
      copying src/bitermplus/_metrics.c -> build/lib.macosx-12-x86_64-cpython-310/bitermplus
      copying src/bitermplus/_metrics.pyx -> build/lib.macosx-12-x86_64-cpython-310/bitermplus
      running build_ext
      building 'bitermplus._btm' extension
      creating build/temp.macosx-12-x86_64-cpython-310
      creating build/temp.macosx-12-x86_64-cpython-310/src
      creating build/temp.macosx-12-x86_64-cpython-310/src/bitermplus
      clang -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX12.sdk -I/usr/local/opt/[email protected]/Frameworks/Python.framework/Versions/3.10/include/python3.10 -c src/bitermplus/_btm.c -o build/temp.macosx-12-x86_64-cpython-310/src/bitermplus/_btm.o -Xpreprocessor -fopenmp
      src/bitermplus/_btm.c:772:10: fatal error: 'omp.h' file not found
      #include <omp.h>
               ^~~~~~~
      1 error generated.
      error: command '/usr/bin/clang' failed with exit code 1
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for bitermplus
Failed to build bitermplus
ERROR: Could not build wheels for bitermplus, which is required to install pyproject.toml-based projects

Could this error be related to #29? I've tested on a PC and it worked though.

bug documentation

opened by alanmaehara 2

Failed building wheel for bitermplus

Could not build wheels for bitermplus, which is required to install pyproject.toml-based projects

When I try to install bitermplus with pip install bitermplus there is an error massage like this : note: This error originates from a subprocess, and is likely not a problem with pip. ERROR: Failed building wheel for bitermplus ERROR: Could not build wheels for bitermplus, which is required to install pyproject.toml-based projects
bug

opened by novra 2
Calculation of nmi,ami,ri

I'm trying to test the model and see if it matches the data labels, but I can't get the topic for each document. I'm trying to get the list of labels to apply nmi, ami and ri so I'm wondering how to get the labels from the model. @maximtrp

opened by gitassia 2
Implementation Guide

I was wondering is there any way to print the the topics generate by the BTM model, just like how I can do it with Gensim. In addition to that, I am getting all negative coherence values in the range of -500 or -600. I am not sure if I am doing something wrong. The issues is, I am not able to interpret the results, even plotting gives some strange output.

The following image show what is held by the variable adobe, again I am not sure if it needs to be in this manner or each row here needs to a list

opened by neel6762 2

Releases(v0.6.12)

v0.6.12(Mar 29, 2022)

This release contains some minor fixes and adds labels_ property to BTM model class (labels for the most probable topics for each of the documents). It also adds get_docs_top_topic method for creating DataFrames with documents and their labels.
Source code(tar.gz)
Source code(zip)
v0.6.11(Jan 8, 2022)

This release fixes the incompatibility error between bitermplus and scikit-learn.
Source code(tar.gz)
Source code(zip)
v0.6.10(Dec 16, 2021)

This release includes a number of minor fixes. Methods to select stable topics have been moved to tmplot package. Please see the updated tutorial in the documentation.
Source code(tar.gz)
Source code(zip)
v0.6.9(Aug 19, 2021)

This release introduces a function for Renyi entropy calculation (bitermplus.entropy) that can be used to estimate the optimal number of topics. For more details, read this paper.
Source code(tar.gz)
Source code(zip)
v0.6.8(Jul 23, 2021)

This release is an attempt to fix the issue with perplexity calculation yielding infinity values (#7).
Source code(tar.gz)
Source code(zip)
v0.6.7(Jul 1, 2021)
This release drops support for pyLDAvis in favor of tmplot that can be installed with pip (optional):

pip install tmplot
Source code(tar.gz)
Source code(zip)
v0.6.6(Jun 16, 2021)

This release exposes new model attributes: matrix_topics_docs_, matrix_words_topics_, and df_words_topics_ (words vs topics probabilities in a DataFrame).
Source code(tar.gz)
Source code(zip)
v0.6.5(Jun 11, 2021)

This release fixes a critical bug in the closest topics selection (get_closest_topics method).
Source code(tar.gz)
Source code(zip)
v0.6.4(Apr 18, 2021)

This release includes memory optimizations and new metrics for topics distance measuring (see get_closest_topics method).
Source code(tar.gz)
Source code(zip)
v0.6.3(Apr 7, 2021)

This release fixes a bug in transform method that occurred when empty documents were passed as inputs.
Source code(tar.gz)
Source code(zip)
v0.6.2(Apr 6, 2021)

This release fixes a bug in document vs topics matrix shape (reported in this issue).
Source code(tar.gz)
Source code(zip)
v0.6.1(Apr 5, 2021)

This is a minor release that fixes buffer types mismatch on creating biterms (critical bug that appeared under Windows).
Source code(tar.gz)
Source code(zip)
v0.6.0(Apr 4, 2021)
This is a major release that fixes critical bugs in arrays initialization. The previous versions of bitermplus are not recommended for use.

Changelog:

Arrays (n_bz, n_wz) are now properly initialized. This procedure was broken in the previous versions that led to biased results.

Data normalization (via _normalize hidden method) improved.

New NumPy random generators are used to initially assign topics to biterms.

Biterms (biterms_ model attribute) and topics probabilities (theta_ model attribute) are now available.

Biterms are now serialized as well when model is saved.

Source code(tar.gz)
Source code(zip)
v0.5.10(Mar 23, 2021)

This release improves model pickling and adds seed argument to fit() method of BTM class.
Source code(tar.gz)
Source code(zip)
v0.5.9(Mar 22, 2021)

In this release public extension attributes were converted to properties with comprehensible names and docstrings.
Source code(tar.gz)
Source code(zip)
v0.5.8(Mar 21, 2021)

This release fixed numerous bugs in the code of inference methods, optimizes memory usage, and covers most part of model fitting and inferring code with tests.
Source code(tar.gz)
Source code(zip)

Owner

Maksim Terpilowski

Research scientist

GitHub Repository https://bitermplus.readthedocs.io/en/stable/

Fastseq 基于ONNXRUNTIME的文本生成加速框架

9 Nov 09, 2021

Unofficial Python library for using the Polish Wordnet (plWordNet / Słowosieć)

Polish Wordnet Python library Simple, easy-to-use and reasonably fast library for using the Słowosieć (also known as PlWordNet) - a lexico-semantic da

12 Dec 23, 2022

open-information-extraction-system, build open-knowledge-graph(SPO, subject-predicate-object) by pyltp(version==3.4.0)

中文开放信息抽取系统, open-information-extraction-system, build open-knowledge-graph(SPO, subject-predicate-object) by pyltp(version==3.4.0)

7 Nov 02, 2022

CATs: Semantic Correspondence with Transformers

CATs: Semantic Correspondence with Transformers For more information, check out the paper on [arXiv]. Training with different backbones and evaluation

74 Dec 10, 2021

Silero Models: pre-trained speech-to-text, text-to-speech models and benchmarks made embarrassingly simple

3.2k Dec 31, 2022

PyTranslator é simultaneamente um editor e tradutor de texto com diversos recursos e interface feito com coração e 100% em Python

PyTranslator O Que é e para que serve o PyTranslator? PyTranslator é simultaneamente um editor e tradutor de texto em com interface gráfica que usa a

1 May 12, 2022

Basic Utilities for PyTorch Natural Language Processing (NLP)

Basic Utilities for PyTorch Natural Language Processing (NLP) PyTorch-NLP, or torchnlp for short, is a library of basic utilities for PyTorch NLP. tor

2.1k Jan 01, 2023

This is a project of data parallel that running on NLP tasks.

2 Dec 12, 2021

FedNLP: A Benchmarking Framework for Federated Learning in Natural Language Processing

FedNLP is a research-oriented benchmarking framework for advancing federated learning (FL) in natural language processing (NLP). It uses FedML repository as the git submodule. In other words, FedNLP

216 Nov 27, 2022

Voilà turns Jupyter notebooks into standalone web applications

Rendering of live Jupyter notebooks with interactive widgets. Introduction Voilà turns Jupyter notebooks into standalone web applications. Unlike the

4.5k Jan 03, 2023

Bpe algorithm can finetune tokenizer - Bpe algorithm can finetune tokenizer

"# bpe_algorithm_can_finetune_tokenizer" this is an implyment for https://github

1 Feb 02, 2022

Repository of the Code to Chatbots, developed in Python

Description In this repository you will find the Code to my Chatbots, developed in Python. I'll explain the structure of this Repository later. Requir

0 Oct 25, 2022

Extract city and country mentions from Text like GeoText without regex, but FlashText, a Aho-Corasick implementation.

flashgeotext ⚡ 🌍 Extract and count countries and cities (+their synonyms) from text, like GeoText on steroids using FlashText, a Aho-Corasick impleme

57 Dec 16, 2022

Text Classification in Turkish Texts with Bert

You can watch the details of the project on my youtube channel Project Interface Project Second Interface Goal= Correctly guessing the classification

42 Dec 31, 2022

Conditional probing: measuring usable information beyond a baseline

20 Dec 15, 2022

Hostapd-mac-tod-acl - Setup a hostapd AP with MAC ToD ACL

A brief explanation This script provides a quick way to setup a Time-of-day (Tod

2 Feb 03, 2022

Utilize Korean BERT model in sentence-transformers library

ko-sentence-transformers 이 프로젝트는 KoBERT 모델을 sentence-transformers 에서 보다 쉽게 사용하기 위해 만들어졌습니다. Ko-Sentence-BERT-SKTBERT 프로젝트에서는 KoBERT 모델을 sentence-trans

40 Dec 20, 2022

Simple Python script to scrape youtube channles of "Parity Technologies and Web3 Foundation" and translate them to well-known braille language or any language

Simple Python script to scrape youtube channles of "Parity Technologies and Web3 Foundation" and translate them to well-known braille language or any

1 Apr 28, 2022

Convolutional 2D Knowledge Graph Embeddings resources

ConvE Convolutional 2D Knowledge Graph Embeddings resources. Paper: Convolutional 2D Knowledge Graph Embeddings Used in the paper, but do not use thes

586 Dec 24, 2022

Deduplication is the task to combine different representations of the same real world entity.

Deduplication is the task to combine different representations of the same real world entity. This package implements deduplication using active learning. Active learning allows for rapid training wi

63 Nov 17, 2022

Biterm Topic Model (BTM): modeling topics in short texts

Related tags

Overview

Biterm Topic Model

Development

Requirements

Setup

Linux and Windows

Mac OS

Example

Model fitting

Results visualization

Tutorial

Comments

topic

Releases(v0.6.12)

v0.6.12(Mar 29, 2022)

v0.6.11(Jan 8, 2022)

v0.6.10(Dec 16, 2021)

v0.6.9(Aug 19, 2021)

v0.6.8(Jul 23, 2021)

v0.6.7(Jul 1, 2021)

v0.6.6(Jun 16, 2021)

v0.6.5(Jun 11, 2021)

v0.6.4(Apr 18, 2021)

v0.6.3(Apr 7, 2021)

v0.6.2(Apr 6, 2021)

v0.6.1(Apr 5, 2021)

v0.6.0(Apr 4, 2021)

v0.5.10(Mar 23, 2021)

v0.5.9(Mar 22, 2021)

v0.5.8(Mar 21, 2021)