File-based TF-IDF: Calculates keywords in a document, using a word corpus.

Related tags

Text Data & NLPtf-idf
Overview

File-based TF-IDF

Calculates keywords in a document, using a word corpus.

Why?

Because I found myself with hundreds of plain text files, with no way to know what each one contains. I then recalled this thing called TF-IDF from university, but found no utility that operates on files. Hence, here we are.

How?

Basically, each word in the current document gets a score. The score increases each time the word it appears in this document, and decreases each time it appears in another document. The words with the highest scores will thus (theoretically) be the keywords.

Of course, this requires you to have many other documents (the corpus) to compare with. They should contain approximately the same language. For example, it makes sense to split chapters in a book and use those as the corpus. Use your senses.

Installation

Copy tfidf.py to some location on $PATH

Usage

usage: tfidf [-h] [--json] [--min-df MIN_DF] [-n N | --all] --input-document INPUT_DOCUMENT [corpus ...]

Calculates keywords in a document, using a word corpus.

positional arguments:
  corpus                corpus files (optional but highly reccommended)

options:
  -h, --help            show this help message and exit
  --json, -j            get output as json
  --min-df MIN_DF       if a word occurs less than this number of times in the corpus, it's not considered (default: 2)
  -n N                  limit output to this many words (default: 10)
  --all                 Don't limit the amount of words to output (default: false)
  --input-document INPUT_DOCUMENT, -i INPUT_DOCUMENT
                        document file to extract keywords from

Examples

To get the top 10 keywords for chapter 1 of Moby Dick:

# assume that *.txt matches all other chapters of mobydick
$ tfidf -n 10 -i mobydick_chapter1.txt *.txt

WORD             TF_IDF           TF               
passenger        0.003            0.002            
whenever         0.003            0.002            
money            0.003            0.002            
passengers       0.002            0.001            
purse            0.002            0.001            
me               0.002            0.011            
image            0.002            0.001            
hunks            0.002            0.001            
respectfully     0.002            0.001            
robust           0.002            0.001            
-----
num words in corpus: 208425
$ tfidf --all -j -i mobydick_chapter1.txt *.txt
[
    {
        "word": "lazarus",
        "tf_idf": 0.0052818627137794375,
        "tf": 0.0028169014084507044
    },
    {
        "word": "frost",
        "tf_idf": 0.004433890895007659,
        "tf": 0.0028169014084507044
    },
    {
        "word": "bedford",
        "tf_idf": 0.0037492766733561254,
        "tf": 0.0028169014084507044
    },
    ...
]

TF-IDF equations

t — term (word)
d — document (set of words)
corpus — (set of documents)
N — number of documents in corpus

tf(t,d) = count of t in d / number of words in d
df(t) = occurrence of t in N documents
idf(t) = N/df(t)

tf_idf(t, d) = tf(t, d) * idf(t)
Owner
Jakob Lindskog
Jakob Lindskog
Finding Label and Model Errors in Perception Data With Learned Observation Assertions

Finding Label and Model Errors in Perception Data With Learned Observation Assertions This is the project page for Finding Label and Model Errors in P

Stanford Future Data Systems 17 Oct 14, 2022
Sentello is python script that simulates the anti-evasion and anti-analysis techniques used by malware.

sentello Sentello is a python script that simulates the anti-evasion and anti-analysis techniques used by malware. For techniques that are difficult t

Malwation 62 Oct 02, 2022
An end to end ASR Transformer model training repo

END TO END ASR TRANSFORMER 本项目基于transformer 6*encoder+6*decoder的基本结构构造的端到端的语音识别系统 Model Instructions 1.数据准备: 自行下载数据,遵循文件结构如下: ├── data │ ├── train │

旷视天元 MegEngine 10 Jul 19, 2022
Collection of useful (to me) python scripts for interacting with napari

Napari scripts A collection of napari related tools in various state of disrepair/functionality. Browse_LIF_widget.py This module can be imported, for

5 Aug 15, 2022
History Aware Multimodal Transformer for Vision-and-Language Navigation

History Aware Multimodal Transformer for Vision-and-Language Navigation This repository is the official implementation of History Aware Multimodal Tra

Shizhe Chen 46 Nov 23, 2022
DELTA is a deep learning based natural language and speech processing platform.

DELTA - A DEep learning Language Technology plAtform What is DELTA? DELTA is a deep learning based end-to-end natural language and speech processing p

DELTA 1.5k Dec 26, 2022
Deduplication is the task to combine different representations of the same real world entity.

Deduplication is the task to combine different representations of the same real world entity. This package implements deduplication using active learning. Active learning allows for rapid training wi

63 Nov 17, 2022
Extracting Summary Knowledge Graphs from Long Documents

GraphSum This repo contains the data and code for the G2G model in the paper: Extracting Summary Knowledge Graphs from Long Documents. The other basel

Zeqiu (Ellen) Wu 10 Oct 21, 2022
Tensorflow Implementation of A Generative Flow for Text-to-Speech via Monotonic Alignment Search

Tensorflow Implementation of A Generative Flow for Text-to-Speech via Monotonic Alignment Search

Ankur Dhuriya 10 Oct 13, 2022
Pytorch-Named-Entity-Recognition-with-BERT

BERT NER Use google BERT to do CoNLL-2003 NER ! Train model using Python and Inference using C++ ALBERT-TF2.0 BERT-NER-TENSORFLOW-2.0 BERT-SQuAD Requi

Kamal Raj 1.1k Dec 25, 2022
multi-label,classifier,text classification,多标签文本分类,文本分类,BERT,ALBERT,multi-label-classification,seq2seq,attention,beam search

multi-label,classifier,text classification,多标签文本分类,文本分类,BERT,ALBERT,multi-label-classification,seq2seq,attention,beam search

hellonlp 30 Dec 12, 2022
Sequence-to-Sequence learning using PyTorch

Seq2Seq in PyTorch This is a complete suite for training sequence-to-sequence models in PyTorch. It consists of several models and code to both train

Elad Hoffer 514 Nov 17, 2022
Resources for "Natural Language Processing" Coursera course.

Natural Language Processing course resources This github contains practical assignments for Natural Language Processing course by Higher School of Eco

Advanced Machine Learning specialisation by HSE 1.1k Jan 01, 2023
Official implementation of Meta-StyleSpeech and StyleSpeech

Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation Dongchan Min, Dong Bok Lee, Eunho Yang, and Sung Ju Hwang This is an official code

min95 169 Jan 05, 2023
A PyTorch-based model pruning toolkit for pre-trained language models

English | 中文说明 TextPruner是一个为预训练语言模型设计的模型裁剪工具包,通过轻量、快速的裁剪方法对模型进行结构化剪枝,从而实现压缩模型体积、提升模型速度。 其他相关资源: 知识蒸馏工具TextBrewer:https://github.com/airaria/TextBrewe

Ziqing Yang 231 Jan 08, 2023
A simple visual front end to the Maya UE4 RBF plugin delivered with MetaHumans

poseWrangler Overview PoseWrangler is a simple UI to create and edit pose-driven relationships in Maya using the MayaUE4RBF plugin. This plugin is dis

Christopher Evans 105 Dec 18, 2022
Predict an emoji that is associated with a text

Sentiment Analysis Sentiment analysis in computational linguistics is a general term for techniques that quantify sentiment or mood in a text. Can you

Tetsumichi(Telly) Umada 30 Sep 07, 2022
Python library to make development of portfolio analysis faster and easier

Trafalgar Python library to make development of portfolio analysis faster and easier Installation 🔥 For the moment, Trafalgar is still in beta develo

Santosh Passoubady 641 Jan 01, 2023
[ICCV 2021] Counterfactual Attention Learning for Fine-Grained Visual Categorization and Re-identification

Counterfactual Attention Learning Created by Yongming Rao*, Guangyi Chen*, Jiwen Lu, Jie Zhou This repository contains PyTorch implementation for ICCV

Yongming Rao 89 Dec 18, 2022
Code for the project carried out fulfilling the course requirements for Fall 2021 NLP at NYU

Introduction Fairseq(-py) is a sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization,

Sai Himal Allu 1 Apr 25, 2022