A simple implementation of an N-gram language model.

Last update: Nov 24, 2021

Related tags

Text Data & NLP n-gram

Overview

About

A simple implementation of an N-gram language model.

Requirements

numpy

Data preparation

Corpus

The training data for the N-gram model is a text file in which each line is one piece of training text, for example:

曼联加油
懂球直播
有也免费高清的额
直播挺全的
曼联这局肯定胜利

During training, each line is split into tokens by a delimiter. If no delimiter is given (the default), each line is split into individual characters.
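The splitting rule described above can be sketched as follows. This is a minimal illustration, not the repository's code: the function name split_line and its delimiter parameter are hypothetical, and dropping empty pieces is an assumption.

```python
from typing import List, Optional


def split_line(line: str, delimiter: Optional[str] = None) -> List[str]:
    """Split one corpus line into tokens (hypothetical helper).

    With no delimiter, every character becomes a token. Otherwise the line
    is split on the delimiter and empty pieces are dropped (an assumption).
    """
    line = line.strip()
    if not delimiter:
        return list(line)
    return [piece for piece in line.split(delimiter) if piece]


# Character-level split (the default behavior described above).
chars = split_line("曼联加油")

# Delimiter-based split, here using a space as the delimiter.
words = split_line("曼联 加油", delimiter=" ")
```

With character-level splitting, the token file only needs to list single characters, which matches the sample token file in the next section.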

Tokens

The vocabulary for the model is a text file with one token per line. Each token appears only once in the file.

光
衰
戒
颅
阖
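One way to read such a token file and enforce the uniqueness requirement is sketched below. The helper name build_vocab is hypothetical, and mapping each token to its line index is an assumption about how a vocabulary might be represented, not a description of the repository's loader.

```python
from typing import Dict, Iterable


def build_vocab(lines: Iterable[str]) -> Dict[str, int]:
    """Map each token to an integer id, in file order (hypothetical helper)."""
    vocab: Dict[str, int] = {}
    for raw in lines:
        token = raw.rstrip("\n")
        if not token:
            continue  # skip blank lines
        if token in vocab:
            # The token file is expected to contain each token only once.
            raise ValueError(f"duplicate token: {token!r}")
        vocab[token] = len(vocab)
    return vocab


# In practice the lines would come from the token file, for example:
#   with open("data/charset.txt", encoding="utf-8") as f:
#       vocab = build_vocab(f)
vocab = build_vocab(["光\n", "衰\n", "戒\n"])
```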

Training

Run the script train_n_gram.py to train an N-gram model.

python train_n_gram.py --corpus_path data/tieba.dialogues --token_path data/charset.txt --model_path data/2-gram.model --n 2
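As a rough illustration of what training an N-gram model involves, the sketch below counts how often each token follows each (n-1)-token context. It is not the code in train_n_gram.py: the function name count_ngrams is hypothetical, and the name head2tail is borrowed from the model info printed in the Testing section, with its structure assumed here. The sketch uses only the standard library.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def count_ngrams(
    sentences: Iterable[List[str]], n: int
) -> Dict[Tuple[str, ...], Counter]:
    """Count tails for every head of length n-1 (hypothetical helper)."""
    head2tail: Dict[Tuple[str, ...], Counter] = defaultdict(Counter)
    for tokens in sentences:
        # Slide a window of n tokens over the sentence.
        for i in range(len(tokens) - n + 1):
            head = tuple(tokens[i : i + n - 1])  # the first n-1 tokens
            tail = tokens[i + n - 1]             # the token that follows
            head2tail[head][tail] += 1
    return dict(head2tail)


# Two lines from the sample corpus, split into characters, with n = 2.
counts = count_ngrams([list("曼联加油"), list("曼联这局肯定胜利")], n=2)
```

In this toy example the head ("曼",) is followed by "联" in both lines, so its count is 2, while ("联",) is followed once by "加" and once by "这".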

Testing

Run the script test_n_gram.py to test the trained N-gram model.

python test_n_gram.py --token_path data/charset.txt --model_path data/2-gram.model --text 哈哈

The test output will look like this:

INFO - Loaded model from data/2-gram.model
INFO - Model info:
	n: 2
	head2tail length: 5947
	tokens: 5952
The most probable next token of the '哈哈' is '哈'.
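The prediction step can be sketched as picking the tail with the highest count for the last n-1 tokens of the input. This is an assumption about the general technique, not the logic of test_n_gram.py. The function names are hypothetical and the counts below are made up for illustration; they are not taken from the trained model.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

Counts = Dict[Tuple[str, ...], Counter]


def next_token_probs(head2tail: Counts, context: List[str], n: int) -> Dict[str, float]:
    """Estimate P(tail | head) from raw counts (hypothetical helper)."""
    start = max(0, len(context) - (n - 1))
    head = tuple(context[start:])  # the last n-1 tokens of the context
    tails = head2tail.get(head, Counter())
    total = sum(tails.values())
    return {tail: count / total for tail, count in tails.items()} if total else {}


def most_probable_next(head2tail: Counts, context: List[str], n: int) -> Optional[str]:
    """Return the most likely next token, or None for an unseen head."""
    probs = next_token_probs(head2tail, context, n)
    return max(probs, key=probs.get) if probs else None


# Made-up bigram counts, for illustration only.
toy_counts: Counts = {("哈",): Counter({"哈": 3, "了": 1})}

probs = next_token_probs(toy_counts, list("哈哈"), n=2)
best = most_probable_next(toy_counts, list("哈哈"), n=2)
```

With these toy counts, the estimated probability of "哈" following "哈" is 3/4, so it is returned as the most probable next token. An unseen head yields None rather than a guess.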

