Snowball compiler and stemming algorithms

Last update: Jan 07, 2023

Overview

Snowball is a small string processing language for creating stemming algorithms for use in Information Retrieval, plus a collection of stemming algorithms implemented using it.

Snowball was originally designed and built by Martin Porter. Martin retired from development in 2014 and Snowball is now maintained as a community project. Martin originally chose the name Snowball as a tribute to SNOBOL, the excellent string handling language from the 1960s. It now also serves as a metaphor for how the project grows by gathering contributions over time.

The Snowball compiler translates a Snowball program into source code in another language - currently ISO C, C#, Go, Java, JavaScript, Object Pascal, Python and Rust are supported.

This repository contains the source code for the Snowball compiler and the stemming algorithms. The Snowball compiler is written in ISO C - you'll need a C compiler which supports C99 to build it (but the C code it generates should work with any ISO C compiler).

See https://snowballstem.org/ for more information about Snowball.

What is Stemming?

Stemming maps different forms of the same word to a common "stem" - for example, the English stemmer maps connection, connections, connective, connected, and connecting to connect. So a search for connected would also find documents which only have the other forms.
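As a rough illustration of why this helps searching, here is a minimal Python sketch that indexes documents by stem. Note that toy_stem, TOY_SUFFIXES, build_index and search are hypothetical names invented for this example, and the suffix stripping is a deliberately crude stand-in: it is NOT the Snowball English stemmer, which applies far more careful rules.

```python
from collections import defaultdict

# Hypothetical toy suffix list, longest first.
# This is an assumption for illustration, NOT the Snowball English algorithm.
TOY_SUFFIXES = ("ions", "ion", "ive", "ing", "ed", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs):
    """Map each stem to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[toy_stem(word)].add(doc_id)
    return index


def search(index, query):
    """Stem the query the same way as the documents, then look it up."""
    return sorted(index.get(toy_stem(query), set()))


docs = {
    1: "the connection was lost",
    2: "connective tissue",
    3: "connecting flights",
    4: "unrelated text",
}
index = build_index(docs)
print(search(index, "connected"))  # prints [1, 2, 3]
```

None of the documents contains the word connected, but the query and the indexed words all reduce to the same stem, so documents 1 to 3 match. A real system would replace toy_stem with a stemmer generated by the Snowball compiler.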

This stem form is often a word itself, but not always: text search systems, which are the intended field of use, do not require it. We also aim to conflate words with the same meaning, rather than all words with a common linguistic root (so awe and awful don't have the same stem). Over-stemming is more problematic than under-stemming, so we tend not to stem in cases that are hard to resolve. If you want to always reduce words to a root form and/or get a root form which is itself a word, then Snowball's stemming algorithms likely aren't the right answer.

