Simple Similarities Service

Overview

simsity

Simsity is a Super Simple Similarities Service[tm].
It's all about building a neighborhood. Literally!

This repository contains simple tools for similarity retrieval scenarios: a convenient wrapper around encoding strategies and nearest neighbor approaches. Typical use cases include early-stage bulk labelling and duplicate discovery.

Warning

Alpha software. Expect things to break. Do not use in production.

Quickstart

This is the basic setup for this package.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

from simsity.service import Service
from simsity.indexer import PyNNDescentIndexer
from simsity.preprocessing import Identity, ColumnLister


# The encoder defines how we encode the data going in.
encoder = make_pipeline(
    ColumnLister(column="text"),
    CountVectorizer()
)

# The indexer handles the nearest neighbor lookup.
indexer = PyNNDescentIndexer(metric="euclidean", n_neighbors=2)

# The service combines the two into a single object.
service = Service(
    encoder=encoder,
    indexer=indexer,
)

# We can now train the service.
df_clinc = pd.read_csv("tests/data/clinc-data.csv")
service.train_from_dataf(df_clinc, features=["text"])

# Query the datapoints
service.query("give me directions", n_neighbors=20)

# Save the entire system
service.save("/tmp/simple-model")

# You can also load the model now.
reloaded = Service.load("/tmp/simple-model")

# We can also host it as a web service
reloaded.serve(host='0.0.0.0', port=8080)

# You can now POST to http://0.0.0.0:8080/query with payload:
# {"query": {"text": "hello there"}, "n_neighbors": 20}
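The request described in the comments above could be assembled from Python roughly as follows. This is a minimal sketch using only the standard library; the helper name `build_query_request` is hypothetical, and the assumption that the service answers with JSON is not confirmed by this README.

```python
import json
from urllib import request


def build_query_request(text, n_neighbors=20, base_url="http://0.0.0.0:8080"):
    """Build (but do not send) a POST request for the /query endpoint."""
    # Payload shape copied from the quickstart comment.
    payload = {"query": {"text": text}, "n_neighbors": n_neighbors}
    return request.Request(
        url=f"{base_url}/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_query_request("hello there")

# Sending only works while the service from the quickstart is running.
# Parsing the reply as JSON is an assumption about the response format.
# with request.urlopen(req) as response:
#     neighbors = json.load(response)
```

Building the request separately from sending it makes the payload easy to inspect before the server is up.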
Comments
  • Add support for pretrained encoders and transformed data

    First of all, this project looks great! I've taken an initial stab at #12 and also tried to add support for querying data that has already been transformed. If you have data that you've already transformed (e.g. a UMAP embedding), you probably don't want to rerun encoder.transform again. In this case you want to index the transformed data and query it directly.

    This is just a first crack so happy to incorporate any feedback you might have!

    opened by gclen 10
  • embetter: better embeddings

    This is conceptual work in progress. The maintainer is actively researching this, please do not work on it.

    Problem Statement

    When you submit "where is my phoone" and ask for similar items, you may get results like:

    • where is my phone
    • where is my credit card

    Depending on your task, either the "where is" part of the sentence is more important or the "phone" part is more important. The encoder, however, may be very brittle when it comes to spelling errors. To put it more generally:

    The similarity in an embedded space in our case is very much "general". I'm using "general" here, as opposed to "specific" to indicate that these similarities have been constructed without having a task in mind.
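The brittleness to spelling errors can be illustrated with a small sketch. It uses plain word-set overlap as a stand-in for a bag-of-words encoder; this is an assumption made for illustration and not how simsity computes similarities.

```python
def jaccard(a, b):
    """Jaccard similarity between the word sets of two sentences."""
    set_a, set_b = set(a.split()), set(b.split())
    return len(set_a & set_b) / len(set_a | set_b)


query = "where is my phoone"
candidates = ["where is my phone", "where is my credit card"]

# Words shared with the query: the misspelled "phoone" matches nothing,
# so both candidates share exactly the same generic words.
shared = {c: sorted(set(query.split()) & set(c.split())) for c in candidates}
scores = {c: jaccard(query, c) for c in candidates}

for candidate in candidates:
    print(candidate, shared[candidate], round(scores[candidate], 2))
```

In this sketch the only thing separating the two candidates is sentence length, not the word the user actually cared about.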

    Similar Issue

    Suppose that we are deduplicating and we have a zipcode, city, first name, and last name. How would our encoding be able to understand that having the same city is not a strong signal while having the same first name certainly is? Can we really expect a standard encoding to understand this? Without labels ... I think not.
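A toy sketch of the deduplication concern: when every field counts equally, a shared city looks as strong as a shared first name. The records, the field names, and the weights below are all made up for illustration; the weights in particular are hypothetical and would have to come from labels in practice.

```python
FIELDS = ("zipcode", "city", "first_name", "last_name")


def match_score(a, b, weights):
    """Weighted fraction of fields on which two records agree."""
    total = sum(weights[f] for f in FIELDS)
    agreed = sum(weights[f] for f in FIELDS if a[f] == b[f])
    return agreed / total


# Made-up records for illustration only.
record = {"zipcode": "0001", "city": "A-town", "first_name": "Alex", "last_name": "Doe"}
same_city = {"zipcode": "0002", "city": "A-town", "first_name": "Sam", "last_name": "Roe"}
same_name = {"zipcode": "0003", "city": "B-town", "first_name": "Alex", "last_name": "Roe"}

# Task-agnostic: every field counts the same.
uniform = {f: 1.0 for f in FIELDS}
# Hypothetical task-specific weights (an assumption, not learned from data).
task_weights = {"zipcode": 1.0, "city": 0.2, "first_name": 2.0, "last_name": 2.0}

uniform_scores = (match_score(record, same_city, uniform),
                  match_score(record, same_name, uniform))
weighted_scores = (match_score(record, same_city, task_weights),
                   match_score(record, same_name, task_weights))
```

With uniform weights the two candidates are indistinguishable; only the task-specific weights rank the shared first name above the shared city.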

    opened by koaning 3
  • Add `Identity` as default encoder for Service.

    As mentioned in https://github.com/koaning/simsity/pull/13:

    I think the refit parameter should go in the Service() call. I think there should also be a parameter somewhere to avoid calling .transform() if the data has already been transformed. Do you think it is worth adding an additional parameter to Service() and keeping the indexed_from_transformed_data method?

    It's a fair remark. I think preventing a transform() call is fair, but the solution would be to have an Identity() transformer that just keeps the data as-is. This would also make a great default value for the encoder.

    Made this issue to track progress and to discuss the approach.
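A pass-through transformer of the kind discussed here could look roughly like the sketch below. The class name `IdentitySketch` is hypothetical and this is not simsity's actual `Identity` implementation; it only follows the fit/transform convention, and a version meant for scikit-learn pipelines would likely also subclass `BaseEstimator` and `TransformerMixin`.

```python
class IdentitySketch:
    """Pass-through transformer following the fit/transform convention."""

    def fit(self, X, y=None):
        # Nothing to learn; return self so calls can be chained.
        return self

    def transform(self, X):
        # Hand the input back unchanged, e.g. embeddings computed elsewhere.
        return X

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


# Vectors that were already transformed outside the service.
precomputed = [[0.1, 0.9], [0.8, 0.2]]
encoded = IdentitySketch().fit_transform(precomputed)
```

Because the data is returned as-is, the indexer would see exactly the vectors that were passed in.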

    opened by koaning 2
  • Codecalm tutorial on simsity

    Hi Vincent. Since I discovered your work, my barrier to Python has eroded! Thank you. I'm a Data Scientist who wants to check whether simsity can help with retrieving similar regions based on environmental variables.

    opened by FrancyJGLisboa 2
  • Update indexer

    Hi! Are there any plans to add support for updating the indexer, i.e. adding new documents without retraining the entire pipeline? It would be a very useful feature.

    from simsity.service import Service
    
    service = Service(
        indexer=indexer,
        encoder=encoder
    )
    
    service.train_from_dataf(df, features=["text"])
    
    ....
    
    service.update(new_docs, features=["text"])  # <- this
    
    
    opened by nthomsencph 1
  • New API

    I think the original design was flawed and this project should follow the scikit-learn API more closely.

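    from sklearn.pipeline import make_pipeline, make_union

    from simsity.preprocessing import Grab
    from simsity.service import Service
    from simsity.indexer import (AnnoyIndexer, PynnDescentIndexed, NMSlibIndexer,
                                 PineconeIndexer, QdrantIndexer, WeviateIndexer)
    
    # Proposed API: `SentenceEncoder`, `indexer`, and `X` are left undefined in this proposal.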
    encoder = make_pipeline(
        make_union(
            make_pipeline(Grab("text"), SentenceEncoder()),
            make_pipeline(Grab("title"), SentenceEncoder())
        )
    )
    
    service = Service(encoder, indexer, batch_size=50)
    service.index(X)
    items, dists = service.query(X, n=10)
    
    opened by koaning 0
  • Education Day Goals

    • [x] add typing + type checker
    • [x] add tests for the minhash tools
    • [ ] collect more useful datasets
    • [x] automate the benchmarking
    • [x] write getting started guides
    • [ ] record a quick demo for colleagues
    • [ ] add github actions stash
    opened by koaning 0
  • added-components

    Adding the MinHash components. This is also an amazing opportunity to:

    • [ ] add types and a type checker
    • [ ] add some standard tests for indexers
    • [ ] add a script to run some benchmarks on the clinc dataset
    opened by koaning 0
Releases (0.1.1)
Owner
vincent d warmerdam
Solving problems involving data. Mostly NLP these days. AskMeAnything[tm].