Simple Similarities Service

Overview

simsity

Simsity is a Super Simple Similarities Service[tm].
It's all about building a neighborhood. Literally!

This repository contains simple tools for similarity retrieval scenarios. It provides a convenient wrapper around encoding strategies as well as nearest neighbor approaches. Typical use cases include early-stage bulk labelling and duplicate discovery.

Warning

Alpha software. Expect things to break. Do not use in production.

Quickstart

This is the basic setup for this package.

import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer

from simsity.service import Service
from simsity.indexer import PyNNDescentIndexer
from simsity.preprocessing import Identity, ColumnLister


# The encoder defines how we encode the data going in.
encoder = make_pipeline(
    ColumnLister(column="text"),
    CountVectorizer()
)

# The indexer handles the nearest neighbor lookup.
indexer = PyNNDescentIndexer(metric="euclidean", n_neighbors=2)

# The service combines the two into a single object.
service_clinc = Service(
    encoder=encoder,
    indexer=indexer,
)

# We can now train the service.
df_clinc = pd.read_csv("tests/data/clinc-data.csv")
service_clinc.train_from_dataf(df_clinc, features=["text"])

# Query the datapoints
service_clinc.query("give me directions", n_neighbors=20)

# Save the entire system
service_clinc.save("/tmp/simple-model")

# You can also load the model now.
reloaded = Service.load("/tmp/simple-model")

# We can also host it as a web service
reloaded.serve(host='0.0.0.0', port=8080)

# You can now POST to http://0.0.0.0:8080/query with payload:
# {"query": {"text": "hello there"}, "n_neighbors": 20}
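
As a rough sketch of how a client might call the served endpoint, the snippet below builds the POST request with the payload shown above using only the Python standard library. The helper names `build_query_request` and `send_query` are hypothetical, and the shape of the response is not documented here, so decoding it as JSON is an assumption.

```python
import json
import urllib.request


def build_query_request(text, n_neighbors=20, base_url="http://0.0.0.0:8080"):
    """Build a POST request for the /query endpoint (hypothetical helper)."""
    payload = {"query": {"text": text}, "n_neighbors": n_neighbors}
    return urllib.request.Request(
        url=f"{base_url}/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_query(text, n_neighbors=20, base_url="http://0.0.0.0:8080"):
    """Send the request; assumes the service replies with a JSON body."""
    request = build_query_request(text, n_neighbors, base_url)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the service running, `send_query("hello there")` would issue the request. Nothing is sent until that function is called.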
Comments
  • Add support for pretrained encoders and transformed data

    First of all this project looks great! I've taken an initial stab at #12 and also tried to add support querying data that has already been transformed. If you have data that you've already transformed (e.g. a UMAP embedding), you probably don't want to rerun encoder.transform again. In this case you want to index the transformed data and query it directly.

    This is just a first crack so happy to incorporate any feedback you might have!

    opened by gclen 10
  • embetter: better embeddings

    This is conceptual work in progress. The maintainer is actively researching this, please do not work on it.

    Problem Statement

    When you submit "where is my phoone" and ask for similar items, you may get things like:

    • where is my phone
    • where is my credit card

    Depending on your task, either the "where is" part of the sentence is more important or the "phone" part is more important. The encoder, however, may be very brittle when it comes to spelling errors. So, to put it more generally:

    The similarity in an embedded space in our case is very much "general". I'm using "general" here, as opposed to "specific" to indicate that these similarities have been constructed without having a task in mind.

    Similar Issue

    Suppose that we are deduplicating and we have a zipcode, city, first-, and last-name. How would our encoding be able to understand that having the same city is not a strong signal while having the first name certainly is? Can we really expect a standard encoding to understand this? Without labels ... I think not.

    opened by koaning 3
  • Add `Identity` as default encoder for Service.

    As mentioned in https://github.com/koaning/simsity/pull/13:

    I think the refit parameter should go in the Service() call. I think there should also be a parameter somewhere to avoid calling .transform() if the data has already been transformed. Do you think it is worth adding an additional parameter to Service() and keeping the indexed_from_transformed_data method?

    It's a fair remark. I think preventing a .transform() call is fair, but the solution would be to have an Identity() transformer that just keeps the data as-is. This would also make a great default value for the encoder.

    Made this issue to track progress and to discuss the approach.

    opened by koaning 2
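
A pass-through encoder of the kind described in this issue could look roughly like the sketch below. It follows the scikit-learn `fit`/`transform` convention by duck typing and is not the `Identity` class shipped in `simsity.preprocessing`. The name `IdentitySketch` is hypothetical.

```python
class IdentitySketch:
    """Pass-through encoder: assumes the data has already been transformed."""

    def fit(self, X, y=None):
        # Nothing to learn, so fitting is a no-op that returns the encoder.
        return self

    def transform(self, X):
        # Hand the input back unchanged instead of re-encoding it.
        return X

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


# Example: vectors that were embedded elsewhere pass through untouched.
embedded = [[0.1, 0.2], [0.3, 0.4]]
unchanged = IdentitySketch().fit_transform(embedded)
```

For use inside a scikit-learn pipeline, subclassing `BaseEstimator` and `TransformerMixin` would be the more conventional route. The plain class above only illustrates the idea.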
  • Codecalm tutorial on simsity

    Hi Vincent. Since I discovered you my barrier towards Python has eroded! Thank you. I'm a Data Scientist who wants to check if simsity can help with retrieving similar regions based on environmental variables.

    opened by FrancyJGLisboa 2
  • Update indexer

    Hi! Are there any plans to add support for updating the indexer, i.e. adding new documents without retraining the entire pipeline? It would be a very useful feature.

    from simsity.service import Service
    
    service = Service(
        indexer=indexer,
        encoder=encoder
    )
    
    service.train_from_dataf(df, features=["text"])
    
    # ...
    
    service.update(new_docs, features=["text"])  # <- this
    
    
    opened by nthomsencph 1
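
To illustrate what incremental updates mean for a nearest neighbor index, here is a minimal sketch that swaps in an exact brute-force index for illustration. It is not simsity's API and says nothing about how the approximate indexers behave. The class name `BruteForceIndex` and its methods are hypothetical.

```python
import math


class BruteForceIndex:
    """Exact nearest neighbor lookup over a growing list of vectors."""

    def __init__(self):
        self.vectors = []
        self.items = []

    def add(self, vectors, items):
        # An exact index has no training step, so adding is just appending.
        if len(vectors) != len(items):
            raise ValueError("vectors and items must have the same length")
        self.vectors.extend(vectors)
        self.items.extend(items)

    def query(self, vector, n_neighbors=1):
        # Compute the Euclidean distance to every stored vector, then sort.
        scored = [
            (math.dist(vector, stored), item)
            for stored, item in zip(self.vectors, self.items)
        ]
        scored.sort(key=lambda pair: pair[0])
        return scored[:n_neighbors]


index = BruteForceIndex()
index.add([[0.0, 0.0], [1.0, 1.0]], ["a", "b"])
index.add([[0.9, 1.0]], ["c"])  # new document, no rebuild needed
neighbors = index.query([1.0, 1.0], n_neighbors=2)
```

The trade-off is that every query scans all stored vectors. That is the cost an approximate index is meant to avoid.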
  • New API

    I think the original design was flawed and this project should stick to the scikit-learn API more.

    from sklearn.pipeline import make_pipeline, make_union

    from simsity.preprocessing import Grab
    from simsity.service import Service
    from simsity.indexer import (AnnoyIndexer, PyNNDescentIndexer, NMSlibIndexer,
                                 PineconeIndexer, QdrantIndexer, WeaviateIndexer)
    
    
    encoder = make_pipeline(
        make_union(
            make_pipeline(Grab("text"), SentenceEncoder()),
            make_pipeline(Grab("title"), SentenceEncoder())
        )
    )
    
    service = Service(encoder, indexer, batch_size=50)
    service.index(X)
    items, dists = service.query(X, n=10)
    
    opened by koaning 0
  • Education Day Goals

    • [x] add typing + type checker
    • [x] add tests for the minhash tools
    • [ ] collect more useful datasets
    • [x] automate the benchmarking
    • [x] write getting started guides
    • [ ] record a quick demo for colleagues
    • [ ] add github actions stash
    opened by koaning 0
  • added-components

    Adding the MinHash components. This is also an amazing opportunity to:

    • [ ] add types and a type checker
    • [ ] add some standard tests for indexers
    • [ ] add a script to run some benchmarks on the clinc dataset
    opened by koaning 0
Releases(0.1.1)
Owner
vincent d warmerdam
Solving problems involving data. Mostly NLP these days. AskMeAnything[tm].