Simple Similarities Service

Overview

simsity

Simsity is a Super Simple Similarities Service[tm].
It's all about building a neighborhood. Literally!

This repository contains simple tools to help in similarity retreival scenarios by making a convient wrapper around encoding strategies as well as nearest neighbor approaches. Typical usecases include early stage bulk labelling and duplication discovery.

Warning

Alpha software. Expect things to break. Do not use in production.

Quickstart

This is the basic setup for this package.

import pandas as pd

from simsity.service import Service
from simsity.indexer import PyNNDescentIndexer
from simsity.preprocessing import Identity, ColumnLister


# The Indexer handles the nearest neighbor search
# The Encoder handles the encoding of the datapoints
service = Service(
    indexer=PyNNDescentIndexer(metric="euclidean"),
    encoder=CountVectorizer()
)

# The encoder defines how we encode the data going in.
encoder = make_pipeline(
    ColumnLister(column="text"),
    CountVectorizer()
)

# The indexer handles the nearest neighbor lookup.
indexer = PyNNDescentIndexer(metric="euclidean", n_neighbors=2)

# The service combines the two into a single object.
service_clinc = Service(
    encoder=encoder,
    indexer=indexer,
)

# We can now train the service.
df_clinc = pd.read_csv("tests/data/clinc-data.csv")
service_clinc.train_from_dataf(df_clinc, features=["text"])

# Query the datapoints
service.query("give me directions", n_neighbors=20)

# Save the entire system
service.save("/tmp/simple-model")

# You can also load the model now.
reloaded = Service.load("/tmp/simple-model")

# We can also host it as a web service
reloaded.serve(host='0.0.0.0', port=8080)

# You can now POST to http://0.0.0.0:8080/query with payload:
# {"query": {"text": "hello there"}, "n_neighbors": 20}
Comments
  • Add support for pretrained encoders and transformed data

    Add support for pretrained encoders and transformed data

    First of all this project looks great! I've taken an initial stab at #12 and also tried to add support querying data that has already been transformed. If you have data that you've already transformed (e.g. a UMAP embedding), you probably don't want to rerun encoder.transform again. In this case you want to index the transformed data and query it directly.

    This is just a first crack so happy to incorporate any feedback you might have!

    opened by gclen 10
  • embetter: better embeddings

    embetter: better embeddings

    This is conceptual work in progress. The maintainer is actively researching this, please do not work on it.

    Problem Statement

    When you submit where is my phoone and you get similarities you may get things like:

    • where is my phone
    • where is my credit card

    Depending on your task, either the "where is" part of the sentence is more important or the "phone" part is more important. The encoder, however, may be very brittle when it comes to spelling errors. So to put it more generally;

    image

    The similarity in an embedded space in our case is very much "general". I'm using "general" here, as opposed to "specific" to indicate that these similarities have been constructed without having a task in mind.

    Similar Issue

    Suppose that we are deduplicating and we have a zipcode, city, first-, and last-name. How would our encoding be able to understand that having the same city is not a strong signal while having the first name certainly is? Can we really expect a standard encoding to understand this? Without labels ... I think not.

    opened by koaning 3
  • Add `Identity` as default encoder for Service.

    Add `Identity` as default encoder for Service.

    As mentioned in https://github.com/koaning/simsity/pull/13:

    I think the refit parameter should go in the Service() call. I think there should also be a parameter somewhere to avoid calling .transform() if the data has already been transformed. Do you think it is worth adding an additional parameter to Service() and keeping the indexed_from_transformed_data method?

    It's a fair remark. I think preventing a transfrom() is fair, but the solution would be to have an Identity() transformer that just keeps the data as-is. This would also make a great default value for the encoder.

    Made this issue to track progress and to discuss the approach.

    opened by koaning 2
  • Codecalm tutorial on simsity

    Codecalm tutorial on simsity

    Hi Vincent. Since I discovered you my barrier towards Python has eroded! Thank you. I'm a Data Scientist who wants to check if simsity can help with retrieving similar regions based on environmental variables.

    opened by FrancyJGLisboa 2
  • Update indexer

    Update indexer

    Hi! Are there any plans to add support for updating the indexer, i.e. add new documents without retraining the entire pipeline? Would be a very useful feature .

    from simsity.service import Service
    
    service = Service(
        indexer=indexer,
        encoder=encoder
    )
    
    service.train_from_dataf(df, features=["text"])
    
    ....
    
    service.update(new_docs, features=["text"])  # <- this
    
    
    opened by nthomsencph 1
  • New API

    New API

    I think the original design was flawed and this project should stick to the scikit-learn API more.

    from simsity.preprocessing import Grab
    from simsity.service import Service
    from simsity.indexer import (AnnoyIndexer, PynnDescentIndexed, NMSlibIndexer,
                                 PineconeIndexer, QdrantIndexer, WeviateIndexer)
    
    
    encoder = make_pipeline(
        make_union(
            make_pipeline(Grab("text"), SentenceEncoder()),
            make_pipeline(Grab("title"), SentenceEncoder())
        )
    )
    
    service = Service(encoder, indexer, batch_size=50)
    service.index(X)
    items, dists = service.query(X, n=10)
    
    opened by koaning 0
  • Education Day Goals

    Education Day Goals

    • [x] add typing + type checker
    • [x] add tests for the minhash tools
    • [ ] collect more useful datasets
    • [x] automate the benchmarking
    • [x] write getting started guides
    • [ ] record a quick demo for colleagues
    • [ ] add github actions stash
    opened by koaning 0
  • added-components

    added-components

    Adding the MinHash components. This is also an amazing opportunity to:

    • [ ] add types and a type checker
    • [ ] add some standard tests for indexers
    • [ ] add a script to run some benchmarks on the clinc dataset
    opened by koaning 0
Releases(0.1.1)
Owner
vincent d warmerdam
Solving problems involving data. Mostly NLP these days. AskMeAnything[tm].
vincent d warmerdam
Design and build a wrapper for the Open Weather API current weather data service

Design and build a wrapper for the Open Weather API current weather data service that returns a city's temperature, with caching, also allowing for the temperature of the latest queried cities that a

Duan Rafael Ribeiro 1 Jun 27, 2022
Python client for using Prefect Cloud with Saturn Cloud

prefect-saturn prefect-saturn is a Python package that makes it easy to run Prefect Cloud flows on a Dask cluster with Saturn Cloud. For a detailed tu

Saturn Cloud 15 Dec 07, 2022
AWS CloudSaga - Simulate security events in AWS

AWS CloudSaga - Simulate security events in AWS AWS CloudSaga is for customers to test security controls and alerts within their Amazon Web Services (

Amazon Web Services - Labs 325 Dec 01, 2022
A Simple Telegram Bot To Download And Upload Files

AquaDLBot โž  I Can Download And Upload files To Telegram DEMO Copyright (C) 2020-2026 by [ema

Asia Argento 8 Feb 15, 2022
Telegram bot for downloading covid-19 vaccine certificate

cowin-certificate-bot This is the source code of @cowincertbot, A telegram bot inspired by the whatsapp bot implementation of indian government for co

ArUn Pt 30 Oct 07, 2022
Melissa Songs is a telegram bot to easily find songs sending music snippets and search tracks ๐Ÿ’ƒ๐Ÿฝ๐ŸŽต

Find songs on Telegram, like on Shazam... ๐Ÿ˜Ž Open on telegram ยท Report Bug ยท Request Feature โฌ‡๏ธ Installation To get a local copy installed and working

Joaquim Roque 21 Nov 10, 2022
Project developed as part of a selection process for the company Denox

๐Ÿ“ Tabela de conteรบdos Sobre Requisitos para rodar o projeto Instalaรงรฃo Rotas da API Observaรงรตes ๐Ÿง Sobre Projeto desenvolvido como parte de um proces

รcaro Sant'Ana 1 Mar 01, 2022
๐€ ๐ฆ๐จ๐๐ฎ๐ฅ๐š๐ซ ๐“๐ž๐ฅ๐ž๐ ๐ซ๐š๐ฆ ๐†๐ซ๐จ๐ฎ๐ฉ ๐ฆ๐š๐ง๐š๐ ๐ž๐ฆ๐ž๐ง๐ญ ๐›๐จ๐ญ ๐ฐ๐ข๐ญ๐ก ๐ฎ๐ฅ๐ญ๐ข๐ฆ๐š๐ญ๐ž ๐Ÿ๐ž๐š๐ญ๐ฎ๐ซ๐ž๐ฌ

๐‡๐จ๐ฐ ๐“๐จ ๐ƒ๐ž๐ฉ๐ฅ๐จ๐ฒ For easiest way to deploy this Bot click on the below button ๐Œ๐š๐๐ž ๐๐ฒ ๐’๐ฎ๐ฉ๐ฉ๐จ๐ซ๐ญ ๐†๐ซ๐จ๐ฎ๐ฉ ๐’๐จ๐ฎ๐ซ๐œ๐ž๐ฌ ๐†๐ž๐ง๐ž?

Mukesh Solanki 2 Oct 06, 2021
This is a Telegram Bot that tracks packages from the Brazilian Mail Service.

RastreioBot About Setup Run Contribute Contact About This is a Telegram Bot that tracks packages from the Brazilian Mail Service. It runs on Python 3

Gabriel R F 320 Dec 22, 2022
A Simple Telegram Maths Calculator Bot

Calculator-Bot-v1 A Simple Telegram Maths Calculator Bot Demo BOT LINK: Variables Variables Required Variables API_HASH: Get

แ—ชแ—ฉแ–‡K โœžOแ–‡แ—ช 1 Dec 18, 2021
A Code that can make your Discord Account 24/7!

Online-Forever Make your Discord Account Online 24/7! A Code written in Python that helps you to keep your account 24/7. The main.py is the main file.

Phantom 556 Dec 29, 2022
Shellkg-py - A temporary Repository to rewrite of shellpkg in python

Shellkg-py - A temporary Repository to rewrite of shellpkg in python

2 Jan 26, 2022
You cant check for conflicts until course enrolment actually opens. I wanted to do it earlier.

AcornICS I noticed that Acorn it does not let you check if a timetable is valid based on the enrollment cart, it also does not let you visualize it ea

Isidor Kaplan 2 Sep 16, 2021
HTTP Calls to Amazon Web Services Rest API for IoT Core Shadow Actions ๐Ÿ’ป๐ŸŒ๐Ÿ’ก

aws-iot-shadow-rest-api HTTP Calls to Amazon Web Services Rest API for IoT Core Shadow Actions ๐Ÿ’ป ๐ŸŒ ๐Ÿ’ก This simple script implements the following aw

AIIIXIII 3 Jun 06, 2022
A pyrogram simple bot for Educational purpose.

A pyrogram simple bot for Educational purpose. To Learn More check at @PyrogramBot or on Documentation Mandatory variables API_ID - Get It From my.tel

SpamShield 10 Dec 06, 2022
Graviti-python-sdk - Graviti Data Platform Python SDK

Graviti Python SDK Graviti Python SDK is a python library to access Graviti Data

Graviti 13 Dec 15, 2022
Simple Discord bot which logs several events in your server

logging-bot Simple Discord bot which logs several events in your server, including: Message Edits Message Deletes Role Adds Role Removes Member joins

1 Feb 14, 2022
Rich presence app for playstation 3. Display what game you are playing on the PS3 via Discord

PS3-Rich-Presence-for-Discord Discord Rich Presence script for PS3 consoles on HFW&HEN or CFW. Written in Python. Display what you are playing on your

17 Dec 11, 2022
Lumberjack-bot - A game bot written for Lumberjack game at Telegram platform

This is a game bot written for Lumberjack game at Telegram platform. It is devel

UฤŸur Uysal 6 Apr 07, 2022
Simple Bot With Python 3.8+ For Converstaion Your Media

Media-Conversation Simple Bot With Python 3.8+ For Converstaion Your Media

Farzin 2 Dec 06, 2021