Simple Similarities Service

Overview

simsity

Simsity is a Super Simple Similarities Service[tm].
It's all about building a neighborhood. Literally!

This repository contains simple tools to help in similarity retreival scenarios by making a convient wrapper around encoding strategies as well as nearest neighbor approaches. Typical usecases include early stage bulk labelling and duplication discovery.

Warning

Alpha software. Expect things to break. Do not use in production.

Quickstart

This is the basic setup for this package.

import pandas as pd

from simsity.service import Service
from simsity.indexer import PyNNDescentIndexer
from simsity.preprocessing import Identity, ColumnLister


# The Indexer handles the nearest neighbor search
# The Encoder handles the encoding of the datapoints
service = Service(
    indexer=PyNNDescentIndexer(metric="euclidean"),
    encoder=CountVectorizer()
)

# The encoder defines how we encode the data going in.
encoder = make_pipeline(
    ColumnLister(column="text"),
    CountVectorizer()
)

# The indexer handles the nearest neighbor lookup.
indexer = PyNNDescentIndexer(metric="euclidean", n_neighbors=2)

# The service combines the two into a single object.
service_clinc = Service(
    encoder=encoder,
    indexer=indexer,
)

# We can now train the service.
df_clinc = pd.read_csv("tests/data/clinc-data.csv")
service_clinc.train_from_dataf(df_clinc, features=["text"])

# Query the datapoints
service.query("give me directions", n_neighbors=20)

# Save the entire system
service.save("/tmp/simple-model")

# You can also load the model now.
reloaded = Service.load("/tmp/simple-model")

# We can also host it as a web service
reloaded.serve(host='0.0.0.0', port=8080)

# You can now POST to http://0.0.0.0:8080/query with payload:
# {"query": {"text": "hello there"}, "n_neighbors": 20}
Comments
  • Add support for pretrained encoders and transformed data

    Add support for pretrained encoders and transformed data

    First of all this project looks great! I've taken an initial stab at #12 and also tried to add support querying data that has already been transformed. If you have data that you've already transformed (e.g. a UMAP embedding), you probably don't want to rerun encoder.transform again. In this case you want to index the transformed data and query it directly.

    This is just a first crack so happy to incorporate any feedback you might have!

    opened by gclen 10
  • embetter: better embeddings

    embetter: better embeddings

    This is conceptual work in progress. The maintainer is actively researching this, please do not work on it.

    Problem Statement

    When you submit where is my phoone and you get similarities you may get things like:

    • where is my phone
    • where is my credit card

    Depending on your task, either the "where is" part of the sentence is more important or the "phone" part is more important. The encoder, however, may be very brittle when it comes to spelling errors. So to put it more generally;

    image

    The similarity in an embedded space in our case is very much "general". I'm using "general" here, as opposed to "specific" to indicate that these similarities have been constructed without having a task in mind.

    Similar Issue

    Suppose that we are deduplicating and we have a zipcode, city, first-, and last-name. How would our encoding be able to understand that having the same city is not a strong signal while having the first name certainly is? Can we really expect a standard encoding to understand this? Without labels ... I think not.

    opened by koaning 3
  • Add `Identity` as default encoder for Service.

    Add `Identity` as default encoder for Service.

    As mentioned in https://github.com/koaning/simsity/pull/13:

    I think the refit parameter should go in the Service() call. I think there should also be a parameter somewhere to avoid calling .transform() if the data has already been transformed. Do you think it is worth adding an additional parameter to Service() and keeping the indexed_from_transformed_data method?

    It's a fair remark. I think preventing a transfrom() is fair, but the solution would be to have an Identity() transformer that just keeps the data as-is. This would also make a great default value for the encoder.

    Made this issue to track progress and to discuss the approach.

    opened by koaning 2
  • Codecalm tutorial on simsity

    Codecalm tutorial on simsity

    Hi Vincent. Since I discovered you my barrier towards Python has eroded! Thank you. I'm a Data Scientist who wants to check if simsity can help with retrieving similar regions based on environmental variables.

    opened by FrancyJGLisboa 2
  • Update indexer

    Update indexer

    Hi! Are there any plans to add support for updating the indexer, i.e. add new documents without retraining the entire pipeline? Would be a very useful feature .

    from simsity.service import Service
    
    service = Service(
        indexer=indexer,
        encoder=encoder
    )
    
    service.train_from_dataf(df, features=["text"])
    
    ....
    
    service.update(new_docs, features=["text"])  # <- this
    
    
    opened by nthomsencph 1
  • New API

    New API

    I think the original design was flawed and this project should stick to the scikit-learn API more.

    from simsity.preprocessing import Grab
    from simsity.service import Service
    from simsity.indexer import (AnnoyIndexer, PynnDescentIndexed, NMSlibIndexer,
                                 PineconeIndexer, QdrantIndexer, WeviateIndexer)
    
    
    encoder = make_pipeline(
        make_union(
            make_pipeline(Grab("text"), SentenceEncoder()),
            make_pipeline(Grab("title"), SentenceEncoder())
        )
    )
    
    service = Service(encoder, indexer, batch_size=50)
    service.index(X)
    items, dists = service.query(X, n=10)
    
    opened by koaning 0
  • Education Day Goals

    Education Day Goals

    • [x] add typing + type checker
    • [x] add tests for the minhash tools
    • [ ] collect more useful datasets
    • [x] automate the benchmarking
    • [x] write getting started guides
    • [ ] record a quick demo for colleagues
    • [ ] add github actions stash
    opened by koaning 0
  • added-components

    added-components

    Adding the MinHash components. This is also an amazing opportunity to:

    • [ ] add types and a type checker
    • [ ] add some standard tests for indexers
    • [ ] add a script to run some benchmarks on the clinc dataset
    opened by koaning 0
Releases(0.1.1)
Owner
vincent d warmerdam
Solving problems involving data. Mostly NLP these days. AskMeAnything[tm].
vincent d warmerdam
An Simple Advance Auto Filter Bot Complete Rewritten Version Of Adv-Filter-Bot

Adv Auto Filter Bot V2 This Is Just An Simple Advance Auto Filter Bot Complete Rewritten Version Of Adv-Filter-Bot.. Just Sent Any Text As Query It Wi

0 Dec 18, 2021
This app is providing you to track some online products' prices via GMAIL.

Price Tracking App variables and descriptions of that code is in Turkish language. but we're working on translate them into English. This app is provi

Abdullah Aslan 1 Dec 11, 2021
Discord Unverified Token Gen

Discord-Unverified-Token-Gen This is a token gen that was made in an hour and just generates unverified tokens, most will be locked. Usage: in cmd jus

Aran 2 Oct 23, 2022
This is a music bot for discord written in python

this is a music bot for discord written in python, it is designed for educational use ONLY, I do not take any responsibility for uses outside of educational use

5 Dec 24, 2021
This code will guide you on how you can make your own Twitter Bot.

This code will guide you on how you can make your own Twitter Bot. This bot retweets, and likes to tweet with a particular word, here Python3.

Kunal Diwan 1 Oct 14, 2022
Ark API Wrapper in Python

Pythark Ark API Wrapper in Python. Built with Python Requests Installation Pythark uses Arky to create a new transaction, if you want to use this feat

Jolan 14 Mar 11, 2021
Discord bot ( discord.py ), uses pandas library from python for data-management.

Discord_bot A Best and the most easy-to-use Discord bot !! Some simple basic auto moderations, Chat functions. It includes a game similar to Casino, g

Jaitej 4 Aug 30, 2022
JAWS Pankration 2021 - DDD on AWS Lambda sample

JAWS Pankration 2021 - DDD on AWS Lambda sample What is this project? This project contains sample code for AWS Lambda with domain models. I presented

Atsushi Fukui 21 Mar 30, 2022
Bulk NFT uploader to OpenSea!

Bulk NFT Uploader Description Simple easy peasy python script which logins to opensea account using metamask and bulk uploads NFT to your default coll

Lakshya Khera 25 May 23, 2022
Python client for QIWI payment system

Pyqiwi Lib for QIWI payment system Installation pip install pyqiwi Usage from decimal import Decimal from datetime import datetime, timedelta from p

Andrey 12 Jun 03, 2022
Bot to notify when vaccine appointments are available

Vaccine Watch Bot to notify when vaccine appointments are available. Supports checking Hy-Vee, Walgreens, CVS, Walmart, Cosentino's stores (KC), and B

Peter Carnesciali 37 Aug 13, 2022
Python Discord Server Nuker

Untitled Nuker Python Discord Server Nuker Features: Ban Everyone Kick Everyone Rename Everyone Spam To All Channels Delete All Channels Delete All Ro

22 Dec 22, 2022
My telegram bot to download Instagram Profiles

Instagram Profile Get for Telegram My telegram bot to download Instagram Profiles First you have to get a telegrm bot api key from @BotFather Then you

Ali Yoonesi 2 Sep 22, 2022
Use Node JS Keywords In Python!!!

Use Node JS Keywords In Python!!!

Sancho Godinho 1 Oct 23, 2021
A Pancakeswap and Uniswap trading client (and bot) with limit orders, marker orders, stop-loss, custom gas strategies, a GUI and much more.

Pancakeswap and Uniswap trading client Adam A A Pancakeswap and Uniswap trading client (and bot) with market orders, limit orders, stop-loss, custom g

570 Mar 09, 2022
PyMusic Player is a music player written in python3.

PyMusic Player is a music player written in python3. It harvests r

PythonSerious 2 Jan 30, 2022
Polar devices Python API and CLI.

loophole - Polar devices API About Python API for Polar devices. Command line interface included. Tested with: A360 Loop M400 Installation pip install

[roscoe] 145 Sep 14, 2022
Python + AWS Lambda Hands OnPython + AWS Lambda Hands On

Python + AWS Lambda Hands On Python Criada em 1990, por Guido Van Rossum. "Bala de prata" (quase). Muito utilizado em: Automatizações - Selenium, Beau

Marcelo Ortiz de Santana 8 Sep 09, 2022
This discord bot preview user 42intra login picture.

42intra_Pic BOT This discord bot preview user 42intra login picture. created by: @YOPI#8626 Using: Python 3.9 (64-bit) (You don't need 3.9 but some fu

Zakaria Yacoubi 7 Mar 22, 2022
An NFTGenerator to generate NFTs and send them to nft.storage

NFTGenerator Table of Contents Overview Installation Introduction Features Reflection Issues & bug reports Show your support Credits Overview The NFTG

3 Mar 14, 2022