Improving Representations via Similarities

Related tags

Miscellaneousembetter
Overview

embetter

warning

I like to build in public, but please don't expect anything yet. This is alpha stuff!

notes

Improving Representations via Similarities

The object to implement:

Embetter(multi_output=True, epochs=50, sampling_kwargs)
  .fit(X, y)
  .fit_sim(X1, X2, y_sim, weights)
  .partial_fit(X, y, classes, weights)
  .partial_fit_sim(X1, X2, y_sim, weights)
  .predict(X)
  .predict_proba(X)
  .predict_sim(X1, X2)
  .transform(X)
  .translate_X_y(X, y, classes=none)

Observation: especially when multi_output=True there's an opportunity with regards to NaN y-values. We can simply choose with values to translate and which to ignore.

Comments
  • [WIP] Feature/progress bar

    [WIP] Feature/progress bar

    Fixes issue #20

    • [x] Adds progress bar to all text and image embedders.
    • [x] Tests for SentenceEncoder.
    • [ ] Use perfplot for progress bar?
    • [ ] Can we ensure fast NumPy vectorization while using a progress bar?
    opened by CarloLepelaars 5
  • [BUG] `device` should be attribute on `SentenceEncoder`

    [BUG] `device` should be attribute on `SentenceEncoder`

    The device argument in SentenceEncoder is not defined as an attribute. This leads to bugs when using it with sklearn. I encountered attribute errors when trying to print out a Pipeline representation that has SentenceEncoder as a component.

    Should be easy to fix by just adding self.device in SentenceEncoder.__init__. We can consider adding tests for text encoders so we can catch these errors beforehand.

    The scikit-learn development docs make it clear every argument should be defined as an attribute:

    every keyword argument accepted by init should correspond to an attribute on the instance. Scikit-learn relies on this to find the relevant attributes to set on an estimator when doing model selection.

    Error message: AttributeError: 'SentenceEncoder' object has no attribute 'device'.

    Reproduction: Python 3.8 with embetter = "^0.2.2"

    se = SentenceEncoder()
    repr(se)
    

    Fix:

    Add self.device on SentenceEncoder

    class SentenceEncoder(EmbetterBase):
        .
        .
        def __init__(self, name="all-MiniLM-L6-v2", device=None):
            if not device:
                device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
            self.device = device
            self.name = name
            self.tfm = SBERT(name, device=self.device)
    
    opened by CarloLepelaars 4
  • Color Histograms - Additional Tricks

    Color Histograms - Additional Tricks

    This approach could work pretty well as an implementation: https://danielmuellerkomorowska.com/2020/06/17/analyzing-image-histograms-with-scikit-image/

    To do something similar to what is explained here: https://www.pinecone.io/learn/color-histograms/

    opened by koaning 4
  • Support for word embeddings

    Support for word embeddings

    Hi,

    Do you think it would be a good idea to add support for static word embeddings (word2vec, glove, etc.)? The embedder would need:

    • A filename to a local embedding file (e.g., glove.6b.100d.txt)
    • Either a callable tokenizer or regex string (i.e., the way sci-kit learn's TfIdfVectorizer splits words).
    • A (name of a) pooling function (e.g., "mean", "max", "sum").

    The second and third parameters could easily have sensible defaults, of course. If you think it's a good idea, I can do the PR somewhere next week.

    Stéphan

    opened by stephantul 3
  • [FEATURE] SpaCyEmbedder

    [FEATURE] SpaCyEmbedder

    I think it would be a nice addition to add an embedder that can easily vectorize text through SpaCy. I already have an implementation class for this and would be happy to contribute it here.

    SpaCy Docs on vector: https://spacy.io/api/doc#vector

    Example code for single string:

    import spacy
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("This here text")
    doc.vector
    
    opened by CarloLepelaars 2
  • `get_feature_names_out` for encoders

    `get_feature_names_out` for encoders

    I would be happy to implement get_feature_names_out for all the Embetter objects. I will implement them by just adding a new method (without a Mixin).

    opened by CarloLepelaars 1
  • Remove the classification layer in timm models

    Remove the classification layer in timm models

    I was playing a bit with the library and found out that the TimmEncoder returns 1000-dimensional vectors for all the models I selected. That is caused by returning the state of the last FC classification layer and the fact all of the models were trained on ImageNet with 1000 classes. In practice, it's typically replaced with identity.

    Are there any reasons for returning the state of that last layer as an embedding? I'd be happy to submit a PR fixing that.

    opened by kacperlukawski 1
  • xception mobilenet

    xception mobilenet

    https://keras.io/api/applications/

    https://www.tensorflow.org/api_docs/python/tf/keras/applications/mobilenet_v2/MobileNetV2 https://www.tensorflow.org/api_docs/python/tf/keras/applications/xception/Xception

    opened by koaning 0
  • 'SentenceEncoder' object has no attribute 'device'

    'SentenceEncoder' object has no attribute 'device'

    text_emb_pipeline = make_pipeline(
      ColumnGrabber("text"),
      SentenceEncoder('all-MiniLM-L6-v2')
    )
    
    # This pipeline can also be trained to make predictions, using
    # the embedded features. 
    text_clf_pipeline = make_pipeline(
      text_emb_pipeline,
      LogisticRegression()
    )
    
    dataf = pd.DataFrame({
      "text": ["positive sentiment", "super negative"],
      "label_col": ["pos", "neg"]
    })
    
    X = text_emb_pipeline.fit_transform(dataf, dataf['label_col'])
    text_clf_pipeline.fit(dataf, dataf['label_col'])
    

    This code gives this error: 'SentenceEncoder' object has no attribute 'device'

    opened by nicholas-dinicola 6
Releases(0.2.2)
Owner
vincent d warmerdam
Solving problems involving data. Mostly NLP these days. AskMeAnything[tm].
vincent d warmerdam
An unofficial python API for trading on the DeGiro platform, with the ability to get real time data and historical data.

DegiroAPI An unofficial API for the trading platform Degiro written in Python with the ability to get real time data and historical data for products.

Jorrick Sleijster 5 Dec 16, 2022
A timer for bird lovers, plays a random birdcall while displaying its image and info.

Birdcall Timer A timer for bird lovers. Siriema hatchling by Junior Peres Junior Background My partner needed a customizable timer for sitting and sta

Marcelo Sanches 1 Jul 08, 2022
Svg-turtle - Use the Python turtle to write SVG files

SaVaGe Turtle Use the Python turtle to write SVG files If you're using the Pytho

Don Kirkby 7 Dec 21, 2022
🛠️ Plugin to integrate Chuy with Poetry

Archived This is bundled with Chuy since v1.3.0. Poetry Chuy Plugin This plugin integrates Chuy with Poetry. Note: This only works in Poetry 1.2.0 or

Eliaz Bobadilla 4 Sep 24, 2021
SuperMario - Python programming class ending assignment SuperMario, using pygame

SuperMario - Python programming class ending assignment SuperMario, using pygame

mars 2 Jan 04, 2022
Repository voor verhalen over de woningbouw-opgave in Nederland

Analyse plancapaciteit woningen In deze notebook zetten we cijfers op een rij om de woningbouwplannen van Nederlandse gemeenten in kaart te kunnen bre

Follow the Money 10 Jun 30, 2022
Prometheus exporter for chess.com player data

chess-exporter Prometheus exporter for chess.com player data implemented via chess.com's published data API and Prometheus Python Client Example use c

Mário Uhrík 7 Feb 28, 2022
Calculadora-basica - Calculator with basic operators

Calculadora básica Calculadora com operadores básicos; O programa solicitará a d

Vitor Antoni 2 Apr 26, 2022
An example module hooking system, will be used in PySAMP.

An example module hooking system, will be used in PySAMP.

2 May 01, 2022
🎉 🎉 PyComp - Java Code compiler written in python.

🎉 🎉 PyComp Java Code compiler written in python. This is yet another compiler meant for babcock students project which was created using pure python

Alumona Benaiah 5 Nov 30, 2022
A collection of examples of using cocotb for functional verification of VHDL designs with GHDL.

At the moment, this repo is in an early state and serves as a learning tool for me. So it contains a a lot of quirks and code which can be done much better by cocotb-professionals.

T. Meissner 7 Mar 10, 2022
Traditionally, there is considerable friction for developers when setting up development environments

This self-led, half-day training will teach participants the patterns and best practices for working with GitHub Codespaces

CSE Labs at Spark 12 Dec 02, 2022
dta Convert Dict To Attributes!

dta (Dict to Attributes) dta is very small dict (or json) to attributes converter. It is only have 1 files and applied to every python versions.

Rukchad Wongprayoon 0 Dec 31, 2021
Tiling manager which runs on top of EWMH window managers.

PyTyle is an extremely versatile and extensible tiling manager that is meant to be used on top of EWMH window managers. Its feature set was modeled af

55 Jul 29, 2021
Toppr Os Auto Class Joiner

Toppr Os Auto Class Joiner Toppr os is a irritating platform to work with especially for students it takes a while and is problematic most of the time

1 Dec 18, 2021
Programmatic startup/shutdown of ASGI apps.

asgi-lifespan Programmatically send startup/shutdown lifespan events into ASGI applications. When used in combination with an ASGI-capable HTTP client

Florimond Manca 129 Dec 27, 2022
My solution for a MARL problem on a Grid Environment with Q-tables.

To run the project, run: conda create --name env python=3.7 pip install -r requirements.txt python run.py To-do: Add direction to the state space Take

Merve Noyan 12 Dec 25, 2021
An attempt at furthering Factorio Calculator to work in more general contexts.

factorio-optimizer Lets do Factorio Calculator but make it optimize. Why not use Factorio Calculator? Becuase factorio calculator is not general. The

Jonathan Woollett-Light 1 Jun 03, 2022
APC Power Usage is an application which shows power consuption overtime for UPS units manufactured by APC.

APC Power Usage Introduction APC Power Usage is an application which shows power consuption overtime for UPS units manufactured by APC. Screenshoots G

Stefan Kondinski 3 Oct 08, 2021
Fly DCS without a joystick

Intro Usage Delete all mouse view axis Install DCSEasyControlExports to your "Saved Games/DCS/" Path python DCSEasyControl/main.py Set DCS to F12 view

XuHao 36 Dec 27, 2022