dirty_cat is a Python module for machine-learning on dirty categorical variables.

Last update: Dec 29, 2022

Overview

dirty_cat

For a detailed description of the problem of encoding dirty categorical data, see Similarity encoding for learning with dirty categorical variables [1] and Encoding high-cardinality string categorical variables [2].

Installation

Dependencies

dirty_cat requires:

Python (>= 3.6)
NumPy (>= 1.16)
SciPy (>= 1.2)
scikit-learn (>= 0.21.0)
pandas (>= 1.1.5)

Optional dependency:

python-Levenshtein for faster edit distances (not used for the n-gram distance)

User installation

If you already have a working installation of NumPy and SciPy, the easiest way to install dirty_cat is with pip:

pip install -U --user dirty_cat

Other implementations

Spark ML: https://github.com/rakutentech/spark-dirty-cat

References

[1]	Patricio Cerda, Gaël Varoquaux, Balázs Kégl. Similarity encoding for learning with dirty categorical variables. 2018. Machine Learning journal, Springer.

[2]	Patricio Cerda, Gaël Varoquaux. Encoding high-cardinality string categorical variables. 2020. IEEE Transactions on Knowledge & Data Engineering.

Comments

DOC first version of the non-linear online example
Example about non-linear starts stabilizing IMO, so I can start taking feedback.

In short, the example shows a classification problem for the traffic_violations dataset.

5 types of models are implemented:

SimilarityEncoder + SVC

SimilarityEncoder + RBFSampler + SVC

SimilarityEncoder + RBFSampler + SGD

SimilarityEncoder fitted on a subset of the data + RBFSampler + SGD

SimilarityEncoder fitted on a subset of the data + RBFSampler + online SGD

Training set sizes are still dummy values so that CI passes quickly. For now, the script takes around 15 seconds to run. RBFSampler can easily be swapped for Nystroem.
opened by pierreglaser 21
Better fuzzy join

I wanted to improve the fuzzy_join example to first use fuzzy_join as it would be used by people using pandas outside of a predictive analysis, and then do the predictive analysis: hence separating X and y later.

I hit a problem that dtypes are not maintained by the merge. I created a simple failing test to illustrate the problem.

opened by GaelVaroquaux 20
Refactor of the fetching system

Follows #147

This is a refactor of the dataset fetching system. Historically, datasets were fetched from various websites.

Our aim with this update is to only use OpenML.org's API, through Scikit-learn's fetch_openml() function. This allows us to have a much more reliable and unified interface, and avoids losing access to datasets due to deletion, renaming, etc.

For instance, with the current system, 4 of the 7 datasets are unavailable (403 Access Denied, website down, etc.).

The user interface stays the same, with the fetch_*() functions (e.g. fetch_open_payments()) still returning a similar dictionary. The major difference is that this dictionary contains, among other information, the path to a CSV file, which must then be loaded (for instance with pandas' read_csv() function).

TL;DR of the previous thread: the way fetch_openml() is used here makes pandas a requirement.
enhancement

opened by LilianBoulard 16
Adding Gamma poisson factorization
Changes:

Added gamma_poisson_factorization.py which implements online Gamma-Poisson factorization for encoding string variables.

Added the corresponding tests in test_gamma_poisson_factorization.

Modified examples 02 and 03 to include this method.

Updated CHANGES.rst and index.rst to describe this new method.
opened by alexis-cvetkov 15
Add support for missing values in the encoders

Encoding a missing value as a vector of zeros is a reasonable thing. Our theoretical study (https://arxiv.org/abs/1902.06931) shows that the most important thing is to encode them in a special value that can be later picked up by the supervised step.

Our encoders should have an option that controls whether missing values are encoded as zeros or an error is raised (following scikit-learn encoders).
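
As a rough illustration of the option proposed above, the sketch below one-hot encodes a column and either maps missing values to an all-zero vector or raises an error. The function name `one_hot_encode` and the value `"error"` are assumptions made for this example; `"zero_impute"` borrows a name that appears in the release notes. This is not dirty_cat's implementation.

```python
from typing import List, Optional, Sequence


def one_hot_encode(
    values: Sequence[Optional[str]],
    handle_missing: str = "zero_impute",
) -> List[List[int]]:
    """Toy one-hot encoder illustrating two ways to treat missing values."""
    if handle_missing not in ("zero_impute", "error"):
        raise ValueError(f"Unknown handle_missing={handle_missing!r}")

    # Categories are the sorted distinct non-missing values.
    categories = sorted({v for v in values if v is not None})
    index = {cat: i for i, cat in enumerate(categories)}

    encoded = []
    for v in values:
        row = [0] * len(categories)
        if v is None:
            if handle_missing == "error":
                raise ValueError("Found a missing value in the input")
            # "zero_impute": leave the row as all zeros.
        else:
            row[index[v]] = 1
        encoded.append(row)
    return encoded


rows = one_hot_encode(["cat", None, "dog"])
```

The all-zero row is distinct from every category's row, so a downstream supervised model can still pick up on missingness.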

opened by GaelVaroquaux 14
Maintenance
This PR aims at improving the overall quality of the code and doc.

It has several purposes:

Correct typos

Reword unclear sentences

Minor updates to the doc

Some minor structural improvements, such as moving some functions to more suitable modules

Rename some variables for better readability

Use modern language features for better readability and performance

Simplify the code, while leaving functionalities intact (no bug-fixes)

Make extensive use of type hinting, which serves two purposes:

Make the code easier to work with, especially when working with IDEs that support type hinting

Make the functions more efficient and less error-prone when using tools that enforce types, such as MyPy

In general, these are rather small modifications for which making unique PRs would be kind of overkill.
opened by LilianBoulard 13
ENH MinHash parallel

Compute the min hash transform method in parallel, as suggested by @alexis-cvetkov.

We no longer use the self.hash_dict attribute, so the fit method does nothing now.
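
To illustrate why a min-hash transform needs no fitted state and lends itself to parallel execution, here is a minimal, dependency-free sketch. The helper names (`char_ngrams`, `seeded_hash`, `minhash`, `minhash_transform`) and the choice of BLAKE2 as hash function are assumptions for this example, not the code of the PR. Threads are used only to keep the sketch self-contained; in CPython they will usually not speed up this kind of CPU-bound work.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from typing import List


def char_ngrams(text: str, n: int = 3) -> List[str]:
    """Return the character n-grams of `text` (the whole string if it is short)."""
    if len(text) <= n:
        return [text]
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def seeded_hash(ngram: str, seed: int) -> int:
    """Deterministic 64-bit hash of an n-gram for a given seed."""
    digest = hashlib.blake2b(
        ngram.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash(text: str, n_components: int = 10) -> List[int]:
    """One value per seed: the minimum hash over the string's n-grams."""
    grams = char_ngrams(text)
    return [min(seeded_hash(g, seed) for g in grams) for seed in range(n_components)]


def minhash_transform(strings: List[str], n_components: int = 10) -> List[List[int]]:
    """Encode each string independently, so the work can be split across workers."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(lambda s: minhash(s, n_components), strings))


encoded = minhash_transform(["london", "londres", "paris"])
```

Because each output row depends only on the input string and the seeds, nothing has to be learned or cached during fit.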

opened by LeoGrin 13

Apply minhash_encoder to more than 1024 categories returns -1

Hi all, I am trying to apply minhash_encoder to a fairly large dataset of strings (~200k distinct). I tested my code with 10 strings and it ran fine, but when I used the whole dataset, most of the strings were represented as all '-1' vectors. I took a look at the source code and found this line in 'minhash_encoder.py', which may be causing the problem: self.hash_dict = LRUDict(capacity=2**10). I am not sure why this is used, but I checked with 1025 strings, and only the first one returns -1. This encoder should work with many more categories, right?

Code to replicate:

from dirty_cat import MinHashEncoder
import random
import string

def get_random_string(length):
    letters = string.ascii_lowercase
    result_str = ''.join(random.choice(letters) for i in range(length))
    return result_str

# 1024 categories -> all ok
raw_data = [get_random_string(10) for x in range(1024)]
hash_encoder = MinHashEncoder(n_components=10)
transformed_values = hash_encoder.fit_transform(raw_data)
print(transformed_values)

# 1025 categories -> first represented as -1's
raw_data = [get_random_string(10) for x in range(1025)]
hash_encoder = MinHashEncoder(n_components=10)
transformed_values = hash_encoder.fit_transform(raw_data)
print(transformed_values)
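
One possible reading of this report, offered as an assumption rather than a confirmed diagnosis: if LRUDict behaves like a standard least-recently-used cache, storing a 1025th entry in a cache of capacity 2**10 evicts the oldest entry, which would match the observation that only the first string is affected. The `TinyLRU` class below is a hypothetical stand-in, not dirty_cat's LRUDict.

```python
from collections import OrderedDict


class TinyLRU:
    """Hypothetical least-recently-used cache with a fixed capacity."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._data: OrderedDict = OrderedDict()

    def __contains__(self, key) -> bool:
        return key in self._data

    def __setitem__(self, key, value) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        # Drop the oldest entry once the capacity is exceeded.
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)


cache = TinyLRU(capacity=2**10)
keys = [f"string_{i}" for i in range(1025)]
for key in keys:
    cache[key] = True

first_key_survived = keys[0] in cache   # evicted by the 1025th insertion
last_key_survived = keys[-1] in cache   # most recent entry is kept
```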

opened by jp-varela 10

AttributeError: 'tuple' object has no attribute 'shape'

Hello!

I was trying to reproduce "Investigating dirty categories" (https://dirty-cat.github.io/stable/auto_examples/01_investigating_dirty_categories.html#sphx-glr-auto-examples-01-investigating-dirty-categories-py) and got this error: AttributeError: 'tuple' object has no attribute 'shape'.

Log says it is in line 241, in fit n_samples, n_features = X.shape

Am I doing something wrong or is it an issue?

I'm on python 3.7.

Thanks

opened by AC-Meira 10
ENH accelerate ngram_similarity
Accelerate the computation in SimilarityEncoder.transform by:

Parallelizing the similarity computations using joblib

Computing the count vectors of the vocabulary at fitting time and not at transform time.
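
The second point can be sketched as follows: the n-gram count vectors of the known categories are built once in fit, so transform only has to vectorise the incoming strings. The class `ToySimilarityEncoder`, its helpers, and the Jaccard-style ratio are assumptions for this illustration and may differ from the formula used in SimilarityEncoder. The joblib parallelisation from the first point is not shown.

```python
from collections import Counter
from typing import Dict, List


def ngram_counts(text: str, n: int = 3) -> Counter:
    """Count the character n-grams of `text`."""
    return Counter(text[i:i + n] for i in range(max(len(text) - n + 1, 1)))


def similarity(a: Counter, b: Counter) -> float:
    """Shared n-grams divided by all n-grams (a Jaccard-style ratio on counts)."""
    shared = sum((a & b).values())
    total = sum((a | b).values())
    return shared / total if total else 0.0


class ToySimilarityEncoder:
    def fit(self, categories: List[str]) -> "ToySimilarityEncoder":
        self.categories_ = sorted(set(categories))
        # Count vectors of the vocabulary are built once, at fit time.
        self.vocabulary_counts_: Dict[str, Counter] = {
            cat: ngram_counts(cat) for cat in self.categories_
        }
        return self

    def transform(self, values: List[str]) -> List[List[float]]:
        rows = []
        for value in values:
            counts = ngram_counts(value)  # only the new string is vectorised here
            rows.append(
                [similarity(counts, self.vocabulary_counts_[cat])
                 for cat in self.categories_]
            )
        return rows


encoder = ToySimilarityEncoder().fit(["london", "paris"])
rows = encoder.transform(["london", "londres"])
```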
opened by pierreglaser 10

Super Vectorizer transforms data to sparse matrices

Actual behavior

The Super Vectorizer transform and fit_transform methods have the following rule: "If any result is a sparse matrix, everything will be converted to sparse matrices." This is the scipy.sparse.csr.csr_matrix type.

However, this type is not commonly accepted for further analysis. For instance, before calling cross_val_score() we first need to convert the result to an array. This also makes it impossible to pass pipelines directly to cross_val_score(), as an error is raised.

Expected behavior

Sparse matrices are produced when the encoded variable has many categories. Maybe introduce a sparse=True parameter, as in scikit-learn's OneHotEncoder, that returns a sparse matrix if set to True and an array if set to False.

Easy code to reproduce bug

import pandas as pd
import numpy as np

from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.experimental import enable_hist_gradient_boosting
# now you can import the HGBR from ensemble
from sklearn.ensemble import HistGradientBoostingRegressor
from dirty_cat import SuperVectorizer

np.random.seed(444)

# One numeric-looking column, one categorical column and a random target.
col1 = np.random.choice(a=[0, 1, 2, 3], size=50, p=[0.4, 0.3, 0.2, 0.1])
col2 = np.random.choice(a=['a', 'b', 'c'], size=50, p=[0.4, 0.4, 0.2])
results = np.random.uniform(size=50)

# Stacking mixed-type columns in one array makes every column object-typed.
df = pd.DataFrame(np.array([col1, col2, results])).transpose()

X = df.drop(columns=[2])
y = df[2]

pipeline = make_pipeline(
    SuperVectorizer(auto_cast=True, sparse_threshold=0.3),
    HistGradientBoostingRegressor()
)

cross_val_score(pipeline, X, y)

bug

opened by jovan-stojanovic 9

Hashing vectorizer in fuzzy join

Following #446 (which seems to show that using HashingVectorizer is almost always faster than using CountVectorizer, without any apparent accuracy tradeoff) and the related discussion, this PR adds a vectorizer parameter to the fuzzy_join function, which defaults to hashing, i.e. using HashingVectorizer.

Replaces #420.

I think someone should check my benchmark in #446 before we consider merging this PR.
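
For readers unfamiliar with the idea behind a hashing vectorizer, the sketch below maps character n-grams straight to bucket indices, so no vocabulary has to be built or stored before matching. It is a stdlib-only illustration: the function names, the use of CRC32, and the bucket count are assumptions, and this is neither scikit-learn's HashingVectorizer nor the fuzzy_join implementation.

```python
import math
import zlib
from typing import Dict, List


def hashed_ngrams(text: str, n: int = 3, n_features: int = 2**16) -> Dict[int, int]:
    """Map character n-grams to bucket counts without storing a vocabulary."""
    text = text.lower()
    buckets: Dict[int, int] = {}
    for i in range(max(len(text) - n + 1, 1)):
        bucket = zlib.crc32(text[i:i + n].encode("utf-8")) % n_features
        buckets[bucket] = buckets.get(bucket, 0) + 1
    return buckets


def cosine(a: Dict[int, int], b: Dict[int, int]) -> float:
    """Cosine similarity between two sparse bucket-count vectors."""
    dot = sum(count * b.get(bucket, 0) for bucket, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def closest_match(query: str, candidates: List[str]) -> str:
    """Return the candidate whose hashed n-gram vector is most similar to the query."""
    query_vec = hashed_ngrams(query)
    return max(candidates, key=lambda c: cosine(query_vec, hashed_ngrams(c)))


best = closest_match("Pariss", ["London", "Paris", "Berlin"])
```

A count-based vectorizer would first need a pass over the data to collect the vocabulary; skipping that pass is one plausible source of the speed difference reported in the benchmark.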
enhancement No Changelog Needed

opened by LeoGrin 1
Benchmark fuzzy join minhash

• Add the possibility for the benchmark function to return a dictionary, which is added to the results by the monitor decorator.
• Benchmark different encoders for fuzzy_join (issue #418, related to #420).

It seems that using the HashingVectorizer instead of CountVectorizer is always better (no cost in F1 score, and almost always faster, see plot). If the user wants to trade off performance for speed, it seems better to adjust ngram_range than to change the encoder. Therefore I recommend simply using the HashingVectorizer for fuzzy_join, instead of using it only for big datasets as in #420.

Looking forward to hearing what other people think!

opened by LeoGrin 1

Encoders do not raise parameter Value Error at initialisation

SimilarityEncoder does not raise a parameter ValueError at initialisation, so the user realises there is a problem only after trying to fit the encoder.

dirty_cat version:

Expected behavior:

SimilarityEncoder(handle_unknown='blabla')
___________________________________________________________________________________
ValueError: Got handle_unknown='blabla', but expected any of {'error', 'ignore'}.

Observed behavior:

SimilarityEncoder(handle_unknown='blabla')
_______________________________________________
# No errors
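
The expected behaviour could look like the following sketch, in which the constructor rejects unsupported values immediately. `StrictEncoder` is a hypothetical class and the default value `"ignore"` is an assumption; only the allowed values and the error message are taken from the report above.

```python
class StrictEncoder:
    """Hypothetical encoder that checks its parameters in the constructor."""

    _ALLOWED_HANDLE_UNKNOWN = ("error", "ignore")

    def __init__(self, handle_unknown: str = "ignore") -> None:
        if handle_unknown not in self._ALLOWED_HANDLE_UNKNOWN:
            # Build the set-like listing by hand so the order is stable.
            allowed = ", ".join(repr(v) for v in self._ALLOWED_HANDLE_UNKNOWN)
            raise ValueError(
                f"Got handle_unknown={handle_unknown!r}, "
                f"but expected any of {{{allowed}}}."
            )
        self.handle_unknown = handle_unknown


try:
    StrictEncoder(handle_unknown="blabla")
    message = ""
except ValueError as exc:
    message = str(exc)
```

Note that scikit-learn's estimator conventions generally defer parameter validation to fit, so whether to validate this early is a design decision for the maintainers.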

bug

opened by jovan-stojanovic 4

Various minor style improvements

Sorry for the annoying-to-review PR! It's a bunch of changes I had in a leftover branch, and I thought it would still be useful to push them. Some changes are redundant with #426.
Documentation No Changelog Needed

opened by LilianBoulard 0

Releases (0.3.0)

0.3.0 (Sep 12, 2022)
What's Changed

Major changes

New encoder: DatetimeEncoder can transform a datetime column into several numerical columns (year, month, day, hour, minute, second, ...). It is now the default transformer used in the SuperVectorizer for datetime columns.

The SuperVectorizer has seen some major improvements and bug fixes

Fixes the automatic casting logic in transform.

Behavior change: to avoid dimensionality explosion when a feature has two unique values, the default encoder (OneHotEncoder) now drops one of the two vectors (see parameter drop="if_binary").

fit_transform and transform can now return unencoded features, like the ColumnTransformer's behavior. Previously, a RuntimeError was raised.

Backward-incompatible change in the SuperVectorizer: to apply remainder to features (with the *_transformer parameters), the value 'remainder' must be passed, instead of None in previous versions. None now indicates that we want to use the default transformer.

Support for Python 3.6 and 3.7 has been dropped. Python >= 3.8 is now required.

Bumped minimum dependencies:

sklearn>=0.23

scipy>=1.4.0

numpy>=1.17.3

pandas>=1.2.0

Dropped support for Jaro, Jaro-Winkler and Levenshtein distances. The SimilarityEncoder now exclusively uses ngram for similarities, and the similarity parameter is deprecated. It will be removed in 0.5.
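
To make the DatetimeEncoder entry above more concrete, the sketch below expands datetime values into the numerical fields named in the release note. It uses only the standard library; `expand_datetime`, `expand_column` and `FIELDS` are illustrative names and do not reflect the encoder's actual API or output layout.

```python
from datetime import datetime
from typing import Dict, List

# Fields listed in the release note; the real encoder may extract others too.
FIELDS = ("year", "month", "day", "hour", "minute", "second")


def expand_datetime(value: datetime) -> Dict[str, int]:
    """Split one datetime into its numerical components."""
    return {field: getattr(value, field) for field in FIELDS}


def expand_column(values: List[datetime]) -> List[List[int]]:
    """Turn a datetime column into one numerical column per field."""
    return [[expand_datetime(v)[f] for f in FIELDS] for v in values]


features = expand_column([datetime(2022, 9, 12, 14, 30, 5)])
```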

Notes

The transformers_ attribute of the SuperVectorizer now contains column names instead of column indices for the "remainder" columns.

Full Changelog: https://github.com/dirty-cat/dirty_cat/compare/0.2.0...0.3.0
0.3.0b1 (Sep 9, 2022)

What's Changed

Preview release for dirty_cat 0.3.0. Changelog has been moved over there.

Full Changelog: https://github.com/dirty-cat/dirty_cat/compare/0.2.0...0.3.0b1
0.2.0 (Oct 13, 2021)
What's Changed

Major changes

Bump minimum dependencies:

Python (>= 3.6)

NumPy (>= 1.16)

SciPy (>= 1.2)

scikit-learn (>= 0.20.0)

SuperVectorizer: Added automatic transformation through the SuperVectorizer class. It transforms columns automatically based on their type, and provides a replacement for scikit-learn's ColumnTransformer that is simpler to use on heterogeneous pandas DataFrames.

Backward incompatible change to GapEncoder: The GapEncoder now only supports two-dimensional inputs of shape (n_samples, n_features). Internally, features are encoded by independent GapEncoder models, and are then concatenated into a single matrix.

Backward incompatible change to MinHashEncoder: The MinHashEncoder now only supports two-dimensional inputs of shape (n_samples, 1).

Bump minimum dependencies:

Python (>= 3.6)

NumPy (>= 1.16)

SciPy (>= 1.2)

scikit-learn (>= 0.21.0)

pandas (>= 1.1.5) ! NEW REQUIREMENT !

datasets.fetching - backward-incompatible changes to the example datasets fetchers:

The backend has changed: we now exclusively fetch the datasets from OpenML. End users should not see any difference regarding this.

The frontend, however, changed a little: the fetching functions stay the same but their return values were modified in favor of a more Pythonic interface. Refer to the docstrings of functions dirty_cat.datasets.fetching.fetch_* for more information.

The example notebooks were updated to reflect these changes.

Minor changes

Removed hard-coded CSV file dirty_cat/data/FiveThirtyEight_Midwest_Survey.csv.

Updated handle_missing parameters:

GapEncoder: the default value "zero_impute" becomes "empty_impute" (see doc).

MinHashEncoder: the default value "" becomes "zero_impute" (see doc).

Several bug-fixes

Full Changelog: https://github.com/dirty-cat/dirty_cat/compare/0.1.0...0.2.0
0.2.0a1 (Jul 20, 2021)


Owner

GitHub Repository https://dirty-cat.github.io/
