dirty_cat is a Python module for machine-learning on dirty categorical variables.

Overview

dirty_cat

Website: https://dirty-cat.github.io/

For a detailed description of the problem of encoding dirty categorical data, see Similarity encoding for learning with dirty categorical variables [1] and Encoding high-cardinality string categorical variables [2].

Installation

Dependencies

dirty_cat requires:

  • Python (>= 3.6)
  • NumPy (>= 1.16)
  • SciPy (>= 1.2)
  • scikit-learn (>= 0.21.0)
  • pandas (>= 1.1.5)

Optional dependency:

  • python-Levenshtein for faster edit distances (not used for the n-gram distance)

User installation

If you already have a working installation of NumPy and SciPy, the easiest way to install dirty_cat is with pip:

pip install -U --user dirty_cat

Other implementations

References

[1] Patricio Cerda, Gaël Varoquaux, Balázs Kégl. Similarity encoding for learning with dirty categorical variables. 2018. Machine Learning journal, Springer.
[2] Patricio Cerda, Gaël Varoquaux. Encoding high-cardinality string categorical variables. 2020. IEEE Transactions on Knowledge & Data Engineering.
Comments
  • DOC first version of the non-linear online example

    The non-linear example is starting to stabilize IMO, so I can start taking feedback.

    In short, the example shows a classification problem for the traffic_violations dataset.

    5 types of models are implemented:

    • SimilarityEncoder + SVC
    • SimilarityEncoder + RBFSampler + SVC
    • SimilarityEncoder + RBFSampler + SGD
    • SimilarityEncoder fitted on a subset of the data + RBFSampler + SGD
    • SimilarityEncoder fitted on a subset of the data + RBFSampler + online SGD

    Training set sizes are still dummy values so that CI passes quickly. For now, the script takes around 15 seconds to run. RBFSampler can easily be swapped for Nystroem.
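
The last variant in the list above (a feature map fitted on a subset, followed by online SGD) can be sketched roughly as follows. This is only an illustration under assumptions: the random numeric features stand in for the output of an encoder such as SimilarityEncoder, and the data, batch size, and hyperparameters are made up rather than taken from the example in this PR.

```python
import numpy as np
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)

# Stand-in for already-encoded features (assumption: not real encoder output).
X = rng.uniform(size=(600, 5))
# Non-linear target: 1 for points close to the center of the unit cube.
y = (((X - 0.5) ** 2).sum(axis=1) < 0.4).astype(int)

# Fit the random Fourier feature map on a subset of the data only.
rbf = RBFSampler(gamma=2.0, n_components=200, random_state=0).fit(X[:100])

# Train a linear model online, one mini-batch at a time.
clf = SGDClassifier(random_state=0)
classes = np.unique(y)
for start in range(0, len(X), 100):
    batch = slice(start, start + 100)
    clf.partial_fit(rbf.transform(X[batch]), y[batch], classes=classes)

predictions = clf.predict(rbf.transform(X))
```

Swapping RBFSampler for Nystroem would only change the feature-map line, since both live in sklearn.kernel_approximation and expose fit and transform.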

    opened by pierreglaser 21
  • Better fuzzy join

    I wanted to improve the fuzzy_join example to first use fuzzy_join as it would be used by people using pandas outside of a predictive analysis, and then do the predictive analysis: hence separating X and y later.

    I hit a problem that dtypes are not maintained by the merge. I created a simple failing test to illustrate the problem.

    opened by GaelVaroquaux 20
  • Refactor of the fetching system

    Follows #147


    This is a refactor of the dataset fetching system. Historically, datasets were fetched from various websites.

    Our aim with this update is to only use OpenML.org's API, through Scikit-learn's fetch_openml() function. This allows us to have a much more reliable and unified interface, and avoids losing access to datasets due to deletion, renaming, etc.

    For instance, with the current system, 4 of the 7 datasets are unavailable (403 Access Denied, website down, etc.).

    The user interface stays the same, with the fetch_*() functions (e.g. fetch_open_payments()) still returning a similar dictionary. The major difference is that this dictionary now contains, among other information, a path to a CSV file, which must then be loaded (for instance with pandas' read_csv() function).
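
The calling pattern described above (a fetcher returns a dictionary, and the caller loads the CSV found at its path) might look like the sketch below. Everything here is hypothetical: fetch_toy_dataset is a local stand-in that writes its own file rather than one of the real fetch_*() functions, the "description" key is invented, and the standard csv module is used only to keep the sketch dependency-free.

```python
import csv
import tempfile
from pathlib import Path


def fetch_toy_dataset(directory):
    """Hypothetical stand-in for a fetch_*() function.

    Writes a tiny CSV locally (no network access) and returns a dictionary
    that contains the path of that file.
    """
    path = Path(directory) / "toy.csv"
    path.write_text(
        "employee,position\n"
        "alice,Police Officer\n"
        "bob,police officer II\n"
    )
    return {"description": "Toy dataset for illustration.", "path": path}


with tempfile.TemporaryDirectory() as tmp:
    info = fetch_toy_dataset(tmp)
    # The caller is responsible for loading the file found at info["path"].
    with open(info["path"], newline="") as f:
        rows = list(csv.DictReader(f))
```

With pandas installed, the loading step would instead be a single read_csv call on the same path.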


    TL;DR of the previous thread: the way fetch_openml() is used here makes pandas a requirement.

    enhancement 
    opened by LilianBoulard 16
  • Adding Gamma poisson factorization

    Changes:

    • Added gamma_poisson_factorization.py which implements online Gamma-Poisson factorization for encoding string variables.
    • Added the corresponding tests in test_gamma_poisson_factorization.
    • Modified examples 02 and 03 to include this method.
    • Updated CHANGES.rst and index.rst to describe this new method.
    opened by alexis-cvetkov 15
  • Add support for missing values in the encoders

    Encoding a missing value as a vector of zeros is a reasonable thing. Our theoretical study (https://arxiv.org/abs/1902.06931) shows that the most important thing is to encode them in a special value that can be later picked up by the supervised step.

    Our encoders should have an option that controls whether missing values are encoded as zeros or an error is raised (following scikit-learn encoders).
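
A minimal sketch of such an option, using a toy one-hot encoder written for this illustration. The function toy_one_hot is hypothetical and is not one of the dirty_cat encoders; the option values "error" and "zero_impute" are assumptions chosen to mirror the two behaviours described above.

```python
import math


def is_missing(value):
    """True for None and for float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def toy_one_hot(values, categories, handle_missing="error"):
    """Toy one-hot encoder with an option for missing values."""
    if handle_missing not in ("error", "zero_impute"):
        raise ValueError(f"Unknown handle_missing={handle_missing!r}")
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0.0] * len(categories)
        if is_missing(value):
            if handle_missing == "error":
                raise ValueError("Found missing values in input data.")
            # "zero_impute": leave the row as a vector of zeros.
        else:
            row[index[value]] = 1.0
        rows.append(row)
    return rows


encoded = toy_one_hot(["cat", None, "dog"], ["cat", "dog"],
                      handle_missing="zero_impute")
```

The all-zero row is distinct from every category's row, so a downstream supervised model can still pick it up as a special value.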

    opened by GaelVaroquaux 14
  • Maintenance

    This PR aims at improving the overall quality of the code and doc.

    It has several purposes:

    • Correct typos
    • Reword unclear sentences
    • Minor updates to the doc
    • Some minor structural improvements, such as moving some functions to suiting modules
    • Rename some variables for better readability
    • Use modern language features for better readability and performance
    • Simplify the code, while leaving functionalities intact (no bug-fixes)
    • Make extensive use of type hinting, which serves two purposes:
      • Make the code easier to work with, especially when working with IDEs that support type hinting
      • Make the functions more efficient and less error-prone when using tools that enforce types, such as MyPy

    In general, these are rather small modifications for which making unique PRs would be kind of overkill.

    opened by LilianBoulard 13
  • ENH MinHash parallel

    Compute the min hash transform method in parallel, as suggested by @alexis-cvetkov.

    We no longer use the self.hash_dict attribute, so the fit method does nothing now.
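
The idea can be sketched as follows: when the min-hash of a string depends only on that string, there is no fitted state to share, so inputs can be processed independently. This is an illustration under assumptions, not the PR's code: toy_minhash and the crc32-based hashing are made up for the sketch, and a thread pool is swapped in only to keep it dependency-free (in CPython, a process-based backend would be needed for a real speed-up on pure-Python hashing).

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def char_ngrams(string, n=3):
    """Return the set of character n-grams of a space-padded string."""
    padded = f" {string} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def toy_minhash(string, n_components=4):
    """One min-hash value per seed; depends only on the input string."""
    grams = char_ngrams(string)
    if not grams:
        return [0] * n_components
    return [
        min(zlib.crc32(f"{seed}:{gram}".encode()) for gram in grams)
        for seed in range(n_components)
    ]


def parallel_transform(strings, n_components=4, max_workers=4):
    """Encode each string independently; map() preserves input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda s: toy_minhash(s, n_components), strings))


strings = ["london", "londres", "paris", ""]
encoded = parallel_transform(strings)
```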

    opened by LeoGrin 13
  • Apply minhash_encoder to more than 1024 categories returns -1

    Hi all, I am trying to apply minhash_encoder to a fairly large dataset of strings (~200k distinct). I was testing my code with 10 strings, and it ran fine. But when I tested on the whole dataset, most of the strings were represented as all '-1' vectors. I took a look at the source code and found this line in 'minhash_encoder.py' that may be causing the problem: self.hash_dict = LRUDict(capacity=2**10). I'm not sure why this is used, but I checked with 1025 strings, and only the first one returns -1. This encoder should work with a lot more categories, right?

    Code to replicate:

    from dirty_cat import MinHashEncoder
    import random
    import string
    
    def get_random_string(length):
        letters = string.ascii_lowercase
        result_str = ''.join(random.choice(letters) for i in range(length))
        return result_str
    
    # 1024 categories -> all ok
    raw_data = [get_random_string(10) for x in range(1024)]
    hash_encoder = MinHashEncoder(n_components=10)
    transformed_values = hash_encoder.fit_transform(raw_data)
    print(transformed_values)
    
    # 1025 categories -> first represented as -1's
    raw_data = [get_random_string(10) for x in range(1025)]
    hash_encoder = MinHashEncoder(n_components=10)
    transformed_values = hash_encoder.fit_transform(raw_data)
    print(transformed_values)
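
The 1024 versus 1025 boundary is consistent with how a least-recently-used cache of capacity 2**10 behaves. The sketch below uses a toy cache written for this illustration; ToyLRUDict is hypothetical and is not dirty_cat's LRUDict, so it only shows the eviction pattern, not the library's actual code path.

```python
from collections import OrderedDict


class ToyLRUDict:
    """Hypothetical least-recently-used cache with a fixed capacity."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()

    def __setitem__(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            # Drop the least recently used entry.
            self._data.popitem(last=False)

    def __contains__(self, key):
        return key in self._data


cache = ToyLRUDict(capacity=2**10)
keys = [f"category_{i}" for i in range(1025)]
for key in keys:
    cache[key] = len(key)  # the stored value is irrelevant here

first_evicted = keys[0] not in cache
others_kept = all(key in cache for key in keys[1:])
```

With 1025 insertions, only the very first key has been evicted, which matches the observation that only the first string comes back as -1.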
    
    opened by jp-varela 10
  • AttributeError: 'tuple' object has no attribute 'shape'

    Hello!

    I was trying to reproduce "Investigating dirty categories" (https://dirty-cat.github.io/stable/auto_examples/01_investigating_dirty_categories.html#sphx-glr-auto-examples-01-investigating-dirty-categories-py) and got this error: AttributeError: 'tuple' object has no attribute 'shape'.

    The log says it is at line 241, in fit: n_samples, n_features = X.shape

    Am I doing something wrong or is it an issue?

    I'm on python 3.7.

    Thanks

    opened by AC-Meira 10
  • ENH accelerate ngram_similarity

    Accelerate the computation in SimilarityEncoder.transform by:

    • Parallelizing the similarity computations using joblib
    • Computing the count vectors of the vocabulary at fitting time and not at transform time.
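
The second point can be illustrated with a toy encoder that stores the n-gram counts of its vocabulary once, at fit time, and reuses them in transform. This is a sketch under assumptions: ToySimilarityEncoder is hypothetical, and the similarity used here (shared n-gram counts divided by the union of counts) is a simple stand-in that may differ from the library's exact formula.

```python
from collections import Counter


def ngram_counts(string, n=3):
    """Count the character n-grams of a space-padded string."""
    padded = f" {string} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def ngram_similarity(counts_a, counts_b):
    """Shared n-gram counts divided by the union of n-gram counts."""
    shared = sum((counts_a & counts_b).values())
    total = sum((counts_a | counts_b).values())
    return shared / total if total else 0.0


class ToySimilarityEncoder:
    def fit(self, categories):
        self.categories_ = sorted(set(categories))
        # Computed once here instead of on every call to transform.
        self.vocabulary_counts_ = [ngram_counts(c) for c in self.categories_]
        return self

    def transform(self, values):
        rows = []
        for value in values:
            counts = ngram_counts(value)
            rows.append([ngram_similarity(counts, reference)
                         for reference in self.vocabulary_counts_])
        return rows


encoder = ToySimilarityEncoder().fit(["london", "paris"])
encoded = encoder.transform(["london", "londno"])
```

Each row of transform is independent of the others, which is what makes the per-row work straightforward to distribute with a tool such as joblib.
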
    opened by pierreglaser 10
  • Super Vectorizer transforms data to sparse matrices

    Actual behavior

    The Super Vectorizer transform and fit_transform methods have the following rule: "If any result is a sparse matrix, everything will be converted to sparse matrices." This is the scipy.sparse.csr.csr_matrix type.

    However, this type is not commonly accepted for further analysis. For instance, before calling cross_val_score() we first need to convert the result to an array. This also makes it impossible to pass a pipeline directly to cross_val_score(), as an error is raised.

    Expected behavior

    Sparse matrices happen when the encoded variable has a lot of categories. Maybe introduce a sparse=True parameter, just like for the sklearn OHE, that returns a sparse matrix if set to True and a dense array if False.

    Easy code to reproduce bug

    import pandas as pd
    import numpy as np
    
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.experimental import enable_hist_gradient_boosting
    # now you can import the HGBR from ensemble
    from sklearn.ensemble import HistGradientBoostingRegressor
    from dirty_cat import SuperVectorizer
    
    np.random.seed(444) 
    col1 = np.random.choice(  
         a=[0, 1, 2, 3],  
         size=50,  
         p=[0.4, 0.3, 0.2, 0.1])
    
    col2 = np.random.choice(  
         a=['a', 'b', 'c'],  
         size=50,  
         p=[0.4, 0.4, 0.2])
    
    results = np.random.uniform( 
         size=50)
    
    df = pd.DataFrame(np.array([col1, col2, results])).transpose()
    
    X = df.drop(columns=[2])
    y = df[2]
    
    sup_vec = SuperVectorizer()
    
    pipeline = make_pipeline(
        SuperVectorizer(auto_cast=True, sparse_threshold=0.3),
        HistGradientBoostingRegressor()
    )
    
    cross_val_score(pipeline, X, y)
    
    bug 
    opened by jovan-stojanovic 9
  • Hashing vectorizer in fuzzy join

    Following #446 (which seems to show that using HashingVectorizer is almost always faster than using CountVectorizer, without any apparent accuracy tradeoff) and discussion, this PR adds a vectorizer parameter to the fuzzy_join function, which defaults to hashing, i.e using HashingVectorizer.

    Replaces #420.

    I think someone should check my benchmark in #446 before we consider merging this PR.
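
To make the difference concrete, the sketch below matches strings using hashed character n-grams: each n-gram is mapped straight to a bucket index, so no vocabulary has to be built first, which is the main practical difference from a count-based vectorizer. It is a toy written for this illustration, not the fuzzy_join implementation; the function names, the crc32 hash, and the bucket count are all assumptions.

```python
import math
import zlib
from collections import Counter


def hashed_ngram_vector(string, n=3, n_features=2**12):
    """Sparse vector of n-gram counts, indexed by hash bucket."""
    padded = f" {string.lower()} "
    return Counter(
        zlib.crc32(padded[i:i + n].encode()) % n_features
        for i in range(len(padded) - n + 1)
    )


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[bucket] for bucket, count in u.items() if bucket in v)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0


def toy_fuzzy_match(left, right):
    """Map each left string to its most similar right string."""
    right_vectors = [hashed_ngram_vector(r) for r in right]
    matches = {}
    for item in left:
        vector = hashed_ngram_vector(item)
        best = max(range(len(right)),
                   key=lambda j: cosine(vector, right_vectors[j]))
        matches[item] = right[best]
    return matches


matches = toy_fuzzy_match(["Untied States", "Frnace"],
                          ["United States", "France", "Germany"])
```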

    enhancement No Changelog Needed 
    opened by LeoGrin 1
  • Benchmark fuzzy join minhash

    • Add the possibility for the benchmark function to return a dictionary, which is added to the results by the monitor decorator.
    • Benchmark different encoders for fuzzy_join (issue #418, related to #420).

    It seems that using the HashingVectorizer instead of CountVectorizer is always better (no cost in F1 score, and almost always faster, see plot). If users want to trade predictive performance for speed, it seems better to adjust ngram_range than to change the encoder. I therefore recommend always using the HashingVectorizer for fuzzy_join, instead of using it only for big datasets as in #420.

    Looking forward to hearing what other people think!

    opened by LeoGrin 1
  • Encoders do not raise parameter Value Error at initialisation

    SimilarityEncoder does not raise a parameter ValueError at initialisation, so the user only realises there is a problem after trying to fit the encoder.

    dirty_cat version:

    Expected behavior:

    SimilarityEncoder(handle_unknown='blabla')
    ___________________________________________________________________________________
    ValueError: Got handle_unknown='blabla', but expected any of {'error', 'ignore'}. 
    

    Observed behavior:

    SimilarityEncoder(handle_unknown='blabla')
    _______________________________________________
    # No errors
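
The expected behaviour could be sketched as below, with the check placed in the constructor. ToyEncoder is a hypothetical class written for this illustration and is not the SimilarityEncoder code. Note that scikit-learn's own estimators conventionally defer parameter validation to fit, so validating at initialisation is a design choice rather than a given.

```python
class ToyEncoder:
    """Hypothetical encoder that validates its parameters at initialisation."""

    _VALID_HANDLE_UNKNOWN = ("error", "ignore")

    def __init__(self, handle_unknown="ignore"):
        if handle_unknown not in self._VALID_HANDLE_UNKNOWN:
            raise ValueError(
                f"Got handle_unknown={handle_unknown!r}, but expected "
                f"any of {self._VALID_HANDLE_UNKNOWN}."
            )
        self.handle_unknown = handle_unknown


try:
    ToyEncoder(handle_unknown="blabla")
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```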
    
    bug 
    opened by jovan-stojanovic 4
  • Various minor style improvements

    Sorry for the annoying-to-review PR! It's a bunch of changes I had in a leftover branch. I thought it would still be useful to push. Some changes are redundant with #426.

    Documentation No Changelog Needed 
    opened by LilianBoulard 0
Releases(0.3.0)
  • 0.3.0(Sep 12, 2022)

    What's Changed

    Major changes

    • New encoder: DatetimeEncoder can transform a datetime column into several numerical columns (year, month, day, hour, minute, second, ...). It is now the default transformer used in the SuperVectorizer for datetime columns.

    • The SuperVectorizer has seen some major improvements and bug fixes

      • Fixes the automatic casting logic in transform.
      • Behavior change: to avoid dimensionality explosion when a feature has two unique values, the default encoder (OneHotEncoder) now drops one of the two vectors (see parameter drop="if_binary").
      • fit_transform and transform can now return unencoded features, like the ColumnTransformer's behavior. Previously, a RuntimeError was raised.
    • Backward-incompatible change in the SuperVectorizer: to apply remainder to features (with the *_transformer parameters), the value 'remainder' must be passed, instead of None in previous versions. None now indicates that we want to use the default transformer.

    • Support for Python 3.6 and 3.7 has been dropped. Python >= 3.8 is now required.

    • Bumped minimum dependencies:

      • sklearn>=0.23
      • scipy>=1.4.0
      • numpy>=1.17.3
      • pandas>=1.2.0
    • Dropped support for Jaro, Jaro-Winkler and Levenshtein distances. The SimilarityEncoder now exclusively uses ngram for similarities, and the similarity parameter is deprecated. It will be removed in 0.5.
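
As a rough illustration of the DatetimeEncoder idea listed in the major changes above (one datetime column turned into several numerical columns), here is a toy version built on the standard library. The function toy_datetime_features is hypothetical; it is not the DatetimeEncoder API and covers only the six fields named explicitly in the release note.

```python
from datetime import datetime

# Fields extracted from each datetime value, in output column order.
FIELDS = ("year", "month", "day", "hour", "minute", "second")


def toy_datetime_features(values):
    """Turn each datetime into a row of numerical features."""
    return [[getattr(value, field) for field in FIELDS] for value in values]


rows = toy_datetime_features([
    datetime(2022, 9, 12, 14, 30, 5),
    datetime(2021, 10, 13),
])
# rows == [[2022, 9, 12, 14, 30, 5], [2021, 10, 13, 0, 0, 0]]
```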

    Notes

    • The transformers_ attribute of the SuperVectorizer now contains column names instead of column indices for the "remainder" columns.

    Full Changelog: https://github.com/dirty-cat/dirty_cat/compare/0.2.0...0.3.0

  • 0.3.0b1(Sep 9, 2022)

  • 0.2.0(Oct 13, 2021)

    What's Changed

    Major changes

    • Bump minimum dependencies:

      • Python (>= 3.6)
      • NumPy (>= 1.16)
      • SciPy (>= 1.2)
      • scikit-learn (>= 0.20.0)
    • SuperVectorizer: Added automatic transform through the SuperVectorizer class. It transforms columns automatically based on their type. It provides a replacement for scikit-learn's ColumnTransformer that is simpler to use on heterogeneous pandas DataFrames.

    • Backward incompatible change to GapEncoder: The GapEncoder now only supports two-dimensional inputs of shape (n_samples, n_features). Internally, features are encoded by independent GapEncoder models, and are then concatenated into a single matrix.

    • Backward incompatible change to MinHashEncoder: The MinHashEncoder now only supports two-dimensional inputs of shape (n_samples, 1).

    • Bump minimum dependencies:

      • Python (>= 3.6)
      • NumPy (>= 1.16)
      • SciPy (>= 1.2)
      • scikit-learn (>= 0.21.0)
      • pandas (>= 1.1.5) ! NEW REQUIREMENT !
    • datasets.fetching - backward-incompatible changes to the example datasets fetchers:

      • The backend has changed: we now exclusively fetch the datasets from OpenML. End users should not see any difference regarding this.
      • The frontend, however, changed a little: the fetching functions stay the same but their return values were modified in favor of a more Pythonic interface. Refer to the docstrings of functions dirty_cat.datasets.fetching.fetch_* for more information.
      • The example notebooks were updated to reflect these changes.

    Minor changes

    • Removed hard-coded CSV file dirty_cat/data/FiveThirtyEight_Midwest_Survey.csv.

    • Updated handle_missing parameters:

      • GapEncoder: the default value "zero_impute" becomes "empty_impute" (see doc).
      • MinHashEncoder: the default value "" becomes "zero_impute" (see doc).
    • Several bug-fixes

    Full Changelog: https://github.com/dirty-cat/dirty_cat/compare/0.1.0...0.2.0

  • 0.2.0a1(Jul 20, 2021)
