CinnaMon is a Python library which offers a number of tools to detect, explain, and correct data drift in a machine learning system

Overview

CinnaMon


MIT_license


CinnaMon is a Python library which offers a number of tools to detect, explain, and correct data drift in a machine learning system. At its core, CinnaMon allows to study data drift between two given datasets. It is particularly useful in a monitoring context where the first dataset is the training (or validation) data and the second dataset is the production data.

⚡️ Quickstart

As a quick example, let's illustrate the use of CinnaMon on the breast cancer data where we voluntarily introduce some data drift.

Setup the data and build a model

>>> import pandas as pd
>>> from sklearn import datasets
>>> from sklearn.model_selection import train_test_split
>>> from xgboost import XGBClassifier

# load breast cancer data
>>> dataset = datasets.load_breast_cancer()
>>> X = pd.DataFrame(dataset.data, columns = dataset.feature_names)
>>> y = dataset.target

# split data in train and valid dataset
>>> X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=2021)

# introduce some data drift in valid by filtering with 'worst symmetry' feature
>>> y_valid = y_valid[X_valid['worst symmetry'].values > 0.3]
>>> X_valid = X_valid.loc[X_valid['worst symmetry'].values > 0.3, :].copy()

# fit a XGBClassifier on the training data
>>> clf = XGBClassifier(use_label_encoder=False)
>>> clf.fit(X=X_train, y=y_train, verbose=10)

Initialize ModelDriftExplainer and fit it on train and validation data

>>> from cinnamon.drift import ModelDriftExplainer

# initialize a drift explainer with the built XGBClassifier and fit it on train
# and valid data
>>> drift_explainer = ModelDriftExplainer(model=clf)
>>> drift_explainer.fit(X1=X_train, X2=X_valid, y1=y_train, y2=y_valid)

Detect data drift by looking at main graphs and metrics

# Distribution of logit predictions
>>> drift_explainer.plot_prediction_drift(bins=15)

plot_prediction_drift

We can see on this graph that because of the data drift we introduced in validation data the distribution of predictions are different (they do not overlap well). We can also compute the corresponding drift metrics:

# Corresponding metrics
>>> drift_explainer.get_prediction_drift()
[{'mean_difference': -3.643428434667366,
  'wasserstein': 3.643428434667366,
  'kolmogorov_smirnov': KstestResult(statistic=0.2913775225333014, pvalue=0.00013914094110123454)}]

Comparing the distributions of predictions for two datasets is one of the main indicator we use in order to detect data drift. The two other indicators are:

  • distribution of the target (see get_target_drift)
  • performance metrics (see get_performance_metrics_drift)

Explain data drift by computing the drift values

Drift values can be thought as equivalent of feature importance but in terms of data drift.

# plot drift values
>>> drift_explainer.plot_tree_based_drift_values(n=7)

plot_drift_values

Here the feature worst symmetry is rightly identified as the one which contributes the most to the data drift.

More

See "notes" below to explore all the functionalities of CinnaMon.

🛠 Installation

CinnaMon is intended to work with Python 3.9 or above. Installation can be done with pip:

pip install cinnamon

🔗 Notes

  • The two main classes of CinnaMon are ModelDriftExplainer and AdversarialDriftExplainer

  • ModelDriftExplainer currently only support XGBoost models (both regression and classification are supported)

  • See notebooks in the examples/ directory to have an overview of all functionalities. Notably:

    These two notebooks also go deeper into the topic of how to correct data drift, making use of AdversarialDriftExplainer

  • See also the slide presentation of the CinnaMon library.

  • There is (yet) no formal documentation for CinnaMon but docstrings are up to date for the two main classes.

👍 Contributing

Check out the contribution section.

📝 License

CinnaMon is free and open-source software licensed under the MIT.

You might also like...
STUMPY is a powerful and scalable Python library for computing a Matrix Profile, which can be used for a variety of time series data mining tasks
STUMPY is a powerful and scalable Python library for computing a Matrix Profile, which can be used for a variety of time series data mining tasks

STUMPY STUMPY is a powerful and scalable library that efficiently computes something called the matrix profile, which can be used for a variety of tim

🔬 A curated list of awesome machine learning strategies & tools in financial market.

🔬 A curated list of awesome machine learning strategies & tools in financial market.

Covid-polygraph - a set of Machine Learning-driven fact-checking tools

Covid-polygraph, a set of Machine Learning-driven fact-checking tools that aim to address the issue of misleading information related to COVID-19.

Python Automated Machine Learning library for tabular data.
Python Automated Machine Learning library for tabular data.

Simple but powerful Automated Machine Learning library for tabular data. It uses efficient in-memory SAP HANA algorithms to automate routine Data Scie

Predico Disease Prediction system based on symptoms provided by patient- using Python-Django & Machine Learning

Predico Disease Prediction system based on symptoms provided by patient- using Python-Django & Machine Learning

This is a Machine Learning model which predicts the presence of Diabetes in Patients

Diabetes Disease Prediction This is a machine Learning mode which tries to determine if a person has a diabetes or not. Data The dataset is in comma s

Data science, Data manipulation and Machine learning package.
Data science, Data manipulation and Machine learning package.

duality Data science, Data manipulation and Machine learning package. Use permitted according to the terms of use and conditions set by the attached l

Data Version Control or DVC is an open-source tool for data science and machine learning projects
Data Version Control or DVC is an open-source tool for data science and machine learning projects

Continuous Machine Learning project integration with DVC Data Version Control or DVC is an open-source tool for data science and machine learning proj

Upgini : data search library for your machine learning pipelines

Automated data search library for your machine learning pipelines → find & deliver relevant external data & features to boost ML accuracy :chart_with_upwards_trend:

Comments
  • Some feedback and some questions

    Some feedback and some questions

    Hi!

    This looks like a great project! I have a few concerns about using a hypothesis based test for comparison of drift - reason being, how do you account for the multiple comparison's problem? https://en.wikipedia.org/wiki/Multiple_comparisons_problem

    You do get some more explanatory power by looking at the plots, to be sure. I was thinking maybe you could include some permutation tests to deal with this, instead of relying on KS? Here is a reference: http://sia.webpopix.org/statisticalTests2.html and here is some in Python: https://ericschles.github.io/cuny_intro_to_ds_book/12/1/AB_Testing.html?highlight=permutation (important to note even though this is my teaching resource, it is lifted from some content from berkeley).

    Anyway, great job!

    opened by EricSchles 3
  • error after trying to execute the command:

    error after trying to execute the command: "from cinnamon.drift import ModelDriftExplainer"

    [1I ] am getting the following error when trying to execute code from Quickstart or [breast_cancer_xgboost_binary_classif.ipynb] in a section containing "from cinnamon.drift import ModelDriftExplainer":

    ModuleNotFoundError Traceback (most recent call last) ~\AppData\Local\Temp/ipykernel_10348/627594479.py in 1 # Initialize ModelDriftExplainer and fit it on train and validation data ----> 2 from cinnamon.drift import ModelDriftExplainer 3 4 # initialize a drift explainer with the built XGBClassifier and fit it on train 5 # and valid data

    ~\AppData\Roaming\Python\Python39\site-packages\cinnamon\drift_init_.py in 1 from .adversarial_drift_explainer import AdversarialDriftExplainer ----> 2 from .model_drift_explainer import ModelDriftExplainer

    ~\AppData\Roaming\Python\Python39\site-packages\cinnamon\drift\model_drift_explainer.py in 7 from ..model_parser.i_model_parser import IModelParser 8 from .adversarial_drift_explainer import AdversarialDriftExplainer ----> 9 from ..model_parser.xgboost_parser import XGBoostParser 10 11 from .drift_utils import compute_drift_num, plot_drift_num

    ~\AppData\Roaming\Python\Python39\site-packages\cinnamon\model_parser\xgboost_parser.py in 2 import pandas as pd 3 from typing import Tuple ----> 4 from .single_tree import BinaryTree 5 import xgboost 6 from .abstract_tree_ensemble_parser import AbstractTreeEnsembleParser

    ~\AppData\Roaming\Python\Python39\site-packages\cinnamon\model_parser\single_tree.py in 1 import numpy as np ----> 2 from treelib import Tree 3 from ..common.constants import TreeBasedDriftValueType 4 5 class BinaryTree:

    ModuleNotFoundError: No module named 'treelib'

    ​[2] When I'm executing the code chunk "# fit an XGBClassifier on the training data" from "Quickstart" I've got this warning:

    [20:53:12] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.5.1/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior. XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1, enable_categorical=False, gamma=0, gpu_id=-1, importance_type=None, interaction_constraints='', learning_rate=0.300000012, max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan, monotone_constraints='()', n_estimators=100, n_jobs=6, num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact', use_label_encoder=False, validate_parameters=1, verbosity=None)

    I use Python 3.8.8/ Win10 installed on the AMD Ryzen with integrated graphics (AMD). Environment: Anaconda

    opened by tomaszek0 2
  • build(deps): bump pillow from 8.4.0 to 9.0.0

    build(deps): bump pillow from 8.4.0 to 9.0.0

    Bumps pillow from 8.4.0 to 9.0.0.

    Release notes

    Sourced from pillow's releases.

    9.0.0

    https://pillow.readthedocs.io/en/stable/releasenotes/9.0.0.html

    Changes

    ... (truncated)

    Changelog

    Sourced from pillow's changelog.

    9.0.0 (2022-01-02)

    • Restrict builtins for ImageMath.eval(). CVE-2022-22817 #5923 [radarhere]

    • Ensure JpegImagePlugin stops at the end of a truncated file #5921 [radarhere]

    • Fixed ImagePath.Path array handling. CVE-2022-22815, CVE-2022-22816 #5920 [radarhere]

    • Remove consecutive duplicate tiles that only differ by their offset #5919 [radarhere]

    • Improved I;16 operations on big endian #5901 [radarhere]

    • Limit quantized palette to number of colors #5879 [radarhere]

    • Fixed palette index for zeroed color in FASTOCTREE quantize #5869 [radarhere]

    • When saving RGBA to GIF, make use of first transparent palette entry #5859 [radarhere]

    • Pass SAMPLEFORMAT to libtiff #5848 [radarhere]

    • Added rounding when converting P and PA #5824 [radarhere]

    • Improved putdata() documentation and data handling #5910 [radarhere]

    • Exclude carriage return in PDF regex to help prevent ReDoS #5912 [hugovk]

    • Fixed freeing pointer in ImageDraw.Outline.transform #5909 [radarhere]

    • Added ImageShow support for xdg-open #5897 [m-shinder, radarhere]

    • Support 16-bit grayscale ImageQt conversion #5856 [cmbruns, radarhere]

    • Convert subsequent GIF frames to RGB or RGBA #5857 [radarhere]

    ... (truncated)

    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 1
  • TypeError: predict() got an unexpected keyword argument 'iteration_range'

    TypeError: predict() got an unexpected keyword argument 'iteration_range'

    Hi cinnamon team, Firstly, thanks for bringing such a cool package!

    I was working with your package and I have come across the following error. Then, I checked your example notebook examples/boston_XGBoost_ModelDriftExplainer.ipynb, to be sure whether I used it correctly, but got the same error:

    TypeError: predict() got an unexpected keyword argument 'iteration_range'
    

    Screenshot 2022-03-11 at 00 26 44

    Could you please let me know how to overcome this issue (maybe I am using an obsolete version of a package)?

    Environment details:

    • macOS v.12.1
    • Python 3.8.8
    • cinnamon==0.1.2
    • xgboost==1.4.2

    Thanks for your help in advance!

    opened by furkanmtorun 0
Releases(0.2)
  • 0.2(Dec 9, 2022)

    Update to “ModelDriftExplainer”:

    • Add model agnostic support (deals with black box models / pipelines)
    • Add model specific support for CatBoost
    • Add support for categorical features
    • Add support for prediction_type = “class”

    Create a documentation website.

    Source code(tar.gz)
    Source code(zip)
Owner
Zelros
IA for Augmented Insurers
Zelros
Formulae is a Python library that implements Wilkinson's formulas for mixed-effects models.

formulae formulae is a Python library that implements Wilkinson's formulas for mixed-effects models. The main difference with other implementations li

34 Dec 21, 2022
LinearRegression2 Tvads and CarSales

LinearRegression2_Tvads_and_CarSales This project infers the insight that how the TV ads for cars and car Sales are being linked with each other. It i

Ashish Kumar Yadav 1 Dec 29, 2021
🌲 Implementation of the Robust Random Cut Forest algorithm for anomaly detection on streams

🌲 Implementation of the Robust Random Cut Forest algorithm for anomaly detection on streams

Real-time water systems lab 416 Jan 06, 2023
A Lightweight Hyperparameter Optimization Tool 🚀

The mle-hyperopt package provides a simple and intuitive API for hyperparameter optimization of your Machine Learning Experiment (MLE) pipeline.

Robert Lange 137 Dec 02, 2022
Optimal Randomized Canonical Correlation Analysis

ORCCA Optimal Randomized Canonical Correlation Analysis This project is for the python version of ORCCA algorithm. It depends on Numpy for matrix calc

Yinsong Wang 1 Nov 21, 2021
Machine Learning for Time-Series with Python.Published by Packt

Machine-Learning-for-Time-Series-with-Python Become proficient in deriving insights from time-series data and analyzing a model’s performance Links Am

Packt 124 Dec 28, 2022
Kaggle Competition using 15 numerical predictors to predict a continuous outcome.

Kaggle-Comp.-Data-Mining Kaggle Competition using 15 numerical predictors to predict a continuous outcome as part of a final project for a stats data

moisey alaev 1 Dec 28, 2021
The project's goal is to show a real world application of image segmentation using k means algorithm

The project's goal is to show a real world application of image segmentation using k means algorithm

2 Jan 22, 2022
Short PhD seminar on Machine Learning Security (Adversarial Machine Learning)

Short PhD seminar on Machine Learning Security (Adversarial Machine Learning)

141 Dec 27, 2022
Merlion: A Machine Learning Framework for Time Series Intelligence

Merlion is a Python library for time series intelligence. It provides an end-to-end machine learning framework that includes loading and transforming data, building and training models, post-processi

Salesforce 2.8k Jan 05, 2023
Continuously evaluated, functional, incremental, time-series forecasting

timemachines Autonomous, univariate, k-step ahead time-series forecasting functions assigned Elo ratings You can: Use some of the functionality of a s

Peter Cotton 343 Jan 04, 2023
It is a forest of random projection trees

rpforest rpforest is a Python library for approximate nearest neighbours search: finding points in a high-dimensional space that are close to a given

Lyst 211 Dec 29, 2022
30 Days Of Machine Learning Using Pytorch

Objective of the repository is to learn and build machine learning models using Pytorch. 30DaysofML Using Pytorch

Mayur 119 Nov 24, 2022
LILLIE: Information Extraction and Database Integration Using Linguistics and Learning-Based Algorithms

LILLIE: Information Extraction and Database Integration Using Linguistics and Learning-Based Algorithms Based on the work by Smith et al. (2021) Query

5 Aug 06, 2022
李航《统计学习方法》复现

本项目复现李航《统计学习方法》每一章节的算法 特点: 笔记摘要:在每个文件开头都会有一些核心的摘要 pythonic:这里会用尽可能规范的方式来实现,包括编程风格几乎严格按照PEP8 循序渐进:前期的算法会更list的方式来做计算,可读性比较强,后期几乎完全为numpy.array的计算,并且辅助详

58 Oct 22, 2021
A Python step-by-step primer for Machine Learning and Optimization

early-ML Presentation General Machine Learning tutorials A Python step-by-step primer for Machine Learning and Optimization This github repository gat

Dimitri Bettebghor 8 Dec 01, 2022
Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks

Spark Python Notebooks This is a collection of IPython notebook/Jupyter notebooks intended to train the reader on different Apache Spark concepts, fro

Jose A Dianes 1.5k Jan 02, 2023
Kaggle Tweet Sentiment Extraction Competition: 1st place solution (Dark of the Moon team)

Kaggle Tweet Sentiment Extraction Competition: 1st place solution (Dark of the Moon team)

Artsem Zhyvalkouski 64 Nov 30, 2022
Mesh TensorFlow: Model Parallelism Made Easier

Mesh TensorFlow - Model Parallelism Made Easier Introduction Mesh TensorFlow (mtf) is a language for distributed deep learning, capable of specifying

1.3k Dec 26, 2022
Price Prediction model is used to develop an LSTM model to predict the future market price of Bitcoin and Ethereum.

Price Prediction model is used to develop an LSTM model to predict the future market price of Bitcoin and Ethereum.

2 Jun 14, 2022