Confidence intervals for scikit-learn forest algorithms

Overview

forest-confidence-interval: Confidence intervals for Forest algorithms

Forest algorithms are powerful ensemble methods for classification and regression. However, their predictions are subject to error, and the variability of a prediction shows how strongly it depends on the particular training set used to fit the random forest.

forest-confidence-interval is a Python module that adds variance estimation and confidence intervals to the basic functionality of scikit-learn random forest regression and classification objects. The core functions calculate an in-bag matrix and error bars for random forest objects.

Compatible with Python 2.7 and Python 3.6.

This module is based on R code from Stefan Wager (see Important Links below) and is licensed under the MIT open source license (see LICENSE).

Important Links

scikit-learn - http://scikit-learn.org/

Stefan Wager's randomForestCI - https://github.com/swager/randomForestCI (deprecated in favor of grf: https://github.com/swager/grf)

Installation and Usage

Before installing the module you will need numpy, scipy and scikit-learn. Dependencies associated with the previous modules may need root privileges to install. Consult the API Reference for documentation on core functionality.

pip install numpy scipy scikit-learn

You can also install the dependencies with:

pip install -r requirements.txt

To install forest-confidence-interval execute:

pip install forestci

or, if you are installing from the source code:

python setup.py install

If you would like to install the development version of the software, use:

pip install git+https://github.com/scikit-learn-contrib/forest-confidence-interval.git

Why use forest-confidence-interval?

Our software is designed for individuals using scikit-learn random forest objects who want to add estimates of uncertainty to random forest predictors. Prediction variability demonstrates how much the training set influences results and is important for estimating standard errors. forest-confidence-interval is a Python module for calculating variance and adding confidence intervals to random forests from the popular Python library scikit-learn. The software is compatible with both random forest regression and classification objects.
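
As a minimal sketch of how a variance estimate can be turned into an interval, the snippet below assumes a normal approximation, which is an assumption of this example rather than something stated in the package documentation. The helper name normal_interval is hypothetical, and clipping negative variance estimates to zero is an illustrative choice only.

```python
from math import sqrt
from statistics import NormalDist


def normal_interval(y_hat, variance, level=0.95):
    """Return (lower, upper) for one prediction under a normal approximation.

    `normal_interval` is a hypothetical helper, not part of forestci.
    """
    # Variance estimates can come out negative; clip them (illustrative choice).
    std_err = sqrt(max(variance, 0.0))
    # Two-sided critical value for the requested coverage level.
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return y_hat - z * std_err, y_hat + z * std_err


# Made-up prediction and variance, for illustration only.
lower, upper = normal_interval(y_hat=24.0, variance=4.0)
print(f"95% interval: [{lower:.2f}, {upper:.2f}]")
```

In practice the variance would come from the package's error estimate for each test point; the numbers above are placeholders.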

Examples

The examples (gallery below) demonstrate the package functionality with random forest classifiers and regression models. The regression example uses a popular UCI Machine Learning data set on cars, while the classifier example shows how to add measurements of uncertainty to tasks like predicting spam emails.

Examples gallery

Contributing

Contributions are very welcome, but we ask that contributors abide by the contributor covenant.

To report issues with the software, please post to the issue log. Bug reports are also appreciated; please add them to the issue log after verifying that the issue does not already exist. Comments on existing issues are also welcome.

Please submit improvements as pull requests against the repo after verifying that the existing tests pass and any new code is well covered by unit tests. Please write code that complies with the Python style guide, PEP8.

E-mail Ariel Rokem, Kivan Polimis, or Bryna Hazelton if you have any questions, suggestions or feedback.

Testing

Testing requires the nose package. Tests are located in the forestci/tests folder and can be run with the nosetests command in the main directory.

Citation

This project is described in an article in the Journal of Open Source Software (JOSS). The BibTeX citation for the JOSS article is below:

@article{polimisconfidence,
  title={Confidence Intervals for Random Forests in Python},
  author={Polimis, Kivan and Rokem, Ariel and Hazelton, Bryna},
  journal={Journal of Open Source Software},
  volume={2},
  number={1},
  year={2017}
}
Comments
  • ENH: Allow forestci to work on general Bagging estimators

    Resolves #99

    This PR adds functionality to forestci.py to inspect the "forest" estimator to see if it is a random forest (i.e. inherits from BaseForest) or a bagging estimator (i.e. inherits from BaseBagging). There are some differences in the private attributes of these classes so the distinction is necessary. When the estimator is a random forest, all of the existing code applies. When it inherits from BaseBagging, we use the .estimators_samples_ attribute for the calc_inbag function. And when calibrating inside random_forest_error, it is also necessary to randomly permute the _seeds array attribute of new_forest. I've also added some tests for these new features.

    I believe this PR makes forestci work well with general bagging estimators. However, I would greatly appreciate it if @arokem, @kpolimis, @bhazelton could check my work here. Most importantly, is this sensible? I think I've made the APIs compatible but am I making a mistake in applying Wager's method to general bagging methods (and not exclusively to random forests)?

    opened by richford 7
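
The in-bag bookkeeping that the PR above relies on can be sketched without touching scikit-learn internals. This is a minimal illustration, assuming each estimator exposes the array of training-row indices it was fitted on (as the description of .estimators_samples_ suggests); the function name inbag_from_indices and the toy index arrays are hypothetical and not part of forestci.

```python
import numpy as np


def inbag_from_indices(n_samples, sample_indices):
    """Count how often each training row appears in each estimator's sample.

    Returns an integer array of shape (n_samples, n_estimators).
    `inbag_from_indices` is a hypothetical helper, not part of forestci.
    """
    inbag = np.zeros((n_samples, len(sample_indices)), dtype=int)
    for t_idx, indices in enumerate(sample_indices):
        # bincount tallies repeated draws; minlength keeps rows never drawn at 0.
        inbag[:, t_idx] = np.bincount(indices, minlength=n_samples)
    return inbag


# Toy bootstrap draws for 4 training rows and 2 estimators (made-up indices).
toy_indices = [np.array([0, 0, 2, 3]), np.array([1, 1, 1, 3])]
inbag = inbag_from_indices(4, toy_indices)
print(inbag)
```

Each column sums to the bootstrap sample size, which is a cheap sanity check on whatever index source is used.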
  • Bug memory kws

    Just tried out this package, looks like a great implementation.

    I ran this on a large dataset (much bigger than memory) and ran into the following problem that the keywords were not being passed along. Was there a reason for this?

    If not, small fix is in this PR.

    opened by owlas 7
  • negative V_IJ_unbiased

    Hi,

    first of all, great work, this is a great tool! I have a couple of questions based on issues I've encountered when playing with the package. Apologies if these reveal my misunderstanding rather than an actual issue with the coding.

    1. When running the confidence interval calculation on a forest I trained, I encounter negative values of the unbiased variances. Additionally, the more trees my forest has, the more of these negative values appear. Could there be some kind of bias overcorrection?

    2. The _bias_correction function in the module calculates an n_var parameter that it then applies to the bias correction vector. However, no such expression appears in Eqn. (7) of Wager et al. (2014), according to which the bias correction should be n_train_samples * boot_var / n_trees (using the variable names from the package code). Where does n_var come from?

    3. I don't see any parameter regulating the number of bootstrap draws. Even though O(n) draws should be enough to take care of the Monte Carlo noise, it should still be possible to control this somehow. If I change the n_samples parameter, this clashes with the pred matrix, which is fixed to the number of trees in the forest. How to regulate the number of draws?

    4. In fact, if I'm reading the paper right, the idea is to look at how the predictions from the individual trees change when using different bootstrap samples of the original data. That doesn't seem to be what the package is doing, which is using predictions from a single forest on a set of test data instead of predictions of multiple forests of a single new sample. Where is my understanding wrong?

    Thanks and again, let me know if what I'm asking is off-topic for here.

    Ondrej

    opened by ondrejiayc 7
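
The estimator discussed in points 1 and 2 above can be sketched in a few lines of NumPy. This is a minimal illustration and not the package's implementation: it assumes per-tree predictions are stored as (n_test, n_trees) and centered over trees, uses the bias term n_train_samples * boot_var / n_trees quoted in point 2, and runs on synthetic noise. The name v_ij_unbiased is hypothetical.

```python
import numpy as np


def v_ij_unbiased(inbag, pred):
    """Infinitesimal-jackknife variance minus n_train * boot_var / n_trees.

    inbag: (n_train, n_trees) bootstrap counts.
    pred:  (n_test, n_trees) per-tree predictions.
    `v_ij_unbiased` is a hypothetical name, not the forestci function.
    """
    n_train, n_trees = inbag.shape
    # Center each test point's predictions around its mean over trees.
    pred_centered = pred - pred.mean(axis=1, keepdims=True)
    # Covariance of in-bag counts with predictions, squared, summed over rows.
    v_ij = np.sum((np.dot(inbag - 1, pred_centered.T) / n_trees) ** 2, axis=0)
    # Monte Carlo bias term.
    boot_var = np.square(pred_centered).sum(axis=1) / n_trees
    return v_ij - n_train * boot_var / n_trees


# Synthetic stand-ins: multinomial bootstrap counts and pure-noise predictions.
rng = np.random.default_rng(0)
n_train, n_trees, n_test = 50, 200, 5
inbag = rng.multinomial(n_train, np.full(n_train, 1.0 / n_train), size=n_trees).T
pred = rng.normal(size=(n_test, n_trees))

variance = v_ij_unbiased(inbag, pred)
print(variance.shape, int((variance < 0).sum()), "negative estimates")
```

Because the correction is subtracted from a noisy quantity, the result is not constrained to be non-negative, which is consistent with the observation in point 1.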
  • MRG: Calibration with empirical Bayes.

    This is the work of hzhao16 from #48, but without some large data files that got added into the history along the way. It also adds several PEP8 fixes and more comprehensive testing.

    This extends and supersedes #48

    opened by arokem 6
  • Not compatible with SKLearn version 0.22.1

    A newer version of SciKit Learn modified _generate_sample_indices() to require an additional n_samples_bootstrap argument, thus the current version of the code will raise a TypeError: _generate_sample_indices() missing 1 required positional argument: 'n_samples_bootstrap' when running fci.random_forest_error(mpg_forest, mpg_X_train, mpg_X_test).

    opened by csanadpoda 4
  • Usage in practised application

    Hi,

    Firstly, thanks for the amazing work! I just have a question: how are we supposed to use the error bars, specifically for the RandomForestClassifier? The example only uses the result for plotting ...

    Thanks and look forward to hearing from you

    opened by JIAZHEN 4
  • Error running plot_mpg notebook

    I ran the plot_mpg notebook code:

    
    # Regression Forest Example
    import numpy as np
    from matplotlib import pyplot as plt
    from sklearn.ensemble import RandomForestRegressor
    import sklearn.cross_validation as xval
    from sklearn.datasets.mldata import fetch_mldata
    import forestci as fci
    
    # retreive mpg data from machine learning library
    mpg_data = fetch_mldata('mpg')
    
    # separate mpg data into predictors and outcome variable
    mpg_X = mpg_data["data"]
    mpg_y = mpg_data["target"]
    
    # split mpg data into training and test set
    mpg_X_train, mpg_X_test, mpg_y_train, mpg_y_test = xval.train_test_split(
                                                       mpg_X, mpg_y,
                                                       test_size=0.25,
                                                       random_state=42
                                                       )
    
    # create RandomForestRegressor
    n_trees = 2000
    mpg_forest = RandomForestRegressor(n_estimators=n_trees, random_state=42)
    mpg_forest.fit(mpg_X_train, mpg_y_train)
    mpg_y_hat = mpg_forest.predict(mpg_X_test)
    
    # calculate inbag and unbiased variance
    mpg_inbag = fci.calc_inbag(mpg_X_train.shape[0], mpg_forest)
    mpg_V_IJ_unbiased = fci.random_forest_error(mpg_forest, mpg_X_train,
                                                mpg_X_test)
    
    # Plot error bars for predicted MPG using unbiased variance
    plt.errorbar(mpg_y_test, mpg_y_hat, yerr=np.sqrt(mpg_V_IJ_unbiased), fmt='o')
    plt.plot([5, 45], [5, 45], '--')
    plt.xlabel('Reported MPG')
    plt.ylabel('Predicted MPG')
    plt.show()
    

    and got the following error:

    TypeError                                 Traceback (most recent call last)
    <ipython-input-2-a0d96d55b892> in <module>()
         30 mpg_inbag = fci.calc_inbag(mpg_X_train.shape[0], mpg_forest)
         31 mpg_V_IJ_unbiased = fci.random_forest_error(mpg_forest, mpg_X_train,
    ---> 32                                             mpg_X_test)
         33 
         34 # Plot error bars for predicted MPG using unbiased variance
    
    TypeError: random_forest_error() missing 1 required positional argument: 'X_test'
    
    My environment is Anaconda python 4.3.1.
    
    Charles
    
    opened by CBrauer 4
  • Receiving strange TypeError

    I have the following code:

    df = pd.read_csv('data.csv', header=0, engine='c')
    mat = df.as_matrix()
    X = mat[:, 1:]
    X_train, X_test = train_test_split(X, test_size = 0.2)
    variance = forestci.random_forest_error(model, X_train, X_test)
    

    When I run it, it throws the error TypeError: random_forest_error() takes exactly 4 arguments (3 given).

    However, there are only three non-optional arguments listed in the documentation. If I add a fourth argument for inbag, I then get an error saying that inbag is defined twice. Any ideas of what's causing this? I'm happy to write a PR if you point me towards the cause.

    opened by finbarrtimbers 4
  • Handle MultiOutput model

    Hi, I suggest this modification to handle multi-output estimators. This will solve Issue https://github.com/scikit-learn-contrib/forest-confidence-interval/issues/54, i.e., the oldest open issue on this repo!

    Scikit-Learn's RandomForestRegressor can automatically switch to a MultiOutput model if the y_train contains multiple targets. However, forest-confidence-interval could not handle them.

    One solution would be to compute and return a 2-dim array with the variance for each target, for each sample. However, this would break some backward compatibility (because it would make sense to print a 2-d (1,N)-array even with one target) and, more importantly, it would require an extensive check of all the tensor operations. What I propose here is to input y_output (int), telling the program which output to use. This may not be the most efficient solution, as there is some redundancy in running random_forest_error() once for each output, but it is very intuitive to understand, fully backward compatible, and a simple modification.

    Thanks again for this nice project, to which I'm happy to contribute for the second time. I hope this gets merged soon.

    Daniele

    opened by danieleongari 3
  • Warning: sklearn.ensemble.forest module is deprecated in version 0.22

    Hi, when I use forestci, which is great, I get the following warning, which is harmless for now:

    The sklearn.ensemble.forest module is deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.ensemble. Anything that cannot be imported from sklearn.ensemble is now part of the private API.

    It might hit us in the future

    opened by sq5rix 3
  • Error with `random_forest_error`

    Submitting an error report here, just for record purposes.

    With the following line:

    pred_error = fci.random_forest_error(clf, X_train=X_train, X_test=X_test, inbag=None)
    

    I get the following error:

    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-60-79c18cb1c841> in <module>()
    ----> 1 pred_error = fci.random_forest_error(clf, X_train=X_train, X_test=X_test, inbag=None)
    
    ~/anaconda/envs/targetpred/lib/python3.6/site-packages/forestci/forestci.py in random_forest_error(forest, inbag, X_train, X_test)
        115     pred_centered = pred - pred_mean
        116     n_trees = forest.n_estimators
    --> 117     V_IJ = _core_computation(X_train, X_test, inbag, pred_centered, n_trees)
        118     V_IJ_unbiased = _bias_correction(V_IJ, inbag, pred_centered, n_trees)
        119     return V_IJ_unbiased
    
    ~/anaconda/envs/targetpred/lib/python3.6/site-packages/forestci/forestci.py in _core_computation(X_train, X_test, inbag, pred_centered, n_trees)
         57 
         58     for t_idx in range(n_trees):
    ---> 59         inbag_r = (inbag[:, t_idx] - 1).reshape(-1, 1)
         60         pred_c_r = pred_centered.T[t_idx].reshape(1, -1)
         61         cov_hat += np.dot(inbag_r, pred_c_r) / n_trees
    
    TypeError: 'NoneType' object is not subscriptable
    

    I am using version 0.1.0, installed from pip.

    I think a new release is required; after inspecting the source code, I'm seeing that inbag=None is no longer a required keyword argument (contrary to what my installed version is saying), and that inbag=None is handled correctly in the GitHub version (contrary to how my installed version is working).

    opened by ericmjl 3
  • New Release

    Hello, would it be possible to create a new release to include #111 in a release version? :)
    I'm not a fan of having to pull git versions of packages.
    Thank you!

    opened by DasCapschen 1
  • Array dimensions incorrect for confidence intervals

    Hi,

    I'm trying to create error estimates and am using RandomForestRegressor with bootstrapping enabled. I am using data with dimensions:

    x_test [10,13] x_train [90,13] y_test [10,2] y_train [90,2]

    I then generate errors using:

    y_error = fci.random_forest_error(self.model, self.x_train, self.x_test)
    
    

    However I get the error:

    Generating point estimates...
    [Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
    [Parallel(n_jobs=4)]: Done  33 tasks      | elapsed:    0.0s
    [Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed:    0.0s finished
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    /tmp/ipykernel_2626600/1096083143.py in <module>
    ----> 1 point_estimates = model.point_estimate(save_estimates=True, make_plots=False)
          2 print(point_estimates)
    
    /scratch/wiay/lara/galpro/galpro/model.py in point_estimate(self, save_estimates, make_plots)
        158         # Use the model to make predictions on new objects
        159         y_pred = self.model.predict(self.x_test)
    --> 160         y_error = fci.random_forest_error(self.model, self.x_train, self.x_test)
        161 
        162         # Update class variables
    
    ~/.local/lib/python3.7/site-packages/forestci/forestci.py in random_forest_error(forest, X_train, X_test, inbag, calibrate, memory_constrained, memory_limit)
        279     n_trees = forest.n_estimators
        280     V_IJ = _core_computation(
    --> 281         X_train, X_test, inbag, pred_centered, n_trees, memory_constrained, memory_limit
        282     )
        283     V_IJ_unbiased = _bias_correction(V_IJ, inbag, pred_centered, n_trees)
    
    ~/.local/lib/python3.7/site-packages/forestci/forestci.py in _core_computation(X_train, X_test, inbag, pred_centered, n_trees, memory_constrained, memory_limit, test_mode)
        135     """
        136     if not memory_constrained:
    --> 137         return np.sum((np.dot(inbag - 1, pred_centered.T) / n_trees) ** 2, 0)
        138 
        139     if not memory_limit:
    
    <__array_function__ internals> in dot(*args, **kwargs)
    
    ValueError: shapes (90,100) and (100,10,2) not aligned: 100 (dim 1) != 10 (dim 1)
    

    Does anyone have any idea what is going wrong here?? Thanks!

    opened by ljaniurek 1
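
The shape mismatch in the traceback above can be reproduced with placeholder arrays. This is a sketch under stated assumptions: the sizes are taken from the error message, the array contents are random stand-ins, and selecting a single output through the index y_output is one possible workaround rather than a description of how the package behaves.

```python
import numpy as np

n_train, n_test, n_trees, n_outputs = 90, 10, 100, 2
rng = np.random.default_rng(0)

# Placeholder in-bag counts and centered predictions (random stand-ins).
inbag = rng.integers(0, 3, size=(n_train, n_trees))
pred_centered = rng.normal(size=(n_outputs, n_test, n_trees))

try:
    # (90, 100) dot (100, 10, 2): np.dot contracts with the second-to-last axis.
    np.dot(inbag - 1, pred_centered.T)
    failed = False
except ValueError as err:
    failed = True
    print("multi-output:", err)

# Selecting one output restores the 2-D case: (90, 100) dot (100, 10).
y_output = 0  # hypothetical index of the target to analyse
cov = np.dot(inbag - 1, pred_centered[y_output].T) / n_trees
v_ij = np.sum(cov ** 2, axis=0)
print(v_ij.shape)
```

Looping y_output over range(n_outputs) would give one variance vector per target, at the cost of repeating the computation.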
  • Benchmarking confidence intervals

    For my dataset, I tried correlating the CIs to absolute error on the test set, and didn't find a relationship. I do get a relationship if I use the standard deviation of the predictions from individual decision trees. Do you see this with other datasets?

    opened by cyrusmaher 1
  • Can this package be adapted to perform Thompson sampling?

    I’m looking at using random forest regressors to perform hyperparameter tuning in a Bayesian optimization setup. While you can use the upper confidence bound to explore your state space, Thompson sampling performs better and eliminates the need for tuning the hyper-hyperparameter of the confidence interval used for selection. One solution is to obtain an empirical Bayesian posterior by training many random forest regressors on bootstrapped data, but this seems like overkill (ensembles of ensembles!). I would appreciate any input on the subject. Thank you! (For more discussion, see this review of using CART decision trees to pull off the goal: https://arxiv.org/pdf/1706.04687.pdf)

    opened by douglasmason 0
  • Sum taken over wrong axis

    Hi there,

    I believe the centered predictions are being computed incorrectly. Line 278 in forestci.py takes the average over the predictions, as opposed to the trees. The resulting shape of pred_mean is (forest.n_estimators,) when it should be (X_test.shape[0],). See below:

    https://github.com/scikit-learn-contrib/forest-confidence-interval/blob/6d2a9c285b96bd415ad5ed03f37e517740a47fa2/forestci/forestci.py#L278

    Thanks for the great package otherwise! :)

    opened by bchugg 2
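
The axis question raised above is easy to check on a toy array. This sketch assumes the per-tree predictions are laid out as (n_test, n_trees), which is an assumption of the example; the values are made up.

```python
import numpy as np

# Made-up per-tree predictions: 3 test points (rows) x 4 trees (columns).
pred = np.array([[1.0, 2.0, 3.0, 4.0],
                 [10.0, 10.0, 10.0, 10.0],
                 [0.0, 1.0, 0.0, 1.0]])

mean_over_points = np.mean(pred, axis=0)  # one value per tree, shape (n_trees,)
mean_over_trees = np.mean(pred, axis=1)   # one value per test point, shape (n_test,)

# Centering over trees needs the mean as a column so it broadcasts across trees.
pred_centered = pred - mean_over_trees[:, np.newaxis]
print(mean_over_points.shape, mean_over_trees.shape)
print(pred_centered.sum(axis=1))
```

After centering over trees, every row sums to zero, which makes the intended axis easy to verify in a unit test.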
  • ValueError on multiple output problems

    Training set is of the form (n_training_samples, n_features) = (14175, 34).
    Testing set is of the form (n_testing_samples, n_features) = (4725, 34).
    Running forestci.random_forest_error(randomFor, X_train, X_test) yields the following error:

    ValueError                                Traceback (most recent call last)
    in
         21 print(X_test.shape)
         22 mpg_V_IJ_unbiased = forestci.random_forest_error(randomFor, X_train,
    ---> 23                                                   X_test)
         24 hat = randomFor.predict(X_test)
         25 print(' The score for is {}'.format(score[-13::]))

    ~\Anaconda3\lib\site-packages\forestci\forestci.py in random_forest_error(forest, X_train, X_test, inbag, calibrate, memory_constrained, memory_limit)
        241     n_trees = forest.n_estimators
        242     V_IJ = _core_computation(X_train, X_test, inbag, pred_centered, n_trees,
    --> 243                              memory_constrained, memory_limit)
        244     V_IJ_unbiased = _bias_correction(V_IJ, inbag, pred_centered, n_trees)
        245

    ~\Anaconda3\lib\site-packages\forestci\forestci.py in _core_computation(X_train, X_test, inbag, pred_centered, n_trees, memory_constrained, memory_limit, test_mode)
        110     """
        111     if not memory_constrained:
    --> 112         return np.sum((np.dot(inbag - 1, pred_centered.T) / n_trees) ** 2, 0)
        113
        114     if not memory_limit:

    <__array_function__ internals> in dot(*args, **kwargs)

    ValueError: shapes (14175,700) and (700,4725,2) not aligned: 700 (dim 1) != 4725 (dim 1)

    opened by IguanasInPyjamas 1
Releases(0.6)