Stacked Generalization (Ensemble Learning)

Overview

Stacking (stacked generalization)

PyPI version license

Overview

ikki407/stacking - Simple and useful stacking library, written in Python.

User can use models of scikit-learn, XGboost, and Keras for stacking.
As a feature of this library, all out-of-fold predictions can be saved for further analisys after training.

Description

Stacking (sometimes called stacked generalization) involves training a learning algorithm to combine the predictions of several other learning algorithms. The basic idea is to use a pool of base classifiers, then using another classifier to combine their predictions, with the aim of reducing the generalization error.

This blog is very helpful to understand stacking and ensemble learning.

Usage

See working example:

To run these examples, just run sh run.sh. Note that:

  1. Set train and test dataset under data/input

  2. Created features from original dataset need to be under data/output/features

  3. Models for stacking are defined in scripts.py under scripts folder

  4. Need to define created features in that scripts

  5. Just run sh run.sh (python scripts/XXX.py).

Detailed Usage

  1. Set train dataset with its target data and test dataset.

    FEATURE_LIST_stage1 = {
                    'train':(
                             INPUT_PATH + 'train.csv',
                             FEATURES_PATH + 'train_log.csv',
                            ),
    
                    'target':(
                             INPUT_PATH + 'target.csv',
                            ),
    
                    'test':(
                             INPUT_PATH + 'test.csv',
                             FEATURES_PATH + 'test_log.csv',
                            ),
                    }
  2. Define model classes that inherit BaseModel class, which are used in Stage 1, Stage 2, ..., Stage N.

    # For Stage 1
    PARAMS_V1 = {
            'colsample_bytree':0.80,
            'learning_rate':0.1,"eval_metric":"auc",
            'max_depth':5, 'min_child_weight':1,
            'nthread':4,
            'objective':'binary:logistic','seed':407,
            'silent':1, 'subsample':0.60,
            }
    
    class ModelV1(BaseModel):
            def build_model(self):
                return XGBClassifier(params=self.params, num_round=10)
    
    ...
    
    # For Stage 2
    PARAMS_V1_stage2 = {
                        'penalty':'l2',
                        'tol':0.0001, 
                        'C':1.0, 
                        'random_state':None, 
                        'verbose':0, 
                        'n_jobs':8
                        }
    
    class ModelV1_stage2(BaseModel):
            def build_model(self):
                return LR(**self.params)
  3. Train each models of Stage 1 for stacking.

    m = ModelV1(name="v1_stage1",
                flist=FEATURE_LIST_stage1,
                params = PARAMS_V1,
                kind = 'st'
                )
    m.run()
    
    ...
  4. Train each model(s) of Stage 2 by using the prediction of Stage-1 models.

    FEATURE_LIST_stage2 = {
                'train': (
                         TEMP_PATH + 'v1_stage1_all_fold.csv',
                         TEMP_PATH + 'v2_stage1_all_fold.csv',
                         TEMP_PATH + 'v3_stage1_all_fold.csv',
                         TEMP_PATH + 'v4_stage1_all_fold.csv',
                         ...
                         ),
    
                'target':(
                         INPUT_PATH + 'target.csv',
                         ),
    
                'test': (
                        TEMP_PATH + 'v1_stage1_test.csv',
                        TEMP_PATH + 'v2_stage1_test.csv',
                        TEMP_PATH + 'v3_stage1_test.csv',
                        TEMP_PATH + 'v4_stage1_test.csv',
                        ...                     
                        ),
                }
    
    # Models
    m = ModelV1_stage2(name="v1_stage2",
                    flist=FEATURE_LIST_stage2,
                    params = PARAMS_V1_stage2,
                    kind = 'st',
                    )
    m.run()
  5. Final result is saved as v1_stage2_TestInAllTrainingData.csv.

Prerequisite

  • (MaxOS) Install xgboost first manually: pip install xgboost
  • (Optional) Install paratext: fast csv loading library

Installation

To install stacking, cd to the stacking folder and run the install command**(up-to-date version, recommended)**:

sudo python setup.py install

You can also install stacking from PyPI:

pip install stacking

Files

Details of scripts

  • base.py:
    • Base models for stacking are defined here (using sklearn.base.BaseEstimator).
    • Some models are defined here. e.g., XGBoost, Keras, Vowpal Wabbit.
    • These models are wrapped as scikit-learn like (using sklearn.base.ClassifierMixin, sklearn.base.RegressorMixin).
    • That is, model class has some methods, fit(), predict_proba(), and predict().

New user-defined models can be added here.

Scikit-learn models can be used.

Base model have some arguments.

  • 's': Stacking. Saving oof(out-of-fold) prediction({model_name}_all_fold.csv) and average of test prediction based on train-fold models({model_name}_test.csv). These files will be used for next level stacking.

  • 't': Training with all data and predict test({model_name}_TestInAllTrainingData.csv). In this training, no validation data are used.

  • 'st': Stacking and then training with all data and predict test ('s' and 't').

  • 'cv': Only cross validation without saving the prediction.

Define several models and its parameters used for stacking. Define task details on the top of script. Train and test feature set are defined here. Need to define CV-fold index.

Any level stacking can be defined.

PredictionFiles

Reference

[1] Wolpert, David H. Stacked generalization, Neural Networks, 5(2), 241-259

[2] Ensemble learning(Stacking)

[3] KAGGLE ENSEMBLING GUIDE

Owner
Ikki Tanaka
Data Scientist, Machine Learning/Reinforcement Learning Engineer. Kaggle Master.
Ikki Tanaka
Automatically create Faiss knn indices with the most optimal similarity search parameters.

It selects the best indexing parameters to achieve the highest recalls given memory and query speed constraints.

Criteo 419 Jan 01, 2023
Model Validation Toolkit is a collection of tools to assist with validating machine learning models prior to deploying them to production and monitoring them after deployment to production.

Model Validation Toolkit is a collection of tools to assist with validating machine learning models prior to deploying them to production and monitoring them after deployment to production.

FINRA 25 Dec 28, 2022
Pandas Machine Learning and Quant Finance Library Collection

Pandas Machine Learning and Quant Finance Library Collection

148 Dec 07, 2022
🚪✊Knock Knock: Get notified when your training ends with only two additional lines of code

Knock Knock A small library to get a notification when your training is complete or when it crashes during the process with two additional lines of co

Hugging Face 2.5k Jan 07, 2023
Automated Machine Learning Pipeline for tabular data. Designed for predictive maintenance applications, failure identification, failure prediction, condition monitoring, etc.

Automated Machine Learning Pipeline for tabular data. Designed for predictive maintenance applications, failure identification, failure prediction, condition monitoring, etc.

Amplo 10 May 15, 2022
XGBoost-Ray is a distributed backend for XGBoost, built on top of distributed computing framework Ray.

XGBoost-Ray is a distributed backend for XGBoost, built on top of distributed computing framework Ray.

92 Dec 14, 2022
a distributed deep learning platform

Apache SINGA Distributed deep learning system http://singa.apache.org Quick Start Installation Examples Issues JIRA tickets Code Analysis: Mailing Lis

The Apache Software Foundation 2.7k Jan 05, 2023
This project used bitcoin, S&P500, and gold to construct an investment portfolio that aimed to minimize risk by minimizing variance.

minvar_invest_portfolio This project used bitcoin, S&P500, and gold to construct an investment portfolio that aimed to minimize risk by minimizing var

1 Jan 06, 2022
An AutoML survey focusing on practical systems.

This project is a community effort in constructing and maintaining an up-to-date beginner-friendly introduction to AutoML, focusing on practical systems. AutoML is a big field, and continues to grow

AutoGOAL 16 Aug 14, 2022
Python 3.6+ toolbox for submitting jobs to Slurm

Submit it! What is submitit? Submitit is a lightweight tool for submitting Python functions for computation within a Slurm cluster. It basically wraps

Facebook Incubator 768 Jan 03, 2023
Pragmatic AI Labs 421 Dec 31, 2022
Open source time series library for Python

PyFlux PyFlux is an open source time series library for Python. The library has a good array of modern time series models, as well as a flexible array

Ross Taylor 2k Jan 02, 2023
Book Recommender System Using Sci-kit learn N-neighbours

Model-Based-Recommender-Engine I created a book Recommender System using Sci-kit learn's N-neighbours algorithm for my model and the streamlit library

1 Jan 13, 2022
Arquivos do curso online sobre a estatística voltada para ciência de dados e aprendizado de máquina.

Estatistica para Ciência de Dados e Machine Learning Arquivos do curso online sobre a estatística voltada para ciência de dados e aprendizado de máqui

Renan Barbosa 1 Jan 10, 2022
Steganography is the art of hiding the fact that communication is taking place, by hiding information in other information.

Steganography is the art of hiding the fact that communication is taking place, by hiding information in other information.

Priyansh Sharma 7 Nov 09, 2022
Send rockets to Mars with artificial intelligence(Genetic algorithm) in python.

Send Rockets To Mars With AI Send rockets to Mars with artificial intelligence(Genetic algorithm) in python. Tools Python 3 EasyDraw How to Play Insta

Mohammad Dori 3 Jul 15, 2022
The easy way to combine mlflow, hydra and optuna into one machine learning pipeline.

mlflow_hydra_optuna_the_easy_way The easy way to combine mlflow, hydra and optuna into one machine learning pipeline. Objective TODO Usage 1. build do

shibuiwilliam 9 Sep 09, 2022
BentoML is a flexible, high-performance framework for serving, managing, and deploying machine learning models.

Model Serving Made Easy BentoML is a flexible, high-performance framework for serving, managing, and deploying machine learning models. Supports multi

BentoML 4.4k Jan 04, 2023
It is a forest of random projection trees

rpforest rpforest is a Python library for approximate nearest neighbours search: finding points in a high-dimensional space that are close to a given

Lyst 211 Dec 29, 2022
MooGBT is a library for Multi-objective optimization in Gradient Boosted Trees.

MooGBT is a library for Multi-objective optimization in Gradient Boosted Trees. MooGBT optimizes for multiple objectives by defining constraints on sub-objective(s) along with a primary objective. Th

Swiggy 66 Dec 06, 2022