SparseLasso: Sparse Solutions for the Lasso

Introduction

SparseLasso provides Scikit-Learn based estimation of the Lasso with cross-validation tuning of the penalty, using the 'one standard error' rule to yield sparse solutions. The 'one standard error' rule recognizes that the cross-validation path is itself estimated with error and therefore selects a more parsimonious model (see Hastie, Tibshirani and Friedman, 2009). Concretely, the rule chooses the largest penalty whose cross-validation error is still within one standard error of the cross-validation optimum. Given that the Lasso often selects too many variables in practice, the one standard error rule provides a practical way to obtain sparser models. A software implementation of this rule is readily available in the R package 'glmnet' (Friedman, Hastie and Tibshirani, 2010), but it is absent from the Scikit-Learn module (Pedregosa et al., 2011). SparseLasso estimates the penalized linear and logistic models based on Scikit-Learn's LassoCV and LogisticRegressionCV, respectively, and thus accepts the standard Scikit-Learn arguments.
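
To make the rule concrete, the sketch below illustrates how the one standard error penalty can be derived from the cross-validation path of Scikit-Learn's LassoCV. This is only an illustration of the rule under these assumptions, not necessarily the package's exact implementation.

# import modules
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# simulate example data and fit LassoCV to obtain the cross-validation path
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=5, random_state=0)
cv_lasso = LassoCV(n_alphas=100, cv=10).fit(X=X, y=y)

# mean cross-validation error and its standard error for each penalty value
# (mse_path_ has shape (n_alphas, n_folds))
cv_mean = cv_lasso.mse_path_.mean(axis=1)
cv_se = cv_lasso.mse_path_.std(axis=1, ddof=1) / np.sqrt(cv_lasso.mse_path_.shape[1])

# index of the cross-validation optimal penalty (minimum mean error)
idx_min = np.argmin(cv_mean)

# one standard error rule: largest penalty whose mean cross-validation
# error is still within one standard error of the minimum
alpha_1se = cv_lasso.alphas_[cv_mean <= cv_mean[idx_min] + cv_se[idx_min]].max()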

Installation

The SparseLasso module requires Python 3 and builds on the scikit-learn module. The required modules can be installed by navigating to the root of this project and executing the following command: pip install -r requirements.txt.

Example

The example below demonstrates the basic usage of the SparseLasso module.

# import modules
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# import SparseLasso
from sparse_lasso import SparseLassoCV

# simulate some example data for the linear model
X, y, coef = make_regression(n_samples=1000,
                             n_features=100, 
                             n_informative=10,
                             noise=10,
                             coef=True,
                             random_state=0)

# estimate standard LassoCV with optimal lambda minimizing error
lasso_min = LassoCV(n_alphas=100, cv=10).fit(X=X, y=y)

# estimate SparseLassoCV with lambda using 1 standard error rule
lasso_1se = SparseLassoCV(n_alphas=100, cv=10).fit(X=X, y=y)

# compare the penalty values
print('Lasso Min Penalty: ', round(lasso_min.alpha_, 2), '\n',
      'Lasso 1se Penalty: ', round(lasso_1se.alpha, 2), '\n')

# compare the number of selected features
print('Lasso Min Number of Selected Variables:     ',
      np.sum((lasso_min.coef_ != 0) * 1), '\n',
      'Lasso 1se Number of Selected Variables:     ',
      np.sum((lasso_1se.coef_ != 0) * 1), '\n')

For a more detailed example, see sparse_lasso_example.py. For a simulation exercise comparing the cross-validation optimal penalty choice with the one standard error rule in terms of variable selection, see sparse_lasso_simulation.py; a minimal sketch of such a comparison is given below.
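
As a rough sketch of what such a comparison might look like (an illustrative example only, not the contents of sparse_lasso_simulation.py), one can repeatedly simulate data with a known number of informative variables and count how many variables each penalty choice selects:

# import modules
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# import SparseLasso
from sparse_lasso import SparseLassoCV

# count selected variables under both penalty choices over repeated simulations
n_selected_min, n_selected_1se = [], []
for seed in range(10):
    # simulate data with 10 truly informative out of 100 features
    X, y = make_regression(n_samples=500, n_features=100, n_informative=10,
                           noise=10, random_state=seed)
    n_selected_min.append(np.sum(LassoCV(n_alphas=100, cv=10).fit(X=X, y=y).coef_ != 0))
    n_selected_1se.append(np.sum(SparseLassoCV(n_alphas=100, cv=10).fit(X=X, y=y).coef_ != 0))

# compare the average number of selected variables (10 are informative)
print('Lasso Min Average Number of Selected Variables: ', np.mean(n_selected_min), '\n',
      'Lasso 1se Average Number of Selected Variables: ', np.mean(n_selected_1se), '\n')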

References

  • Hastie, Trevor, Robert Tibshirani, and Jerome H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. New York: Springer, 2009.
  • Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. "Regularization Paths for Generalized Linear Models via Coordinate Descent." Journal of Statistical Software 33.1 (2010): 1-22.
  • Pedregosa, Fabian, et al. "Scikit-learn: Machine Learning in Python." Journal of Machine Learning Research 12 (2011): 2825-2830.