Large-scale linear classification, regression and ranking in Python

Overview
lightning

lightning is a library for large-scale linear classification, regression and ranking in Python.

Highlights:

  • follows the scikit-learn API conventions
  • supports natively both dense and sparse data representations
  • computationally demanding parts implemented in Cython

Solvers supported:

  • primal coordinate descent
  • dual coordinate descent (SDCA, Prox-SDCA)
  • SGD, AdaGrad, SAG, SAGA, SVRG
  • FISTA

Example

The following example shows how to learn a multiclass classifier with a group lasso penalty on the News20 dataset (cf. Blondel et al. 2013):

from sklearn.datasets import fetch_20newsgroups_vectorized
from lightning.classification import CDClassifier

# Load News20 dataset from scikit-learn.
bunch = fetch_20newsgroups_vectorized(subset="all")
X = bunch.data
y = bunch.target

# Set classifier options.
clf = CDClassifier(penalty="l1/l2",
                   loss="squared_hinge",
                   multiclass=True,
                   max_iter=20,
                   alpha=1e-4,
                   C=1.0 / X.shape[0],
                   tol=1e-3)

# Train the model.
clf.fit(X, y)

# Accuracy
print(clf.score(X, y))

# Percentage of selected features
print(clf.n_nonzero(percentage=True))
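
The `l1/l2` (group lasso) penalty in the example above pushes whole groups of weights to zero together, which is why a percentage of selected features can be reported. The sketch below is not lightning's implementation. It is a minimal pure-Python illustration of block soft-thresholding, and the function name `group_soft_threshold` and the layout (one row per feature, one column per class) are assumptions made for this illustration only.

```python
import math


def group_soft_threshold(W, threshold):
    """Shrink each row of W toward zero by `threshold` in Euclidean norm.

    Rows whose norm is at most `threshold` become exactly zero, which is
    how a group penalty discards a feature for all classes at once.
    """
    out = []
    for row in W:
        norm = math.sqrt(sum(v * v for v in row))
        if norm <= threshold:
            out.append([0.0] * len(row))
        else:
            scale = 1.0 - threshold / norm
            out.append([scale * v for v in row])
    return out


# Hypothetical weights: 2 features (rows) x 2 classes (columns).
W = [[3.0, 4.0],   # row norm 5.0: kept, but shrunk
     [0.3, 0.4]]   # row norm 0.5: removed entirely
shrunk = group_soft_threshold(W, 1.0)

# Fraction of features that still have a nonzero weight for some class.
selected = sum(1 for row in shrunk if any(v != 0.0 for v in row))
print(shrunk, selected / len(W))
```

With a plain `l1` penalty each weight would be thresholded on its own, so a feature could stay active for some classes and not for others.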

Dependencies

lightning requires Python >= 2.7, setuptools, NumPy >= 1.3, SciPy >= 0.7 and scikit-learn >= 0.15. Building from source also requires Cython and a working C/C++ compiler. To run the tests you will also need nose >= 0.10.

Installation

Precompiled binaries for the stable version of lightning are available for the main platforms and can be installed using pip:

pip install sklearn-contrib-lightning

or conda:

conda install -c conda-forge sklearn-contrib-lightning

The development version of lightning can be installed from its git repository. This assumes that you have the git version control system, a working C++ compiler, Cython and the NumPy development libraries installed. To install the development version, type:

git clone https://github.com/scikit-learn-contrib/lightning.git
cd lightning
python setup.py build
sudo python setup.py install

Documentation

http://contrib.scikit-learn.org/lightning/

On GitHub

https://github.com/scikit-learn-contrib/lightning

Citing

If you use this software, please cite it. Here is a BibTeX snippet that you can use:

@misc{lightning_2016,
  author       = {Blondel, Mathieu and
                  Pedregosa, Fabian},
  title        = {{Lightning: large-scale linear classification,
                 regression and ranking in Python}},
  year         = 2016,
  doi          = {10.5281/zenodo.200504},
  url          = {https://doi.org/10.5281/zenodo.200504}
}

Other citation formats are available in its Zenodo entry.

Authors

  • Mathieu Blondel, 2012-present
  • Manoj Kumar, 2015-present
  • Arnaud Rachez, 2016-present
  • Fabian Pedregosa, 2016-present
Comments
  • [MRG] Parallelize OvR method in primal_cd

    @mblondel I was trying to get some speed gains by parallelizing the OvR method. However, when I set n_jobs>1, it keeps failing with this error: TypeError: __cinit__() takes exactly 1 positional argument (0 given). Note that it works as expected for n_jobs=1.

    opened by MechCoder 37
  • [WIP] Adding prox capability to SAGA.

    Continuing #37 after discussing with @fabianp. Added prox capability in _sag_fit of file lightning/impl/sag_fast.pyx where @fabianp left room for it.

    The proximity operator is currently specified when a classifier/regressor is built with the prox keyword (type ProxFunction mimicking LossFunction in lightning/impl/sgd_fast.pyx). Not sure this is the best way to specify it by default...

    Note: the prox implementation breaks sparse updates, and the code is excruciatingly slow on sklearn.datasets.fetch_20newsgroups_vectorized (cf. this gist).

    • [x] Draft of proximity operators.
    • [x] Need to add tests.
    • [x] Add sparsity in L1
    opened by zermelozf 31
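
For readers unfamiliar with the term, the proximity operator of the L1 penalty mentioned in this pull request is coordinate-wise soft-thresholding. The sketch below is a generic pure-Python illustration, not the code from `lightning/impl/sag_fast.pyx`. The function names `soft_threshold` and `prox_gradient_step` are hypothetical.

```python
def soft_threshold(w, threshold):
    """Proximity operator of threshold * ||w||_1, applied per coordinate."""
    out = []
    for v in w:
        if v > threshold:
            out.append(v - threshold)
        elif v < -threshold:
            out.append(v + threshold)
        else:
            out.append(0.0)
    return out


def prox_gradient_step(w, grad, step_size, alpha):
    """One proximal step: move along -grad, then apply the L1 prox."""
    moved = [wi - step_size * gi for wi, gi in zip(w, grad)]
    return soft_threshold(moved, step_size * alpha)


print(soft_threshold([2.0, -0.5, -3.0], 1.0))
```

Applied this way, the prox visits every coordinate at every step even when the sampled gradient is sparse, which is consistent with the slowdown on sparse data reported above.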
  • [MRG] Just in time SAGA.

    A squashed version of #38 containing:

    • SAGA algorithm in cython.
    • Basic python version of SAG and SAGA for testing.
    • Support for proximity operators through the Penalty base class.
    • L1 proximity operator with just in time update for sparse data.
    opened by zermelozf 24
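
To give an idea of what a basic Python version of SAGA with an L1 proximity operator can look like, here is a minimal sketch for a least-squares loss. It is not the code from this pull request: the function name `saga_lasso` is hypothetical, the prox is applied densely to every coordinate, and the just-in-time update for sparse data is deliberately left out.

```python
import random


def saga_lasso(X, y, alpha, step_size, n_epochs, seed=0):
    """SAGA for (1/n) * sum_i 0.5 * (x_i . w - y_i)**2 + alpha * ||w||_1."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    w = [0.0] * d
    # For a linear model the gradient of sample i is residual_i * x_i,
    # so storing one scalar per sample is enough.
    memory = [0.0] * n
    grad_avg = [0.0] * d  # average of the stored gradients
    t = step_size * alpha  # soft-thresholding level
    for _ in range(n_epochs * n):
        i = rng.randrange(n)
        residual = sum(xj * wj for xj, wj in zip(X[i], w)) - y[i]
        delta = residual - memory[i]
        for j in range(d):
            # SAGA direction: new gradient - stored gradient + average.
            w[j] -= step_size * (delta * X[i][j] + grad_avg[j])
            grad_avg[j] += delta * X[i][j] / n
            # L1 proximity operator (soft-thresholding).
            w[j] = max(w[j] - t, 0.0) - max(-w[j] - t, 0.0)
        memory[i] = residual
    return w


# Tiny noiseless problem whose least-squares solution is [2, -1].
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [2.0, -1.0, 1.0]
print(saga_lasso(X, y, alpha=0.0, step_size=1.0 / 6.0, n_epochs=200))
```

The step size 1/6 corresponds to 1/(3L), with L the largest squared row norm of this toy `X`. Other problems need their own value.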
  • Documentation update

    Hi @mblondel. Some of the recent additions (such as SAGA) don't show up on the web page. Would you mind pushing a new version of the docs? (I wouldn't mind doing it myself if it was on GitHub Pages.)

    opened by fabianp 18
  • FIX for SAG with sparse samples.

    The problem was that, when the solution was updated just in time, the different accumulated scalings were not taken into account. They were treated as if they had been constant over the last iterations.

    This should fix issue #33, although because of some Python 3 incompatibility I have not yet run the full test suite.

    opened by fabianp 14
  • raise AttributeError if predict_proba is not available

    In scikit-learn, when the predict_proba method is not available, AttributeError is raised instead of NotImplementedError. In this PR:

    • classifiers are changed to follow the same convention;
    • removed predict_log_proba mentions because lightning doesn't provide this method;
    • added more tests for predict_proba results.
    opened by kmike 12
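
The practical benefit of raising AttributeError is that `hasattr(clf, "predict_proba")` then returns False, so calling code can check for the capability in the usual way. The sketch below illustrates the convention with a hypothetical `SketchClassifier`. It is not lightning's code, and the rule that only `loss="log"` provides probabilities is an assumption made for the example.

```python
class SketchClassifier:
    """Hypothetical classifier: probabilities exist only for the log loss."""

    def __init__(self, loss="hinge"):
        self.loss = loss

    @property
    def predict_proba(self):
        if self.loss != "log":
            # hasattr() only swallows AttributeError, so this makes the
            # method look absent; NotImplementedError would propagate.
            raise AttributeError(
                "predict_proba is not available when loss=%r" % self.loss)
        return self._predict_proba

    def _predict_proba(self, X):
        # Placeholder output: uniform probabilities over two classes.
        return [[0.5, 0.5] for _ in X]


print(hasattr(SketchClassifier(loss="hinge"), "predict_proba"))
print(hasattr(SketchClassifier(loss="log"), "predict_proba"))
```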
  • 0.1 release

    I'd like to do a 0.1 release and upload binary packages to pypi and conda. TODO:

    • [x] Make binary conda packages for (at least) windows (appveyor).
    • [x] Update README with build instructions for binary packages.
    • [x] Update the website with the latest stable version.
    • [x] Create maintenance branch 0.1.X
    • [x] After release, upgrade version number to 0.2.dev0.

    What do you think @mblondel ?

    opened by fabianp 12
  • Release `0.6.2`

    I believe the Python 3.10 support that was added recently (3afcb4a9967a0d9e3961acd967705e42a593e448) deserves a new release of the package. In the new release we'll upload wheels for Python 3.10, making users' lives easier.

    opened by StrikerRUS 11
  • Build artifacts at GitHub Actions

    Wheels for all platforms and source archive will be automatically uploaded to Releases tab with each tagged commit.

    For an example, please refer to https://github.com/StrikerRUS/lightning/releases/tag/untagged-a19e7c8d925f0295f2b6.

    Unfortunately, neither the manylinux2010 nor the manylinux1 containers can be used, due to the following restriction of Node.js: https://github.com/actions/runner/issues/337. But I think manylinux2014 is better than nothing. Moreover, CentOS 6 and CentOS 5, on which those containers are based, have already reached their EOL. https://github.com/pypa/manylinux

    opened by StrikerRUS 11
  • Should the .pxd files be included with the distribution?

    I'm working on a package that uses lightning cython code as a dependency via:

    from lightning.impl.dataset_fast cimport ColumnDataset.

    When installing lightning via conda or pip, generating the cython file fails, but if I distribute the generated cpp files the code runs fine.

    Should the .pxd files be distributed with lightning to allow this use case?

    opened by vene 10
  • [HOTFIX] fix compatibility with new scikit-learn version

    This PR will allow using lightning with the latest version (0.23.0) of scikit-learn. Right now, if you try to upgrade scikit-learn, lightning fails with an error saying that it can import neither joblib nor six, because they no longer exist in sklearn.externals:

        from lightning.classification import KernelSVC
    ../../../virtualenv/python3.6.7/lib/python3.6/site-packages/lightning/classification.py:1: in <module>
        from .impl.adagrad import AdaGradClassifier
    ../../../virtualenv/python3.6.7/lib/python3.6/site-packages/lightning/impl/adagrad.py:8: in <module>
        from sklearn.externals.six.moves import xrange
    

    six was dropped along with Python 2 support. https://scikit-learn.org/stable/whats_new/v0.21.html#sklearn-externals

    joblib is now a dependency: https://scikit-learn.org/stable/whats_new/v0.21.html#miscellaneous

    This PR should be treated as a hotfix. Ideally, lightning should drop Python 2 support along with the six dependency.

    opened by StrikerRUS 9
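
One common way to remove a `six.moves.xrange` import without breaking either Python version is a small fallback at module level. The snippet below is only a sketch of that idea, not the change made in this PR.

```python
try:
    # Python 2: xrange is a builtin and this name lookup succeeds.
    xrange
except NameError:
    # Python 3: range is already lazy, like Python 2's xrange.
    xrange = range

total = sum(i for i in xrange(5))
print(total)
```

A module that only targets Python 3 can simply use `range` directly.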
  • Why not initialize SAG/SAGA memory with 0 and divide by seen indices so far as in sklearn?

    Why don't you initialize the gradient memory with 0 and use the number of indices seen so far in the SAG algorithm, as suggested in the paper?

    In the update of x in Algorithm 1, we normalize the direction d by the total number of data points n. When initializing with y_i = 0 we believe this leads to steps that are too small on early iterations of the algorithm where we have only seen a fraction of the data points, because many y_i variables contributing to d are set to the uninformative zero-vector. Following Blatt et al. [2007], the more logical normalization is to divide d by m, the number of data points that we have seen at least once

    The SAGA paper suggests a similar procedure:

    Our algorithm assumes that initial gradients are known for each f_i at the starting point x0. Instead, a heuristic may be used where during the first pass, data-points are introduced one-by-one, in a non-randomized order, with averages computed in terms of those data-points processed so far. This procedure has been successfully used with SAG [1].

    opened by NikZak 0
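
The heuristic quoted from the SAG paper can be sketched as follows: the gradient memory starts at zero and the summed direction is divided by m, the number of distinct samples seen so far, rather than by n. This is a pure-Python illustration for a least-squares loss under those assumptions. The function name `sag_least_squares` is hypothetical, and this is neither lightning's nor scikit-learn's implementation.

```python
import random


def sag_least_squares(X, y, step_size, n_epochs, seed=0):
    """SAG for (1/n) * sum_i 0.5 * (x_i . w - y_i)**2 with zero-initialized memory."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    w = [0.0] * d
    memory = [0.0] * n    # stored residual per sample, initialized to zero
    grad_sum = [0.0] * d  # sum of the stored gradients
    seen = [False] * n
    n_seen = 0            # m: number of distinct samples visited so far
    for _ in range(n_epochs * n):
        i = rng.randrange(n)
        if not seen[i]:
            seen[i] = True
            n_seen += 1
        residual = sum(xj * wj for xj, wj in zip(X[i], w)) - y[i]
        delta = residual - memory[i]
        memory[i] = residual
        for j in range(d):
            grad_sum[j] += delta * X[i][j]
            # Normalize by m = n_seen instead of n, so that early steps
            # are not shrunk by the still-empty memory slots.
            w[j] -= step_size * grad_sum[j] / n_seen
    return w


# Tiny noiseless problem whose least-squares solution is [2, -1].
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [2.0, -1.0, 1.0]
print(sag_least_squares(X, y, step_size=1.0 / 32.0, n_epochs=1000))
```

Once every sample has been visited, `n_seen` equals n and the update coincides with standard SAG. The step size 1/32 is 1/(16L) for this toy `X`.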
  • DOC: sometimes the Lasso solution is the same as sklearn, sometimes not

    Hi @mblondel @fabianp, I think this will be quick to answer: why is the solution sometimes equal to that of sklearn, and sometimes not?

    This should be quick to reproduce; look at the 1st and 3rd results over 5 seeds:

    import numpy as np
    from numpy.linalg import norm
    from lightning.regression import CDRegressor
    from sklearn.linear_model import Lasso
    
    np.random.seed(0)
    X = np.random.randn(200, 500)
    beta = np.ones(X.shape[1])
    beta[20:] = 0
    y = X @ beta + 0.3 * np.random.randn(X.shape[0])
    alpha = norm(X.T @ y, ord=np.inf) / 10
    
    
    def p_obj(X, y, alpha, w):
        return norm(y - X @ w) ** 2 / 2 + alpha * norm(w, ord=1)
    
    
    for seed in range(5):
        print('-' * 80)
        clf = CDRegressor(C=0.5, alpha=alpha, penalty='l1',
                          tol=1e-30, random_state=seed)
        clf.fit(X, y)
    
        las = Lasso(fit_intercept=False, alpha=alpha/len(y), tol=1e-10).fit(X, y)
        print(norm(clf.coef_[0] - las.coef_))
    
        light_o = p_obj(X, y, alpha, clf.coef_[0])
        sklea_o = p_obj(X, y, alpha, las.coef_)
    
        print(light_o - sklea_o)
    

    ping @qb3 @agramfort

    opened by mathurinm 5
  • Do you have regression for sparse categorical big data after one-hot transformation?

    Do you have a regression method for sparse, categorical big data after a one-hot transformation?

    After the transformation the data is sparse and contains only zeros and ones: many zeros and few ones.

    opened by Sandy4321 0
Releases(0.6.2.post0)
Genetic Algorithm, Particle Swarm Optimization, Simulated Annealing, Ant Colony Optimization Algorithm,Immune Algorithm, Artificial Fish Swarm Algorithm, Differential Evolution and TSP(Traveling salesman)

scikit-opt Swarm Intelligence in Python (Genetic Algorithm, Particle Swarm Optimization, Simulated Annealing, Ant Colony Algorithm, Immune Algorithm,A

郭飞 3.7k Jan 01, 2023
Extra blocks for scikit-learn pipelines.

scikit-lego We love scikit learn but very often we find ourselves writing custom transformers, metrics and models. The goal of this project is to atte

vincent d warmerdam 941 Dec 30, 2022
Fast solver for L1-type problems: Lasso, sparse Logistic regression, Group Lasso, weighted Lasso, Multitask Lasso, etc.

celer Fast algorithm to solve Lasso-like problems with dual extrapolation. Currently, the package handles the following problems: Lasso weighted Lasso

168 Dec 13, 2022
Multivariate imputation and matrix completion algorithms implemented in Python

A variety of matrix completion and imputation algorithms implemented in Python 3.6. To install: pip install fancyimpute Do not use conda. We don't sup

Alex Rubinsteyn 1.1k Dec 18, 2022
Data Analysis Baseline Library

dabl The data analysis baseline library. "Mr Sanchez, are you a data scientist?" "I dabl, Mr president." Find more information on the website. State o

Andreas Mueller 122 Dec 27, 2022
A library of sklearn compatible categorical variable encoders

Categorical Encoding Methods A set of scikit-learn-style transformers for encoding categorical variables into numeric by means of different techniques

2.1k Jan 02, 2023
machine learning with logical rules in Python

skope-rules Skope-rules is a Python machine learning module built on top of scikit-learn and distributed under the 3-Clause BSD license. Skope-rules a

504 Dec 31, 2022
A Python library for dynamic classifier and ensemble selection

DESlib DESlib is an easy-to-use ensemble learning library focused on the implementation of the state-of-the-art techniques for dynamic classifier and

425 Dec 18, 2022
Topological Data Analysis for Python🐍

Scikit-TDA is a home for Topological Data Analysis Python libraries intended for non-topologists. This project aims to provide a curated library of TD

Scikit-TDA 373 Dec 24, 2022
A library of extension and helper modules for Python's data analysis and machine learning libraries.

Mlxtend (machine learning extensions) is a Python library of useful tools for the day-to-day data science tasks. Sebastian Raschka 2014-2021 Links Doc

Sebastian Raschka 4.2k Dec 28, 2022
Scikit-learn compatible estimation of general graphical models

skggm : Gaussian graphical models using the scikit-learn API In the last decade, learning networks that encode conditional independence relationships

213 Jan 02, 2023
Large-scale linear classification, regression and ranking in Python

lightning lightning is a library for large-scale linear classification, regression and ranking in Python. Highlights: follows the scikit-learn API con

1.6k Dec 31, 2022
scikit-learn cross validators for iterative stratification of multilabel data

iterative-stratification iterative-stratification is a project that provides scikit-learn compatible cross validators with stratification for multilab

745 Jan 05, 2023
A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning

imbalanced-learn imbalanced-learn is a python package offering a number of re-sampling techniques commonly used in datasets showing strong between-cla

6.2k Jan 01, 2023
scikit-learn inspired API for CRFsuite

sklearn-crfsuite sklearn-crfsuite is a thin CRFsuite (python-crfsuite) wrapper which provides interface similar to scikit-learn. sklearn_crfsuite.CRF i

418 Jan 09, 2023
(AAAI' 20) A Python Toolbox for Machine Learning Model Combination

combo: A Python Toolbox for Machine Learning Model Combination Deployment & Documentation & Stats Build Status & Coverage & Maintainability & License

Yue Zhao 606 Dec 21, 2022
A scikit-learn based module for multi-label et. al. classification

scikit-multilearn scikit-multilearn is a Python module capable of performing multi-label learning tasks. It is built on-top of various scientific Pyth

803 Jan 05, 2023