PySpark + Scikit-learn = Sparkit-learn

Overview

Sparkit-learn

Build Status · PyPI · Gitter chat: https://gitter.im/lensacom/sparkit-learn · Gitential Coding Hours

GitHub: https://github.com/lensacom/sparkit-learn

About

Sparkit-learn aims to provide scikit-learn functionality and API on PySpark. The main goal of the library is to create an API that stays close to sklearn's.

The driving principle is "Think locally, execute distributively." To accommodate this concept, the basic data block is always an array or a (sparse) matrix, and operations are executed at the block level.
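
As a minimal sketch of what block-level execution means in practice (this assumes a running SparkContext named sc, and that ArrayRDD's block-wise map behaves as described in the Quick start below; it is an illustration, not an excerpt from the official docs):

from splearn.rdd import ArrayRDD

data = range(100)
rdd = sc.parallelize(data, 4)   # plain PySpark RDD of single elements
X = ArrayRDD(rdd, bsize=25)     # the same data, blocked into numpy arrays

# The lambda receives a whole numpy block (25 elements), not a single element,
# so ordinary vectorized numpy code runs unchanged on every block.
squared = X.map(lambda block: block ** 2)
squared.collect()               # a list of numpy arrays, one per block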

Requirements

  • Python 2.7.x or 3.4.x
  • Spark[>=1.3.0]
  • NumPy[>=1.9.0]
  • SciPy[>=0.14.0]
  • Scikit-learn[>=0.16]

Run IPython from the notebooks directory

PYTHONPATH=${PYTHONPATH}:.. IPYTHON_OPTS="notebook" ${SPARK_HOME}/bin/pyspark --master local\[4\] --driver-memory 2G

Run tests with

./runtests.sh

Quick start

Sparkit-learn introduces three important distributed data formats:

  • ArrayRDD:

    A numpy.array-like distributed array

    from splearn.rdd import ArrayRDD
    
    data = range(20)
    # PySpark RDD with 2 partitions
    rdd = sc.parallelize(data, 2) # each partition with 10 elements
    # ArrayRDD
    # each partition will contain blocks with 5 elements
    X = ArrayRDD(rdd, bsize=5) # 4 blocks, 2 in each partition

    Basic operations:

    len(X) # 20 - number of elements in the whole dataset
    X.blocks # 4 - number of blocks
    X.shape # (20,) - the shape of the whole dataset
    
    X # returns an ArrayRDD
    # <class 'splearn.rdd.ArrayRDD'> from PythonRDD...
    
    X.dtype # returns the type of the blocks
    # numpy.ndarray
    
    X.collect() # get the dataset
    # [array([0, 1, 2, 3, 4]),
    #  array([5, 6, 7, 8, 9]),
    #  array([10, 11, 12, 13, 14]),
    #  array([15, 16, 17, 18, 19])]
    
    X[1].collect() # indexing
    # [array([5, 6, 7, 8, 9])]
    
    X[1] # also returns an ArrayRDD!
    
    X[1::2].collect() # slicing
    # [array([5, 6, 7, 8, 9]),
    #  array([15, 16, 17, 18, 19])]
    
    X[1::2] # returns an ArrayRDD as well
    
    X.tolist() # returns the dataset as a list
    # [0, 1, 2, ... 17, 18, 19]
    X.toarray() # returns the dataset as a numpy.array
    # array([ 0,  1,  2, ... 17, 18, 19])
    
    # pyspark.rdd operations will still work
    X.getNumPartitions() # 2 - number of partitions
  • SparseRDD:

    The sparse counterpart of the ArrayRDD; the main difference is that its blocks are sparse matrices. The reason behind this split is to follow the distinction between numpy.ndarrays and scipy.sparse matrices. Usually a SparseRDD is created by splearn's transformers, but one can instantiate it manually as well.

    # generate a SparseRDD from a text using SparkCountVectorizer
    from splearn.rdd import SparseRDD
    from sklearn.feature_extraction.tests.test_text import ALL_FOOD_DOCS
    ALL_FOOD_DOCS
    #(u'the pizza pizza beer copyright',
    # u'the pizza burger beer copyright',
    # u'the the pizza beer beer copyright',
    # u'the burger beer beer copyright',
    # u'the coke burger coke copyright',
    # u'the coke burger burger',
    # u'the salad celeri copyright',
    # u'the salad salad sparkling water copyright',
    # u'the the celeri celeri copyright',
    # u'the tomato tomato salad water',
    # u'the tomato salad water copyright')
    
    # ArrayRDD created from the raw data
    X = ArrayRDD(sc.parallelize(ALL_FOOD_DOCS, 4), 2)
    X.collect()
    # [array([u'the pizza pizza beer copyright',
    #         u'the pizza burger beer copyright'], dtype='<U31'),
    #  array([u'the the pizza beer beer copyright',
    #         u'the burger beer beer copyright'], dtype='<U33'),
    #  array([u'the coke burger coke copyright',
    #         u'the coke burger burger'], dtype='<U30'),
    #  array([u'the salad celeri copyright',
    #         u'the salad salad sparkling water copyright'], dtype='<U41'),
    #  array([u'the the celeri celeri copyright',
    #         u'the tomato tomato salad water'], dtype='<U31'),
    #  array([u'the tomato salad water copyright'], dtype='<U32')]
    
    # Feature extraction executed
    from splearn.feature_extraction.text import SparkCountVectorizer
    vect = SparkCountVectorizer()
    X = vect.fit_transform(X)
    # and we have a SparseRDD
    X
    # <class 'splearn.rdd.SparseRDD'> from PythonRDD...
    
    # its dtype is scipy.sparse's general parent class
    X.dtype
    # scipy.sparse.base.spmatrix
    
    # slicing works just like in ArrayRDDs
    X[2:4].collect()
    # [<2x11 sparse matrix of type '<type 'numpy.int64'>'
    #   with 7 stored elements in Compressed Sparse Row format>,
    #  <2x11 sparse matrix of type '<type 'numpy.int64'>'
    #   with 9 stored elements in Compressed Sparse Row format>]
    
    # general mathematical operations are available
    X.sum(), X.mean(), X.max(), X.min()
    # (55, 0.45454545454545453, 2, 0)
    
    # even with axis parameters provided
    X.sum(axis=1)
    # matrix([[5],
    #         [5],
    #         [6],
    #         [5],
    #         [5],
    #         [4],
    #         [4],
    #         [6],
    #         [5],
    #         [5],
    #         [5]])
    
    # It can be transformed to dense ArrayRDD
    X.todense()
    # <class 'splearn.rdd.ArrayRDD'> from PythonRDD...
    X.todense().collect()
    # [array([[1, 0, 0, 0, 1, 2, 0, 0, 1, 0, 0],
    #         [1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0]]),
    #  array([[2, 0, 0, 0, 1, 1, 0, 0, 2, 0, 0],
    #         [2, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0]]),
    #  array([[0, 1, 0, 2, 1, 0, 0, 0, 1, 0, 0],
    #         [0, 2, 0, 1, 0, 0, 0, 0, 1, 0, 0]]),
    #  array([[0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0],
    #         [0, 0, 0, 0, 1, 0, 2, 1, 1, 0, 1]]),
    #  array([[0, 0, 2, 0, 1, 0, 0, 0, 2, 0, 0],
    #         [0, 0, 0, 0, 0, 0, 1, 0, 1, 2, 1]]),
    #  array([[0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1]])]
    
    # One can instantiate a SparseRDD manually, too:
    import numpy as np
    import scipy.sparse as sp
    
    sparse = sc.parallelize(np.array([sp.eye(2).tocsr()] * 20), 2)
    sparse = SparseRDD(sparse, bsize=5)
    sparse
    # <class 'splearn.rdd.SparseRDD'> from PythonRDD...
    
    sparse.collect()
    # [<10x2 sparse matrix of type '<type 'numpy.float64'>'
    #   with 10 stored elements in Compressed Sparse Row format>,
    #  <10x2 sparse matrix of type '<type 'numpy.float64'>'
    #   with 10 stored elements in Compressed Sparse Row format>,
    #  <10x2 sparse matrix of type '<type 'numpy.float64'>'
    #   with 10 stored elements in Compressed Sparse Row format>,
    #  <10x2 sparse matrix of type '<type 'numpy.float64'>'
    #   with 10 stored elements in Compressed Sparse Row format>]
  • DictRDD:

    A column-based data format, where each column has its own type.

    import numpy as np
    
    from splearn.rdd import DictRDD
    
    X = range(20)
    y = list(range(2)) * 10
    # PySpark RDD with 2 partitions
    X_rdd = sc.parallelize(X, 2) # each partition with 10 elements
    y_rdd = sc.parallelize(y, 2) # each partition with 10 elements
    # DictRDD
    # each partition will contain blocks with 5 elements
    Z = DictRDD((X_rdd, y_rdd),
                columns=('X', 'y'),
                bsize=5,
                dtype=[np.ndarray, np.ndarray]) # 4 blocks, 2/partition
    # if no dtype is provided, the type of the blocks will be determined
    # automatically
    
    # or:
    import numpy as np
    
    data = np.array([range(20), list(range(2))*10]).T
    rdd = sc.parallelize(data, 2)
    Z = DictRDD(rdd,
                columns=('X', 'y'),
                bsize=5,
                dtype=[np.ndarray, np.ndarray])

    Basic operations:

    len(Z) # 8 - number of blocks
    Z.columns # returns ('X', 'y')
    Z.dtype # returns the types in correct order
    # [numpy.ndarray, numpy.ndarray]
    
    Z # returns a DictRDD
    #<class 'splearn.rdd.DictRDD'> from PythonRDD...
    
    Z.collect()
    # [(array([0, 1, 2, 3, 4]), array([0, 1, 0, 1, 0])),
    #  (array([5, 6, 7, 8, 9]), array([1, 0, 1, 0, 1])),
    #  (array([10, 11, 12, 13, 14]), array([0, 1, 0, 1, 0])),
    #  (array([15, 16, 17, 18, 19]), array([1, 0, 1, 0, 1]))]
    
    Z[:, 'y'] # column select - returns an ArrayRDD
    Z[:, 'y'].collect()
    # [array([0, 1, 0, 1, 0]),
    #  array([1, 0, 1, 0, 1]),
    #  array([0, 1, 0, 1, 0]),
    #  array([1, 0, 1, 0, 1])]
    
    Z[:-1, ['X', 'y']] # slicing - DictRDD
    Z[:-1, ['X', 'y']].collect()
    # [(array([0, 1, 2, 3, 4]), array([0, 1, 0, 1, 0])),
    #  (array([5, 6, 7, 8, 9]), array([1, 0, 1, 0, 1])),
    #  (array([10, 11, 12, 13, 14]), array([0, 1, 0, 1, 0]))]

Basic workflow

With the use of the described data structures, the basic workflow is almost identical to sklearn's.

Distributed vectorizing of texts

SparkCountVectorizer

from splearn.rdd import ArrayRDD
from splearn.feature_extraction.text import SparkCountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

X = [...]  # list of texts
X_rdd = ArrayRDD(sc.parallelize(X, 4))  # sc is SparkContext

local = CountVectorizer()
dist = SparkCountVectorizer()

result_local = local.fit_transform(X)
result_dist = dist.fit_transform(X_rdd)  # SparseRDD
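
Since result_dist is a SparseRDD of scipy.sparse blocks, one hedged way to sanity-check it against the local result is to collect and stack the blocks on the driver. This sketch assumes the data is small enough to fit in driver memory and that both vectorizers learn the same vocabulary ordering:

import numpy as np
from scipy.sparse import vstack

# Stack the distributed sparse blocks into a single matrix and compare densely.
dist_matrix = vstack(result_dist.collect())
np.testing.assert_array_almost_equal(result_local.toarray(),
                                     dist_matrix.toarray())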

SparkHashingVectorizer

from splearn.rdd import ArrayRDD
from splearn.feature_extraction.text import SparkHashingVectorizer
from sklearn.feature_extraction.text import HashingVectorizer

X = [...]  # list of texts
X_rdd = ArrayRDD(sc.parallelize(X, 4))  # sc is SparkContext

local = HashingVectorizer()
dist = SparkHashingVectorizer()

result_local = local.fit_transform(X)
result_dist = dist.fit_transform(X_rdd)  # SparseRDD

SparkTfidfTransformer

from splearn.rdd import ArrayRDD
from splearn.feature_extraction.text import SparkHashingVectorizer
from splearn.feature_extraction.text import SparkTfidfTransformer
from splearn.pipeline import SparkPipeline

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline

X = [...]  # list of texts
X_rdd = ArrayRDD(sc.parallelize(X, 4))  # sc is SparkContext

local_pipeline = Pipeline((
    ('vect', HashingVectorizer()),
    ('tfidf', TfidfTransformer())
))
dist_pipeline = SparkPipeline((
    ('vect', SparkHashingVectorizer()),
    ('tfidf', SparkTfidfTransformer())
))

result_local = local_pipeline.fit_transform(X)
result_dist = dist_pipeline.fit_transform(X_rdd)  # SparseRDD

Distributed Classifiers

import numpy as np

from splearn.rdd import DictRDD
from splearn.feature_extraction.text import SparkHashingVectorizer
from splearn.feature_extraction.text import SparkTfidfTransformer
from splearn.svm import SparkLinearSVC
from splearn.pipeline import SparkPipeline

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline

X = [...]  # list of texts
y = [...]  # list of labels
X_rdd = sc.parallelize(X, 4)
y_rdd = sc.parallelize(y, 4)
Z = DictRDD((X_rdd, y_rdd),
            columns=('X', 'y'),
            dtype=[np.ndarray, np.ndarray])

local_pipeline = Pipeline((
    ('vect', HashingVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', LinearSVC())
))
dist_pipeline = SparkPipeline((
    ('vect', SparkHashingVectorizer()),
    ('tfidf', SparkTfidfTransformer()),
    ('clf', SparkLinearSVC())
))

local_pipeline.fit(X, y)
dist_pipeline.fit(Z, clf__classes=np.unique(y))

y_pred_local = local_pipeline.predict(X)
y_pred_dist = dist_pipeline.predict(Z[:, 'X'])
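
To inspect the distributed predictions on the driver, one hedged option is the following sketch; it assumes the label vector fits in driver memory and that predict returns an ArrayRDD, as in the Quick start examples:

import numpy as np

# Collect the distributed predictions into one numpy array and score them.
y_pred_dist_local = y_pred_dist.toarray()
accuracy = np.mean(y_pred_dist_local == np.array(y))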

Distributed Model Selection

import numpy as np

from splearn.rdd import DictRDD
from splearn.grid_search import SparkGridSearchCV
from splearn.naive_bayes import SparkMultinomialNB

from sklearn.grid_search import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

X = [...]
y = [...]
X_rdd = sc.parallelize(X, 4)
y_rdd = sc.parallelize(y, 4)
Z = DictRDD((X_rdd, y_rdd),
            columns=('X', 'y'),
            dtype=[np.ndarray, np.ndarray])

parameters = {'alpha': [0.1, 1, 10]}
fit_params = {'classes': np.unique(y)}

local_estimator = MultinomialNB()
local_grid = GridSearchCV(estimator=local_estimator,
                          param_grid=parameters)

estimator = SparkMultinomialNB()
grid = SparkGridSearchCV(estimator=estimator,
                         param_grid=parameters,
                         fit_params=fit_params)

local_grid.fit(X, y)
grid.fit(Z)
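
After fitting, the grid search results can be inspected. This is a hedged sketch: it assumes SparkGridSearchCV follows the scikit-learn 0.16-era GridSearchCV interface, which exposes best_params_, best_score_ and grid_scores_:

# Best parameter combination and its cross-validated score
print(grid.best_params_)   # e.g. {'alpha': 1} (hypothetical output)
print(grid.best_score_)

# Per-candidate mean scores (pre-0.18 scikit-learn style)
for candidate in grid.grid_scores_:
    print(candidate)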

ROADMAP

  • [ ] Transparent API to support plain numpy and scipy objects (partially done in the transparent_api branch)
  • [ ] Update all dependencies
  • [ ] Use the MLlib and ML packages more extensively (as they become more mature)
  • [ ] Support Spark DataFrames

Special thanks

  • scikit-learn community
  • spylearn community
  • pyspark community

Similar Projects

Comments
  • DBSCAN Import Error

    DBSCAN Import Error

    I have been trying to run DBSCAN, using Python from command line .. I got this error

    ImportError: cannot import name _get_unmangled_double_vector_rdd

    Any one can help me regarding this ?

    opened by Elbehery 6
  • Poor performances

    Poor performances

    Hi to all! I just started digging into Spark machine learning, coming from scikit-learn. I tried to fit a linear SVC with both scikit-learn and sparkit-learn, and splearn remains slower than scikit-learn. How is this possible? (I am attaching my code and results.)

    import time as t
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.svm import LinearSVC
    from splearn.svm import SparkLinearSVC
    from splearn.rdd import ArrayRDD, DictRDD

    X, y = make_classification(n_samples=20000, n_classes=2)
    print 'Dataset created. # of samples: ', X.shape[0]
    skstart = t.time()
    dt = DecisionTreeClassifier()
    local_clf = LinearSVC()
    local_clf.fit(X, y)

    sktime = t.time() - skstart
    print 'Scikit-learn fitting time: ', sktime

    spstart = t.time()
    X_rdd = sc.parallelize(X, 20)
    y_rdd = sc.parallelize(y, 20)
    Z = DictRDD((X_rdd, y_rdd), columns=('X', 'y'), dtype=[np.ndarray, np.ndarray])

    distr_clf = SparkLinearSVC()
    distr_clf.fit(Z, np.unique(y))
    sptime = t.time() - spstart
    print 'Spark time: ', sptime

    ============== RESULTS =================
    Dataset created. # of samples: 20000
    Scikit-learn fitting time: 3.03552293777
    Spark time: 3.919039011

    OR for fewer samples:
    Dataset created. # of samples: 2000
    Scikit-learn fitting time: 0.244801998138
    Spark time: 3.15833210945

    opened by orfi2017 3
  • Fixes

    Fixes

    Hello,

    Here are some fixes for issues I encountered using sparkit-learn. If you prefer one pull request per fix, I can provide them.

    • The check for numpy is often failing and does not seem very important
    • Some .persist() calls were very aggressive
    • SparkFeatureUnion does not handle more than two steps or DictRDDs, and there was an issue with a return statement

    Best,

    opened by taynaud 3
  • few readme.md and setup.py corrections

    few readme.md and setup.py corrections

    I'm not sure if this stuff was supposed to work or if it was an untested rough draft. Creating a PR for a few things I've noticed; I'll dig more if you think I should.

    opened by vchollati 3
  • Validate Vocabulary issue

    Validate Vocabulary issue

    First of all, thanks for the git. It is very helpful.

    When running splearn/feature_extraction/text.py in pyspark shell, I am getting "AttributeError: 'SparkCountVectorizer' object has no attribute '_validate_vocabulary' " in fit_transform method. (Is it need to be '_init_vocab' or something??)

    Python 2.7, Spark 1.2.0, scikit-learn 0.15.2, numpy 1.9.0

    Code I am using:

    vect = text.SparkCountVectorizer()
    result_dist = vect.fit_transform(docs).collect()

    If this isn't the appropriate place to post the issue details, please redirect me. Thanks in advance.

    question 
    opened by raghunittala 3
  • ImportError: No module named _common

    ImportError: No module named _common

    From AWS EMR 5.4, Spark 2.1.0, I can't import dbscan

    File "/usr/local/lib/python2.7/site-packages/splearn/cluster/dbscan.py", line 3, in <module>
      from pyspark.mllib._common import (_get_unmangled_double_vector_rdd,
    ImportError: No module named _common
    
    opened by nicerobot 2
  • Fix issue in bayes predict_proba

    Fix issue in bayes predict_proba

    predict_proba maps to scikit-learn's predict_proba, which calls self.predict_proba. As sparkit-learn inherits from scikit-learn's Bayes classes, this calls sparkit-learn's predict_proba, which fails on numpy or scipy arrays.

    opened by taynaud 2
  • Error in Creating DictRdd: Can only zip RDDs with same number of elements in each partition

    Error in Creating DictRdd: Can only zip RDDs with same number of elements in each partition

    I am trying to create a DictRdd as follows:

    cleanedRdd=sc.sequenceFile(path="hdfs:///bdpilot/text_mining/sequence_input_with_target_v",minSplits=100)
    train_rdd,test_rdd = cleanedRdd.randomSplit([0.7,0.3])
    train_rdd.saveAsSequenceFile("hdfs:///bdpilot/text_mining/sequence_train_input")
    
    train_rdd = sc.sequenceFile(path="hdfs:///bdpilot3_h/text_mining/sequence_train_input",minSplits=100)
    
    train_y = train_rdd.map(lambda(x,y): int(y.split("~")[1]))
    train_text = train_rdd.map(lambda(x,y): y.split("~")[0])
    
    train_Z = DictRDD((train_text,train_y),columns=('X','y'),bsize=50)
    

    But I get the following error when I do:

    train_Z.first()
    
    org.apache.spark.SparkException: Can only zip RDDs with same number of
    elements in each partition
    

    I tried the following as well, but with no success:

    train_y = train_rdd.map(lambda(x,y): int(y.split("~")[1]),perservesPartitioning=True)
    train_text = train_rdd.map(lambda(x,y): y.split("~")[0],perservesPartitioning=True)
    train_Z = DictRDD((train_text,train_y),columns=('X','y'),bsize=50)
    
    opened by mrshanth 2
  • Spark version 1.2.1, error AttributeError: <class 'rdd.BlockRDD'> object has no attribute treeReduce

    Spark version 1.2.1, error AttributeError: object has no attribute treeReduce

    countvectorizer = SparkCountVectorizer(tokenizer=tokenize_pre_process)
    count_vector
    # <class 'rdd.ArrayRDD'> from PythonRDD[22] at collect at rdd.py:168
    sel_vt = SparkVarianceThreshold()
    red_vt_vector = sel_vt.fit_transform(count_vector)

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "base.py", line 63, in fit_transform
        return self.fit(Z, **fit_params).transform(Z)
      File "feature_selection.py", line 72, in fit
        _, _, self.variances_ = X.map(mapper).treeReduce(reducer)
      File "rdd.py", line 179, in __getattr__
        self.__class__, attr))
    AttributeError: <class 'rdd.BlockRDD'> object has no attribute treeReduce

    I am using Spark 1.2.1, and I think rdd has the method treeReduce. Would you have any idea why this error could be popping out of the ArrayRDD extending BlockRDD?

    opened by vishalrajpal25 2
  • TypeError: 'Broadcast' object is unsubscriptable

    TypeError: 'Broadcast' object is unsubscriptable

    We are trying to create a count vector using SparkCountVectorizer. We are using Python 2.6.6, hence we replaced all the dict comprehensions in the code. We ran into the following error:

      File "base.py", line 19, in func_wrapper
        return func(*args, **kwargs)
      File "splearn_custom.py", line 176, in _count_vocab
        j_indices.append(vocabulary[feature])
    TypeError: 'Broadcast' object is unsubscriptable
    

    Here, splearn_custom.py refers to feature_extraction/text.py

    Thanks in advance

    opened by mrshanth 2
  • Syntax Errors

    Syntax Errors

    I am getting syntax errors in

    • splearn/base.py :
      • for name in self.transient}, line 12
    • splearn/feature_extraction/text.py :
      • vocabulary = {t: i for i, t in enumerate(accum.value)}, line 154
    opened by mrshanth 2
  • Examples missing

    Examples missing

    Hi!

    I looked into the different subdirectories for an end-to-end demonstration of the features provided by sparkit-learn but couldn't find one. However, a directory named examples exists, but it seems empty. If you permit, can I work towards writing code-based tutorials for sparkit-learn?

    Thanks!

    opened by bhav09 0
  • docs: fix simple typo, unnunnecessary -> unnecessary

    docs: fix simple typo, unnunnecessary -> unnecessary

    There is a small typo in splearn/rdd.py.

    Should read unnecessary rather than unnunnecessary.

    Semi-automated pull request generated by https://github.com/timgates42/meticulous/blob/master/docs/NOTE.md

    opened by timgates42 0
  • ImportError: cannot import name _check_numpy_unicode_bug

    ImportError: cannot import name _check_numpy_unicode_bug

    I got this error when importing the SparkLabelEncoder module with scikit-learn version 0.19.1.

    >>> from splearn.preprocessing import SparkLabelEncoder
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/python2.7/site-packages/splearn/preprocessing/__init__.py", line 1, in <module>
        from .label import SparkLabelEncoder
      File "/usr/lib/python2.7/site-packages/splearn/preprocessing/label.py", line 3, in <module>
        from sklearn.preprocessing.label import _check_numpy_unicode_bug
    ImportError: cannot import name _check_numpy_unicode_bug
    
    opened by dankiho 0
  • [Question] ArrayRDD to Pyspark Dataframe?

    [Question] ArrayRDD to Pyspark Dataframe?

    Hi - thanks so much for this package!

    I came to this repo because I need to run a scikit-learn predictive model on Spark. It is easy to map the model with ArrayRDDs. However, my postprocessing assumes a PySpark DataFrame. Is there a way to convert an ArrayRDD to a DataFrame?

    I appreciate any help, thanks!

    opened by osimpson 1
  • Import error cannot import name

    Import error cannot import name "frombuffer_empty"

    I installed the latest version of sparkit-learn and I got the error

    >>> from splearn.feature_extraction.text import SparkCountVectorizer
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/gaurishk/anaconda3/lib/python3.6/site-packages/splearn/feature_extraction/__init__.py", line 3, in <module>
        from .text import SparkCountVectorizer
      File "/home/gaurishk/anaconda3/lib/python3.6/site-packages/splearn/feature_extraction/text.py", line 16, in <module>
        from sklearn.utils.fixes import frombuffer_empty
    ImportError: cannot import name 'frombuffer_empty'

    opened by thak123 2
  • What is the roadmap for this project: is it moribund?

    What is the roadmap for this project: is it moribund?

    There has been little to no activity for over a year, and one of the issues recommends using Spark ML/MLlib instead. Would the owners please clarify whether this project is intended to be supported moving forward? I would like to know in order to decide whether to add some algorithms here or independently.

    opened by javadba 1

Releases: 0.2.5
Owner: Lensa