A fast Python implementation of the famous SVD algorithm popularized by Simon Funk during the Netflix Prize

Last update: Dec 19, 2022

Overview

⚡ funk-svd

funk-svd is a Python 3 library implementing a fast version of the famous SVD algorithm popularized by Simon Funk during the Netflix Prize contest.

Numba is used to speed up our algorithm, enabling us to run over 10 times faster than Surprise's Cython implementation (cf. benchmark notebook).

| Movielens 20M | RMSE | MAE | Time |
|---|---|---|---|
| Surprise | 0.88 | 0.68 | 10 min 40 sec |
| Funk-svd | 0.88 | 0.68 | 42 sec |

Installation

Run pip install git+https://github.com/gbolmier/funk-svd in your terminal.

Contributing

All contributions, bug reports, bug fixes, enhancements, and ideas are welcome.

A detailed overview on how to contribute can be found in the contributor guide.

Quick example

run_experiment.py:

>>> from funk_svd.dataset import fetch_ml_ratings
>>> from funk_svd import SVD

>>> from sklearn.metrics import mean_absolute_error


>>> df = fetch_ml_ratings(variant='100k')

>>> train = df.sample(frac=0.8, random_state=7)
>>> val = df.drop(train.index.tolist()).sample(frac=0.5, random_state=8)
>>> test = df.drop(train.index.tolist()).drop(val.index.tolist())

>>> svd = SVD(lr=0.001, reg=0.005, n_epochs=100, n_factors=15,
...           early_stopping=True, shuffle=False, min_rating=1, max_rating=5)

>>> svd.fit(X=train, X_val=val)
Preprocessing data...

Epoch 1/...

>>> pred = svd.predict(test)
>>> mae = mean_absolute_error(test['rating'], pred)

>>> print(f'Test MAE: {mae:.2f}')
Test MAE: 0.75
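
The benchmark above reports RMSE alongside MAE, while the example only computes MAE through scikit-learn. As a minimal sketch of what the two metrics measure, here they are computed by hand with NumPy on made-up ratings (the arrays below are hypothetical, not output of the library):

```python
import numpy as np

# Hypothetical true ratings and predictions, for illustration only.
y_true = np.array([4.0, 3.0, 5.0, 1.0])
y_pred = np.array([3.0, 3.0, 4.0, 3.0])

errors = y_true - y_pred                 # [1, 0, 1, -2]

mae = np.mean(np.abs(errors))            # (1 + 0 + 1 + 2) / 4
rmse = np.sqrt(np.mean(errors ** 2))     # sqrt((1 + 0 + 1 + 4) / 4)

print(f'MAE: {mae:.2f}')    # MAE: 1.00
print(f'RMSE: {rmse:.2f}')  # RMSE: 1.22
```

RMSE squares each error before averaging, so the single error of 2 weighs more heavily on it than on MAE.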

Funk SVD for recommendation in a nutshell

We have a huge sparse matrix:

$R = \begin{pmatrix} {\color{Red} ?} & 2 & \cdots & {\color{Red} ?} & {\color{Red} ?} \\ {\color{Red} ?} & {\color{Red} ?} & \cdots & {\color{Red} ?} & 4.5 \\ \vdots & \ddots & \ddots & \ddots & \vdots \\ 3 & {\color{Red} ?} & \cdots & {\color{Red} ?} & {\color{Red} ?} \\ {\color{Red} ?} & {\color{Red} ?} & \cdots & 5 & {\color{Red} ?} \end{pmatrix}$

storing known ratings for a set of users and items.

The idea is to estimate unknown ratings by factorizing the rating matrix into two smaller matrices representing user and item characteristics:

$P = \begin{pmatrix} 0.37 & \cdots & 0.69 \\ \vdots & \ddots & \vdots \\ \vdots & \ddots & \vdots \\ \vdots & \ddots & \vdots \\ 1.08 & \cdots & 0.24 \end{pmatrix} , Q = \begin{pmatrix} 0.09 & \cdots & \cdots & \cdots & 0.46 \\ \vdots & \ddots & \ddots & \ddots & \vdots \\ 0.51 & \cdots & \cdots & \cdots & 0.72 \end{pmatrix}$

We call these two matrices the user and item latent factors. Taking the product of the two matrices then reconstructs our rating matrix. The trick is that the empty values will now contain estimated ratings.
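
As a minimal sketch of this reconstruction, assuming NumPy and small made-up factor values (the shapes and numbers are illustrative, not taken from the library), with `P` holding one row per user and `Q` one row per item:

```python
import numpy as np

# Hypothetical latent factors: 3 users, 4 items, F = 2 factors.
P = np.array([[0.5, 1.0],
              [1.5, 0.0],
              [0.0, 2.0]])   # shape (n_users, F)
Q = np.array([[1.0, 0.5],
              [0.0, 1.0],
              [2.0, 0.0],
              [1.0, 1.0]])   # shape (n_items, F)

# Each entry of R_hat is the dot product of one user row and one item row,
# so the reconstruction is dense even where the original rating was unknown.
R_hat = P @ Q.T              # shape (n_users, n_items)

print(R_hat.shape)   # (3, 4)
print(R_hat[0, 3])   # 0.5 * 1.0 + 1.0 * 1.0 = 1.5
```

Only `n_users * F + n_items * F` numbers are stored, yet every user/item pair gets an estimate.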

To get more accurate results, the global average rating and the user and item biases are also used:

$\bar{r} = \frac{1}{N} \sum_{i=1}^{N} K_{i}$

where $K$ stands for the known ratings and $N$ for their number.

$bu = \begin{pmatrix} 0.35 & \cdots & 0.07 \end{pmatrix}$

$bi = \begin{pmatrix} 0.16 & \cdots & 0.40 \end{pmatrix}$

Then, we can estimate any rating by applying:

$\hat{r}_{u, i} = \bar{r} + bu_{u} + bi_{i} + \sum_{f=1}^{F} P_{u, f} * Q_{i, f}$
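
A minimal sketch of this prediction rule, assuming NumPy arrays for the biases and factors. The function name `predict_rating` and all values are hypothetical and chosen only so the arithmetic is easy to follow:

```python
import numpy as np


def predict_rating(u, i, global_mean, bu, bi, P, Q):
    """Estimate r_hat[u, i] = global_mean + bu[u] + bi[i] + P[u] . Q[i]."""
    return global_mean + bu[u] + bi[i] + np.dot(P[u], Q[i])


# Hypothetical values for 2 users, 2 items and F = 2 factors.
global_mean = 3.5
bu = np.array([0.25, -0.5])               # user biases
bi = np.array([0.5, 0.0])                 # item biases
P = np.array([[0.5, 1.0], [1.0, 0.0]])    # user factors, shape (n_users, F)
Q = np.array([[1.0, 0.5], [0.0, 2.0]])    # item factors, shape (n_items, F)

r_hat = predict_rating(0, 1, global_mean, bu, bi, P, Q)
print(r_hat)  # 3.5 + 0.25 + 0.0 + (0.5 * 0.0 + 1.0 * 2.0) = 5.75
```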

The learning step consists of running stochastic gradient descent (SGD): for each known rating, the biases and latent factors are updated as follows:

$err = r - \hat{r}$

$bu_{u} = bu_{u} + \alpha * (err - \lambda * bu_{u})$

$bi_{i} = bi_{i} + \alpha * (err - \lambda * bi_{i})$

$P_{u, f} = P_{u, f} + \alpha * (err * Q_{i, f} - \lambda * P_{u, f})$

$Q_{i, f} = Q_{i, f} + \alpha * (err * P_{u, f} - \lambda * Q_{i, f})$

where $\alpha$ is the learning rate and $\lambda$ is the regularization term.
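
The update rules can be sketched in plain Python and NumPy as follows. This is an illustrative loop under stated assumptions, not the library's Numba-compiled code: the function names, the toy ratings, the random initialization, and the choice to cache the old factor values before updating them are all assumptions of this sketch.

```python
import numpy as np


def predict(u, i, global_mean, bu, bi, P, Q):
    return global_mean + bu[u] + bi[i] + np.dot(P[u], Q[i])


def sgd_epoch(ratings, global_mean, bu, bi, P, Q, lr, reg):
    """One pass over (user, item, rating) triples; arrays are updated in place."""
    for u, i, r in ratings:
        err = r - predict(u, i, global_mean, bu, bi, P, Q)

        bu[u] += lr * (err - reg * bu[u])
        bi[i] += lr * (err - reg * bi[i])

        for f in range(P.shape[1]):
            # Cache the old values so both factor updates use the same state
            # (a choice made in this sketch).
            puf, qif = P[u, f], Q[i, f]
            P[u, f] += lr * (err * qif - reg * puf)
            Q[i, f] += lr * (err * puf - reg * qif)


def train_rmse(ratings, global_mean, bu, bi, P, Q):
    sq = [(r - predict(u, i, global_mean, bu, bi, P, Q)) ** 2
          for u, i, r in ratings]
    return float(np.sqrt(np.mean(sq)))


# Hypothetical toy data: 3 users, 3 items, 6 known ratings.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 2.0), (2, 2, 1.0)]
global_mean = float(np.mean([r for _, _, r in ratings]))

rng = np.random.default_rng(0)
n_factors = 2
P = rng.normal(0.0, 0.1, size=(3, n_factors))   # user factors
Q = rng.normal(0.0, 0.1, size=(3, n_factors))   # item factors
bu = np.zeros(3)                                # user biases
bi = np.zeros(3)                                # item biases

rmse_before = train_rmse(ratings, global_mean, bu, bi, P, Q)
for _ in range(200):
    sgd_epoch(ratings, global_mean, bu, bi, P, Q, lr=0.01, reg=0.02)
rmse_after = train_rmse(ratings, global_mean, bu, bi, P, Q)

print(f'Train RMSE before: {rmse_before:.3f}, after: {rmse_after:.3f}')
```

Here `lr` plays the role of $\alpha$ and `reg` the role of $\lambda$. On this toy data the training RMSE should drop over the epochs, since each update moves the parameters against the gradient of the regularized squared error.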

License

MIT license, see here.

Owner

Geoffrey Bolmier
