🤖 ⚡ scikit-learn tips

Overview

🤖 scikit-learn tips

New tips are posted on LinkedIn, Twitter, and Facebook.

👉 Sign up to receive 2 video tips by email every week! 👈

List of all tips

Click to discuss the tip on LinkedIn, click to view the Jupyter notebook for a tip, or click to watch the tip video on YouTube:

# Description Links
1 Use ColumnTransformer to apply different preprocessing to different columns
2 Seven ways to select columns using ColumnTransformer
3 What is the difference between "fit" and "transform"?
4 Use "fit_transform" on training data, but "transform" (only) on testing/new data
5 Four reasons to use scikit-learn (not pandas) for ML preprocessing
6 Encode categorical features using OneHotEncoder or OrdinalEncoder
7 Handle unknown categories with OneHotEncoder by encoding them as zeros
8 Use Pipeline to chain together multiple steps
9 Add a missing indicator to encode "missingness" as a feature
10 Set a "random_state" to make your code reproducible
11 Impute missing values using KNNImputer or IterativeImputer
12 What is the difference between Pipeline and make_pipeline?
13 Examine the intermediate steps in a Pipeline
14 HistGradientBoostingClassifier natively supports missing values
15 Three reasons not to use drop='first' with OneHotEncoder
16 Use cross_val_score and GridSearchCV on a Pipeline
17 Try RandomizedSearchCV if GridSearchCV is taking too long
18 Display GridSearchCV or RandomizedSearchCV results in a DataFrame
19 Important tuning parameters for LogisticRegression
20 Plot a confusion matrix
21 Compare multiple ROC curves in a single plot
22 Use the correct methods for each type of Pipeline
23 Display the intercept and coefficients for a linear model
24 Visualize a decision tree two different ways
25 Prune a decision tree to avoid overfitting
26 Use stratified sampling with train_test_split
27 Two ways to impute missing values for a categorical feature
28 Save a model or Pipeline using joblib
29 Vectorize two text columns in a ColumnTransformer
30 Four ways to examine the steps of a Pipeline
31 Shuffle your dataset when using cross_val_score
32 Use AUC to evaluate multiclass problems
33 Use FunctionTransformer to convert functions into transformers
34 Add feature selection to a Pipeline
35 Don't use .values when passing a pandas object to scikit-learn
36 Most parameters should be passed as keyword arguments
37 Create an interactive diagram of a Pipeline in Jupyter
38 Get the feature names output by a ColumnTransformer
39 Load a toy dataset into a DataFrame
40 Estimators only print parameters that have been changed
41 Drop the first category from binary features (only) with OneHotEncoder
42 Passthrough some columns and drop others in a ColumnTransformer
43 Use OrdinalEncoder instead of OneHotEncoder with tree-based models
44 Speed up GridSearchCV using parallel processing
45 Create feature interactions using PolynomialFeatures
46 Ensemble multiple models using VotingClassifer or VotingRegressor
47 Tune the parameters of a VotingClassifer or VotingRegressor
48 Access part of a Pipeline using slicing
49 Tune multiple models simultaneously with GridSearchCV
50 Adapt this pattern to solve many Machine Learning problems

You can interact with all of these notebooks online using Binder:

Note: Some of the tips do not include any code, and can only be viewed on LinkedIn.

Who creates these tips?

Hi! I'm Kevin Markham, the founder of Data School. I've been teaching data science in Python since 2014. I create these tips because I love using scikit-learn and I want to help others use it more effectively.

How can I get better at scikit-learn?

I teach three courses:

👉 Find out which course is right for you! 👈

Do you have any other tips?

Yes! In 2019, I posted 100 pandas tricks. I also created a video featuring my top 25 pandas tricks.

© 2020-2021 Data School. All rights reserved.

Owner
Kevin Markham
Founder of Data School
Kevin Markham
Datetimes for Humans™

Maya: Datetimes for Humans™ Datetimes are very frustrating to work with in Python, especially when dealing with different locales on different systems

Timo Furrer 3.4k Dec 28, 2022
🔬 A curated list of awesome machine learning strategies & tools in financial market.

🔬 A curated list of awesome machine learning strategies & tools in financial market.

GeorgeZou 1.6k Dec 30, 2022
Winning solution for the Galaxy Challenge on Kaggle

Winning solution for the Galaxy Challenge on Kaggle

Sander Dieleman 483 Jan 02, 2023
MasTrade is a trading bot in baselines3,pytorch,gym

mastrade MasTrade is a trading bot in baselines3,pytorch,gym idea we have for example 1 btc and we buy a crypto with it with market option to trade in

Masoud Azizi 18 May 24, 2022
distfit - Probability density fitting

Python package for probability density function fitting of univariate distributions of non-censored data

Erdogan Taskesen 187 Dec 30, 2022
SPCL 48 Dec 12, 2022
Combines Bayesian analyses from many datasets.

PosteriorStacker Combines Bayesian analyses from many datasets. Introduction Method Tutorial Output plot and files Introduction Fitting a model to a d

Johannes Buchner 19 Feb 13, 2022
SmartSim makes it easier to use common Machine Learning (ML) libraries like PyTorch and TensorFlow

SmartSim makes it easier to use common Machine Learning (ML) libraries like PyTorch and TensorFlow, in High Performance Computing (HPC) simulations and workloads.

Generate music from midi files using BPE and markov model

Generate music from midi files using BPE and markov model

Aditya Khadilkar 37 Oct 24, 2022
Tutorial for Decision Threshold In Machine Learning.

Decision-Threshold-ML Tutorial for improve skills: 'Decision Threshold In Machine Learning' (from GeeksforGeeks) by Marcus Mariano For more informatio

0 Jan 20, 2022
The code from the Machine Learning Bookcamp book and a free course based on the book

The code from the Machine Learning Bookcamp book and a free course based on the book

Alexey Grigorev 5.5k Jan 09, 2023
Using Logistic Regression and classifiers of the dataset to produce an accurate recall, f-1 and precision score

Using Logistic Regression and classifiers of the dataset to produce an accurate recall, f-1 and precision score

Thines Kumar 1 Jan 31, 2022
CyLP is a Python interface to COIN-OR’s Linear and mixed-integer program solvers (CLP, CBC, and CGL)

CyLP CyLP is a Python interface to COIN-OR’s Linear and mixed-integer program solvers (CLP, CBC, and CGL). CyLP’s unique feature is that you can use i

COIN-OR Foundation 161 Dec 14, 2022
Apache (Py)Spark type annotations (stub files).

PySpark Stubs A collection of the Apache Spark stub files. These files were generated by stubgen and manually edited to include accurate type hints. T

Maciej 114 Nov 22, 2022
Greykite: A flexible, intuitive and fast forecasting library

The Greykite library provides flexible, intuitive and fast forecasts through its flagship algorithm, Silverkite.

LinkedIn 1.4k Jan 15, 2022
使用数学和计算机知识投机倒把

偷鸡不成项目集锦 坦率地讲,涉及金融市场的好策略如果公开,必然导致使用的人多,最后策略变差。所以这个仓库只收集我目前失败了的案例。 加密货币组合套利 中国体育彩票预测 我赚不上钱的项目,也许可以帮助更有能力的人去赚钱。

Roy 28 Dec 29, 2022
机器学习检测webshell

ai-webshell-detect 机器学习检测webshell,利用textcnn+简单二分类网络,基于keras,花了七天 检测原理: 从文件熵 文件长度 文件语句提取出特征,然后文件熵与长度送入二分类网络,文件语句送入textcnn 项目原理,介绍,怎么做出来的

Huoji's 56 Dec 14, 2022
Real-time stream processing for python

Streamz Streamz helps you build pipelines to manage continuous streams of data. It is simple to use in simple cases, but also supports complex pipelin

Python Streamz 1.1k Dec 28, 2022
A simple application that calculates the probability distribution of a normal distribution

probability-density-function General info An application that calculates the probability density and cumulative distribution of a normal distribution

1 Oct 25, 2022
Projeto: Machine Learning: Linguagens de Programacao 2004-2001

Projeto: Machine Learning: Linguagens de Programacao 2004-2001 Projeto de Data Science e Machine Learning de análise de linguagens de programação de 2

Victor Hugo Negrisoli 0 Jun 29, 2021