A Guide for Feature Engineering and Feature Selection, with implementations and examples in Python.

Overview

Feature Engineering & Feature Selection

A comprehensive guide [pdf] [markdown] for Feature Engineering and Feature Selection, with implementations and examples in Python.

Motivation

Feature Engineering & Selection is the most essential part of building a useable machine learning project, even though hundreds of cutting-edge machine learning algorithms coming in these days like deep learning and transfer learning. Indeed, like what Prof Domingos, the author of  'The Master Algorithm' says:

“At the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used.”

— Prof. Pedro Domingos

001

Data and feature has the most impact on a ML project and sets the limit of how well we can do, while models and algorithms are just approaching that limit. However, few materials could be found that systematically introduce the art of feature engineering, and even fewer could explain the rationale behind. This repo is my personal notes from learning ML and serves as a reference for Feature Engineering & Selection.

Download

Download the PDF here:

Same, but in markdown:

PDF has a much readable format, while Markdown has auto-generated anchor link to navigate from outer source. GitHub sucks at displaying markdown with complex grammar, so I would suggest read the PDF or download the repo and read markdown with Typora.

What You'll Learn

Not only a collection of hands-on functions, but also explanation on Why, How and When to adopt Which techniques of feature engineering in data mining.

  • the nature and risk of data problem we often encounter
  • explanation of the various feature engineering & selection techniques
  • rationale to use it
  • pros & cons of each method
  • code & example

Getting Started

This repo is mainly used as a reference for anyone who are doing feature engineering, and most of the modules are implemented through scikit-learn or its communities.

To run the demos or use the customized function, please download the ZIP file from the repo or just copy-paste any part of the code you find helpful. They should all be very easy to understand.

Required Dependencies:

  • Python 3.5, 3.6 or 3.7
  • numpy>=1.15
  • pandas>=0.23
  • scipy>=1.1.0
  • scikit_learn>=0.20.1
  • seaborn>=0.9.0

Table of Contents and Code Examples

Below is a list of methods currently implemented in the repo.

1. Data Exploration

2. Feature Cleaning

3. Feature Engineering

4. Feature Selection

Key Links and Resources

  • Udemy's Feature Engineering online course

https://www.udemy.com/feature-engineering-for-machine-learning/

  • Udemy's Feature Selection online course

https://www.udemy.com/feature-selection-for-machine-learning

  • JMLR Special Issue on Variable and Feature Selection

http://jmlr.org/papers/special/feature03.html

  • Data Analysis Using Regression and Multilevel/Hierarchical Models, Chapter 25: Missing data

http://www.stat.columbia.edu/~gelman/arm/missing.pdf

  • Data mining and the impact of missing data

http://core.ecu.edu/omgt/krosj/IMDSDataMining2003.pdf

  • PyOD: A Python Toolkit for Scalable Outlier Detection

https://github.com/yzhao062/pyod

  • Weight of Evidence (WoE) Introductory Overview

http://documentation.statsoft.com/StatisticaHelp.aspx?path=WeightofEvidence/WeightofEvidenceWoEIntroductoryOverview

  • About Feature Scaling and Normalization

http://sebastianraschka.com/Articles/2014_about_feature_scaling.html

  • Feature Generation with RF, GBDT and Xgboost

https://blog.csdn.net/anshuai_aw1/article/details/82983997

  • A review of feature selection methods with applications

https://ieeexplore.ieee.org/iel7/7153596/7160221/07160458.pdf

Owner
Yimeng.Zhang
I'm a lovely machine learning learner~
Yimeng.Zhang
A discord group chat creator just made it because i saw people selling this stuff for like up to 40 bucks

gccreator some discord group chat tools just made it because i saw people selling this stuff for like up to 40 bucks (im currently working on a faster

baum1810 6 Oct 03, 2022
CHIP-8 interpreter written in Python

chip8py CHIP-8 interpreter written in Python Contents About Installation Usage License About CHIP-8 is an interpreted language developed during the 19

Robert Olaru 1 Nov 09, 2021
→ Plantilla de registro para Python

🔧 Pasos Necesarios CMD 🖥️ SOCKETS pip install sockets 🎨 COLORAMA pip install colorama 💻 Código register-by-inputs from turtle import color # Impor

Panda.xyz 4 Mar 12, 2022
GDIT: Geometry Dash Info Tool

GDIT: Geometry Dash Info Tool This is the first large script that allows you to quickly get information from the Geometry Dash server

dezz0xY 2 Jan 09, 2022
Python scripts to interact with Upper Deck ePack online trading card platform

This script should connect to the Upper Deck ePack API using your browser cookies and download a list of your current collection and save it as a CSV.

Adrian Kent 1 Nov 22, 2021
Shell scripts made simple 🐚

zxpy Shell scripts made simple 🐚 Inspired by Google's zx, but made much simpler and more accessible using Python. Rationale Bash is cool, and it's ex

Tushar Sadhwani 492 Dec 27, 2022
E5 自动续期

请选择跳转 新版本系统 (2021-2-9采用): 以后更新都在AutoApi,采用v0.0版本号覆盖式更新 AutoApi : 最新版 保留1到2个稳定的简易版,防止萌新大范围报错 AutoApi'X' : 稳定版1 ( 即本版AutpApiP ) AutoApiP ( 即v5.0,稳定版 ) —

95 Feb 15, 2021
Fetch PRs from GitHub and analyze which ones are unmergeable

Set up token Generate a personal access token on GitHub. Add repo permissions. export GH_TOKEN="abcdefg" Pull PR data make Usually, GitHub doesn't h

Stefan van der Walt 1 Nov 05, 2021
A python program, imitating functionalities of a banking system

A python program, imitating functionalities of a banking system, in order for users to perform certain operations in a bank.

Moyosore Weke 1 Nov 26, 2021
Prometheus exporter for chess.com player data

chess-exporter Prometheus exporter for chess.com player data implemented via chess.com's published data API and Prometheus Python Client Example use c

Mário Uhrík 7 Feb 28, 2022
Public Management System for ACP's 24H TT Fronteira 2021

CROWD MANAGEMENT SYSTEM 24H TT Vila de Froteira 2021 This python script creates a dashboard with realtime updates regarding the capacity of spectactor

VOST Portugal 1 Nov 24, 2021
Monitor the New World login queue and notify when it is about to finish

nwwatch - Monitor the New World queue and notify when it is about to finish Getting Started install python 3.7+ navigate to the directory where you un

14 Jan 10, 2022
200 LeetCode problems

LeetCode I classify 200 leetcode problems into some categories and upload my code to who concern WEEK 1 # Title Difficulty Array 15 3Sum Medium 1324 P

Hoang Cao Bao 108 Dec 08, 2022
Structured, dependable legos for starknet development.

Structured, dependable legos for starknet development.

Alucard 127 Nov 23, 2022
RangDev Notepad App With Python

RangDev Notepad-App-With-Python Take down quick and speedy notes! This is a small project of a notepad app built with Tkinter and SQLite3. Database cr

rangga.alrasya 1 Dec 01, 2021
《赛马娘》(ウマ娘: Pretty Derby)辅助 🐎🖥 基于 auto-derby 可视化操作/设置 启动器 一键包

ok-derby 《赛马娘》(ウマ娘: Pretty Derby)辅助 🐎 🖥 基于 auto-derby 可视化操作/设置 启动器 一键包 便捷,好用的 auto_derby 管理器! 功能 支持客户端 DMM (前台) 实验性 安卓 ADB 连接(后台)开发基于 1080x1920 分辨率

秋葉あんず 90 Jan 01, 2023
WATTS provides a set of Python classes that can manage simulation workflows for multiple codes where information is exchanged at a coarse level

WATTS (Workflow and Template Toolkit for Simulation) provides a set of Python classes that can manage simulation workflows for multiple codes where information is exchanged at a coarse level.

13 Dec 23, 2022
Projeto-menu - This project is designed to learn more about control mechanisms in Python programming

Projeto-menu - This project is designed to learn more about control mechanisms in Python programming

Henrik Ricarte 2 Mar 01, 2022
Probably the best way to simulate block scopes in Python

This is a package, as it says on the tin, to emulate block scoping in Python, the lack of which being a clever design choice yet sometimes a trouble.

88 Oct 26, 2022
A simple python project that can find Tangkeke in a given image.

A simple python project that can find Tangkeke in a given image. Make the real Tangkeke image as a kernel to convolute the target image. The area wher

张志衡 1 Dec 08, 2021