A Guide for Feature Engineering and Feature Selection, with implementations and examples in Python.

Overview

Feature Engineering & Feature Selection

A comprehensive guide [pdf] [markdown] for Feature Engineering and Feature Selection, with implementations and examples in Python.

Motivation

Feature Engineering & Selection is the most essential part of building a useable machine learning project, even though hundreds of cutting-edge machine learning algorithms coming in these days like deep learning and transfer learning. Indeed, like what Prof Domingos, the author of  'The Master Algorithm' says:

“At the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used.”

— Prof. Pedro Domingos

001

Data and feature has the most impact on a ML project and sets the limit of how well we can do, while models and algorithms are just approaching that limit. However, few materials could be found that systematically introduce the art of feature engineering, and even fewer could explain the rationale behind. This repo is my personal notes from learning ML and serves as a reference for Feature Engineering & Selection.

Download

Download the PDF here:

Same, but in markdown:

PDF has a much readable format, while Markdown has auto-generated anchor link to navigate from outer source. GitHub sucks at displaying markdown with complex grammar, so I would suggest read the PDF or download the repo and read markdown with Typora.

What You'll Learn

Not only a collection of hands-on functions, but also explanation on Why, How and When to adopt Which techniques of feature engineering in data mining.

  • the nature and risk of data problem we often encounter
  • explanation of the various feature engineering & selection techniques
  • rationale to use it
  • pros & cons of each method
  • code & example

Getting Started

This repo is mainly used as a reference for anyone who are doing feature engineering, and most of the modules are implemented through scikit-learn or its communities.

To run the demos or use the customized function, please download the ZIP file from the repo or just copy-paste any part of the code you find helpful. They should all be very easy to understand.

Required Dependencies:

  • Python 3.5, 3.6 or 3.7
  • numpy>=1.15
  • pandas>=0.23
  • scipy>=1.1.0
  • scikit_learn>=0.20.1
  • seaborn>=0.9.0

Table of Contents and Code Examples

Below is a list of methods currently implemented in the repo.

1. Data Exploration

2. Feature Cleaning

3. Feature Engineering

4. Feature Selection

Key Links and Resources

  • Udemy's Feature Engineering online course

https://www.udemy.com/feature-engineering-for-machine-learning/

  • Udemy's Feature Selection online course

https://www.udemy.com/feature-selection-for-machine-learning

  • JMLR Special Issue on Variable and Feature Selection

http://jmlr.org/papers/special/feature03.html

  • Data Analysis Using Regression and Multilevel/Hierarchical Models, Chapter 25: Missing data

http://www.stat.columbia.edu/~gelman/arm/missing.pdf

  • Data mining and the impact of missing data

http://core.ecu.edu/omgt/krosj/IMDSDataMining2003.pdf

  • PyOD: A Python Toolkit for Scalable Outlier Detection

https://github.com/yzhao062/pyod

  • Weight of Evidence (WoE) Introductory Overview

http://documentation.statsoft.com/StatisticaHelp.aspx?path=WeightofEvidence/WeightofEvidenceWoEIntroductoryOverview

  • About Feature Scaling and Normalization

http://sebastianraschka.com/Articles/2014_about_feature_scaling.html

  • Feature Generation with RF, GBDT and Xgboost

https://blog.csdn.net/anshuai_aw1/article/details/82983997

  • A review of feature selection methods with applications

https://ieeexplore.ieee.org/iel7/7153596/7160221/07160458.pdf

Owner
Yimeng.Zhang
I'm a lovely machine learning learner~
Yimeng.Zhang
Streamlit component to display topics from Streamlit's community forum related to any exception.

streamlit-forum Streamlit component to display topics from Streamlit's community forum related to any exception. Installation pip install streamlit-fo

Snehan Kekre 7 Jul 15, 2022
Python dictionaries with advanced dot notation access

from box import Box movie_box = Box({ "Robin Hood: Men in Tights": { "imdb stars": 6.7, "length": 104 } }) movie_box.Robin_Hood_Men_in_Tights.imdb_s

Chris Griffith 2.1k Dec 28, 2022
Hoopoe - Get notified of important stuff, right away.

Hoopoe - Get notified of important stuff, right away. Report a Bug · Request a Feature . Ask a Question Table of Contents About Getting Started Prereq

Vahid Al 8 Nov 12, 2022
Automatic certificate unpinning for Android apps

What is this? Script used to perform automatic certificate unpinning of an APK by adding a custom network security configuration that permits user-add

Antoine Neuenschwander 5 Jul 28, 2021
A place where one-off ideas/partial projects can live comfortably

A place to post ideas, partial projects, or anything else that doesn't necessarily warrant its own repo, from my mind to the web.

Carson Scott 2 Feb 25, 2022
This is some simple code to scrape vistbook's system to get an overview of the different cabins availability.

DNT_cabin_availability_system This is some simple code to scrape visbook's system to get an overview of the different cabins availability. The system

Andreas Lorentzen 1 Sep 25, 2022
Release for Improved Denoising Diffusion Probabilistic Models

improved-diffusion This is the codebase for Improved Denoising Diffusion Probabilistic Models. Usage This section of the README walks through how to t

OpenAI 1.2k Dec 30, 2022
Python code to control laboratory hardware and perform Bayesian reaction optimization on the MIT Make-It system for chemical synthesis

Description This repository contains code accompanying the following paper on the Make-It robotic flow chemistry platform developed by the Jensen Rese

Anirudh Nambiar 11 Dec 10, 2022
Stack BOF Protection Bypass Techniques

Stack Buffer Overflow - Protection Bypass Techniques

ommadawn46 18 Dec 28, 2022
Nag0mi ctf problem 2021 writeup

Nag0mi ctf problem 2021 writeup

3 Apr 04, 2022
Site de gestion de cave à vin utilisant une BDD manipulée avec SQLite3 via Python

cave-vin Site de gestion de cave à vin utilisant une bdd manipulée avec MySQL ACCEDER AU SITE : Pour accéder à votre cave vous aurez besoin de lancer

Elouann Lucas 0 Jul 05, 2022
Algorand Python API examples

Algorand-Py Algorand Python API examples This repo will hold example scripts to monitor activities on Algorand main net. You can: Monitor your assets

Karthik Dutt 2 Jan 23, 2022
Simple calculator made in python

calculator Uma alculadora simples feita em python CMD, PowerShell, Bash ✔️ Início 💻 apt-get update apt-get upgrade -y apt-get install python git git

Spyware 8 Dec 28, 2021
Uproot - A script to bring deeply nested files or directories to the surface

UPROOT Bring deeply nested files or folders to the surface Uproot helps convert

Ted 2 Jan 15, 2022
emoji-math computes the given python expression and returns either the value or the nearest 5 emojis as measured by cosine similarity.

emoji-math computes the given python expression and returns either the value or the nearest 5 emojis as measured by cosine similarity.

Andrew White 13 Dec 11, 2022
A collection of daily usage utility scripts in python. Helps in automation of day to day repetitive tasks.

Kush's Utils Tool is my personal collection of scripts which is used to automated daily tasks. It is a evergrowing collection of scripts and will continue to evolve till the day I program. This is al

Kushagra 10 Jan 16, 2022
Todos os exercícios do Curso de Python, do canal Curso em Vídeo, resolvidos em Python, Javascript, Java, C++, C# e mais...

Exercícios - CeV Oferecido por Linguagens utilizadas atualmente O que vai encontrar aqui? 👀 Esse repositório é dedicado a armazenar todos os enunciad

Coding in Community 43 Nov 10, 2022
A tool to improve Boolean satisfiability (SAT) solver user's life

SatHelper This is a tool to improve the Boolean satisfiability (SAT) and MaxSAT solver user's life. It helps you model various problems as SAT and Max

Tomas Balyo 1 Nov 16, 2021
A web application (with multiple API project options) that uses MariaDB HTAP!

Bookings Bookings is a web application that, backed by the power of the MariaDB Connectors and the MariaDB X4 Platform, unleashes the power of smart t

MariaDB Corporation 4 Dec 28, 2022
Python implementation for Active Directory certificate abuse

Certipy is a Python tool to enumerate and abuse misconfigurations in Active Directory Certificate Services (AD CS). Based on the C# variant Ce

Oliver Lyak 1.3k Jan 09, 2023