A data preprocessing and feature engineering script for a machine learning pipeline is prepared.

Last update: Dec 18, 2021

FEATURE ENGINEERING

Business Problem: A data preprocessing and feature engineering script is needed for a machine learning pipeline. The dataset should be ready for modelling once it has passed through this script.

Story of the Dataset:
The dataset describes the passengers who were aboard the Titanic when it sank. It consists of 891 observations and 12 variables. The target variable is "Survived":

0: the person did not survive.

1: the person survived.

ATTRIBUTES:

PassengerId: ID of the passenger

Survived: Survival status (0: not survived, 1: survived)

Pclass: Ticket class (1: 1st class (upper), 2: 2nd class (middle), 3: 3rd class (lower))

Name: Name of the passenger

Sex: Gender of the passenger (male, female)

Age: Age in years

SibSp: Number of siblings/spouses aboard the Titanic
Sibling = Brother, sister, stepbrother, stepsister
Spouse = Husband, wife (mistresses and fiancés were ignored)

Parch: Number of parents/children aboard the Titanic
Parent = Mother, father
Child = Daughter, son, stepdaughter, stepson
Some children travelled only with a nanny; therefore Parch = 0 for them.
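
The two family-related counts above are often combined into derived features. The following is a minimal sketch of that idea, not the script's confirmed method: the column names `SibSp` and `Parch` are assumed to match the Kaggle CSV, and `add_family_features`, `FamilySize`, and `IsAlone` are hypothetical names introduced for illustration.

```python
import pandas as pd


def add_family_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with two illustrative family features added."""
    out = df.copy()
    # Siblings/spouses + parents/children + the passenger themselves.
    out["FamilySize"] = out["SibSp"] + out["Parch"] + 1
    # 1 if the passenger has no recorded relatives aboard, else 0.
    out["IsAlone"] = (out["FamilySize"] == 1).astype(int)
    return out


# Tiny made-up sample, only to show the shape of the result.
sample = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})
family = add_family_features(sample)
print(family)
```

Note that a child travelling only with a nanny would be flagged as alone under this definition, since Parch = 0 for them.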

Ticket: Ticket number

Fare: Passenger fare

Cabin: Cabin number

Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
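
As a rough illustration of what preprocessing these attributes can look like, the sketch below handles missing values and encodes the categorical columns. The choices here (median imputation for Age, mode imputation for Embarked, a hypothetical `HasCabin` flag, binary and one-hot encoding) are assumptions made for the example and may differ from what the actual script does; `preprocess` is a hypothetical function name.

```python
import pandas as pd


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative cleaning and encoding for a few Titanic columns."""
    out = df.copy()
    # Keep only whether a cabin was recorded, then drop the raw value.
    out["HasCabin"] = out["Cabin"].notna().astype(int)
    out = out.drop(columns=["Cabin"])
    # Fill missing ages with the median age (an assumed strategy).
    out["Age"] = out["Age"].fillna(out["Age"].median())
    # Fill missing ports with the most frequent port (an assumed strategy).
    out["Embarked"] = out["Embarked"].fillna(out["Embarked"].mode()[0])
    # Binary-encode Sex and one-hot encode Embarked.
    out["Sex"] = out["Sex"].map({"male": 0, "female": 1})
    out = pd.get_dummies(out, columns=["Embarked"], drop_first=True, dtype=int)
    return out


# Made-up rows with missing values, only to exercise each step.
raw = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, None, 30.0, 40.0],
    "Cabin": [None, "C85", None, "E46"],
    "Embarked": ["S", "C", None, "S"],
})
clean = preprocess(raw)
print(clean)
```

With `drop_first=True` the first port category is dropped to avoid a redundant column; whether that is desirable depends on the model used downstream.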

REFERENCE: Data Science and ML Boot Camp, 2021, Veri Bilimi Okulu (https://www.veribilimiokulu.com/)

Owner

Pinar Oner
