ML Kaggle Titanic Problem using Logistic Regression

Overview

-ML-Kaggle-Titanic-Problem-using-LogisticRegrission

Here you will find a solution to the Titanic problem on Kaggle, with comments and step-by-step code.



Problem Overview

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).


Table of Contents
  1. Analyze and visualize the Dataset
  2. Clean and prepare the dataset for our ML model
  3. Build & Train Our Model
  4. Calculate the Accuracy of the model
  5. Prepare the submission file to submit it to Kaggle

Load & Analyze Our Dataset

  • First we read the data from the CSV files
    import pandas as pd

    data_train = pd.read_csv('titanic/train.csv')
    data_test = pd.read_csv('titanic/test.csv')

Take a first look at the given data

   print(data_train.head())
PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
0            1         0       3  ...   7.2500   NaN         S
1            2         1       1  ...  71.2833   C85         C
2            3         1       3  ...   7.9250   NaN         S
3            4         1       1  ...  53.1000  C123         S
4            5         0       3  ...   8.0500   NaN         S   
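The preview already shows NaN entries in the Cabin column. A quick way to see how many values are missing per column is sketched below. The small DataFrame is a hypothetical stand-in with illustrative values, not the real train.csv; with the real data you would call the same method on data_train.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for train.csv (values are illustrative only)
sample = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Cabin': [np.nan, 'C85', np.nan, 'C123'],
    'Embarked': ['S', 'C', 'S', np.nan],
})

# Count the missing entries in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

Columns with many missing values are candidates for dropping, while columns with only a few gaps can be filled in.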

## Note
The Survived column is what we’re trying to predict. We call this column the target, and the remaining columns are called features.

### Count the number of survivors and deaths
```py
data_train['Survived'].value_counts()  # 342 survived | 549 did not survive
```

Plot the number of survivors and deaths

import matplotlib.pyplot as plt

counts = data_train['Survived'].value_counts()
plt.figure(figsize=(5, 5))
plt.bar(counts.index.astype(str), counts.values, color=['r', 'g'])  # 0 = died (red), 1 = survived (green)
plt.show()
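The two counts also give a useful reference point: a model that always predicts the majority class ("did not survive") already reaches a certain accuracy, so any trained model should beat it. The arithmetic below uses only the counts reported above; the variable names are illustrative.

```python
survived = 342       # count reported above
not_survived = 549   # count reported above
total = survived + not_survived

# Accuracy of always predicting the majority class ("did not survive")
baseline_accuracy = max(survived, not_survived) / total

print(f'Total passengers: {total}')
print(f'Majority-class baseline accuracy: {baseline_accuracy:.3f}')
```

On the training data this baseline is roughly 0.62, which is the figure to compare the final accuracy against.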

Analyze the age distribution

plt.figure(figsize=(5, 7))
plt.hist(data_train['Age'].dropna(), color='purple')  # skip missing ages
plt.title('Age Distribution')
plt.xlabel('Age')
plt.show()


Note: Now that we have explored the data a little, it's time to clean it up. If you look at the available columns, you may notice that some of them are of little use for prediction, and keeping them may hurt our model's performance.

Here we define our cleaning function

def clean(data):
    # Drop the columns we will not use as features
    data = data.drop(['Ticket', 'Cabin', 'Name'], axis=1)
    cols = ['SibSp', 'Parch', 'Fare', 'Age']

    # Fill the missing numeric values with the column mean
    for col in cols:
        data[col] = data[col].fillna(data[col].mean())

    # Mark missing Embarked values as 'U' (unknown)
    data['Embarked'] = data['Embarked'].fillna('U')
    return data

# Now we call our function and start cleaning!

data_train = clean(data_train)
data_test = clean(data_test)
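To make the effect of the two fill steps concrete, here is a minimal sketch on a hypothetical three-row frame (the values are illustrative, not taken from the Titanic data). It applies the same mean fill and 'U' placeholder that the cleaning function uses.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are illustrative, not from the Titanic data
toy = pd.DataFrame({'Age': [20.0, np.nan, 40.0], 'Embarked': ['S', None, 'C']})

# The mean of the non-missing ages is (20 + 40) / 2 = 30
toy['Age'] = toy['Age'].fillna(toy['Age'].mean())

# Missing ports get the placeholder category 'U'
toy['Embarked'] = toy['Embarked'].fillna('U')

print(toy)
```

Note that the mean is computed from the non-missing values only, so the filled value does not shift the column average.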

## Note: Now we need to convert the Sex feature into numeric values, such as 1 for male and 0 for female, and do the same for the Embarked feature

Here we use the LabelEncoder from sklearn's preprocessing module to do this job

from sklearn import preprocessing

cols = ['Sex', 'Embarked']
for col in cols:
    # Fit on the training column, then reuse the same mapping for the test set
    le = preprocessing.LabelEncoder()
    data_train[col] = le.fit_transform(data_train[col])
    data_test[col] = le.transform(data_test[col])
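To check which number each category receives, you can inspect the encoder's classes_ attribute, which holds the labels in sorted order. The sketch below uses a small hand-written list rather than the real Sex column; the variable names are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(['male', 'female', 'female', 'male'])

# classes_ holds the sorted labels; each label's position is its numeric code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)

encoded = encoder.transform(['female', 'male'])
print(encoded.tolist())
```

Because the labels are sorted alphabetically, 'female' maps to 0 and 'male' maps to 1.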

## Now our data is ready! It's time to build our model

We store the target column ['Survived'] in y and drop it from the feature data x

y = data_train['Survived']
x = data_train.drop('Survived', axis=1)

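Here we split our data into training and validation sets. With test_size=0.02, only 18 of the 891 training rows are held out for validation, so the accuracy measured on them is a rough estimate.

from sklearn.model_selection import train_test_split
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.02, random_state=10)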
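The size of each part can be checked without the real data. The sketch below splits a plain index array of 891 entries (342 + 549, per the counts above) with the same test_size and random_state; the array is a stand-in for the real rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 891  # 342 survived + 549 not survived, per the counts above
indices = np.arange(n_rows)  # stand-in for the real rows

train_idx, val_idx = train_test_split(indices, test_size=0.02, random_state=10)
print(len(train_idx), len(val_idx))
```

A larger test_size, such as 0.2, is a common choice when a steadier validation score matters more than a few extra training rows.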

Initialize the model

from sklearn.linear_model import LogisticRegression
model = LogisticRegression(random_state=0, max_iter=10000)

Train our model and predict on the validation set

model.fit(x_train, y_train)
predictions = model.predict(x_val)
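A fitted logistic regression also offers a rough view of which features push predictions up or down, through its coef_ attribute. The following is a minimal sketch on hypothetical toy data in which the label is fully determined by a 'Sex' column. The column names and values are assumptions for illustration, not results from the Titanic data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: 'Sex' fully determines the label, 'Noise' is random
rng = np.random.default_rng(0)
toy_x = pd.DataFrame({
    'Sex': [0, 1] * 20,            # 0 = female, 1 = male
    'Noise': rng.normal(size=40),
})
toy_y = 1 - toy_x['Sex']           # in this toy set, label 1 whenever Sex == 0

toy_model = LogisticRegression(random_state=0, max_iter=10000)
toy_model.fit(toy_x, toy_y)

# One coefficient per feature; the sign shows the direction of the effect
coefficients = pd.Series(toy_model.coef_[0], index=toy_x.columns)
print(coefficients)
```

With the real model, the same two lines applied to model.coef_[0] and x_train.columns would list the learned weights, keeping in mind that unscaled features make the magnitudes hard to compare directly.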

## Great! Our model is now trained and ready to use

It's time to check the accuracy for our model

from sklearn.metrics import accuracy_score
print('Accuracy=', accuracy_score(y_val, predictions))

Output:

Accuracy=0.97777
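A score from one small validation split can vary a lot with the choice of rows. One common alternative is k-fold cross-validation, sketched below on synthetic data from make_classification. The synthetic data and the names X_demo, y_demo, and model_demo are assumptions for illustration; with the real data you would pass x and y instead.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption: replace with x and y from above)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

model_demo = LogisticRegression(random_state=0, max_iter=10000)

# Train and score the model on 5 different train/validation partitions
scores = cross_val_score(model_demo, X_demo, y_demo, cv=5, scoring='accuracy')
print('Mean accuracy:', scores.mean())
print('Std across folds:', scores.std())
```

The spread across folds gives a sense of how much a single-split accuracy can be trusted.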

Now we submit our predictions to Kaggle

# Predict on the cleaned and encoded test set
submit_pred = model.predict(data_test)
test = pd.read_csv('titanic/test.csv')
df = pd.DataFrame({'PassengerId': test['PassengerId'].values, 'Survived': submit_pred})
df.to_csv('submit_this_file.csv', index=False)
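Before uploading, it can help to confirm that the file has exactly the two columns PassengerId and Survived and that the predictions are 0 or 1. The sketch below does this round trip in memory with three hypothetical rows; the IDs and predictions are illustrative only.

```python
import io
import pandas as pd

# Hypothetical predictions for three passengers (illustrative values)
submission = pd.DataFrame({'PassengerId': [892, 893, 894], 'Survived': [0, 1, 0]})

# Write to an in-memory buffer instead of a file, then read it back
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
buffer.seek(0)
loaded = pd.read_csv(buffer)

columns_ok = list(loaded.columns) == ['PassengerId', 'Survived']
values_ok = bool(loaded['Survived'].isin([0, 1]).all())
print('Columns OK:', columns_ok, '| Values OK:', values_ok)
```

The same checks can be run on submit_this_file.csv by passing its path to pd.read_csv.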
Owner
Mahmoud Nasser Abdulhamed