Decision Tree Regression algorithm implemented in Python from scratch.

Last update: Dec 22, 2021

Overview

Decision_Tree_Regression

I implemented the decision tree regression algorithm in Python. Unlike ordinary linear regression, which fits a single straight line, this algorithm is meant for datasets where the relationship between input and output is non-linear. It uses a decision tree to fit multiple regression lines recursively: at each step the training dataset is split into two parts and a regression line is fit to each part. Each split is placed at the point that minimizes the Mean Squared Error (MSE).
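The split search described above can be sketched as follows. This is a minimal illustration rather than the project's actual code: the function names `segment_sse` and `best_split`, the use of `np.polyfit` for the line fits, and the choice of the midpoint between neighbouring samples as the threshold are all assumptions made for the example.

```python
import numpy as np

def segment_sse(x, y):
    """Sum of squared errors of a least-squares line fitted to (x, y)."""
    slope, intercept = np.polyfit(x, y, 1)
    return float(np.sum((y - (slope * x + intercept)) ** 2))

def best_split(x, y, min_samples=3):
    """Try every split position and return (threshold, mse) for the best one.

    Returns None when there are too few points to split.
    min_samples is a hypothetical parameter: the fewest points allowed per side.
    """
    order = np.argsort(x)
    x, y = x[order], y[order]
    n = len(x)
    best = None
    for i in range(min_samples, n - min_samples + 1):
        if x[i] == x[i - 1]:
            continue  # cannot place a threshold between equal x values
        # One regression line per side; total error is the sum over both sides.
        mse = (segment_sse(x[:i], y[:i]) + segment_sse(x[i:], y[i:])) / n
        if best is None or mse < best[1]:
            best = ((x[i - 1] + x[i]) / 2, mse)
    return best

# Usage on a noiseless V-shaped signal, where the ideal split is at the kink.
x = np.linspace(0.0, 1.0, 41)
y = np.abs(x - 0.5)
threshold, mse = best_split(x, y)
print(threshold, mse)
```

A single straight line cannot follow the V shape, but two lines joined at the kink fit it almost exactly, which is why the search settles near x = 0.5.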

The number of regression lines is key: too many lines overfit the data, and too few underfit it. The algorithm has two hyperparameters, the maximum depth of the decision tree and the minimum number of samples in a single split. Both should be tested and tuned for each dataset.
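To show how the two hyperparameters limit the number of regression lines, here is a small recursive sketch. It is not the project's implementation: the dictionary-based tree, the names `build_tree`, `predict_one` and `count_leaves`, and the default values are assumptions, and the code expects `x` to be sorted with `min_samples` of at least 2.

```python
import numpy as np

def line_sse(x, y):
    """Sum of squared errors of a least-squares line fitted to (x, y)."""
    slope, intercept = np.polyfit(x, y, 1)
    return float(np.sum((y - (slope * x + intercept)) ** 2))

def build_tree(x, y, depth=0, max_depth=3, min_samples=5):
    """Recursively split sorted (x, y); every leaf stores one fitted line."""
    n = len(x)
    best_i, best_sse = None, np.inf
    if depth < max_depth:  # max_depth stops the recursion
        # min_samples keeps at least that many points on each side
        for i in range(min_samples, n - min_samples + 1):
            sse = line_sse(x[:i], y[:i]) + line_sse(x[i:], y[i:])
            if sse < best_sse:
                best_i, best_sse = i, sse
    if best_i is None:  # no split allowed: this node becomes a leaf
        slope, intercept = np.polyfit(x, y, 1)
        return {"slope": slope, "intercept": intercept}
    return {
        "threshold": (x[best_i - 1] + x[best_i]) / 2,
        "left": build_tree(x[:best_i], y[:best_i], depth + 1, max_depth, min_samples),
        "right": build_tree(x[best_i:], y[best_i:], depth + 1, max_depth, min_samples),
    }

def predict_one(tree, x0):
    """Walk down to the leaf containing x0 and evaluate its line."""
    while "threshold" in tree:
        tree = tree["left"] if x0 <= tree["threshold"] else tree["right"]
    return tree["slope"] * x0 + tree["intercept"]

def count_leaves(tree):
    """Number of regression lines in the tree."""
    if "threshold" not in tree:
        return 1
    return count_leaves(tree["left"]) + count_leaves(tree["right"])

# Usage: a noisy sine wave fitted with a shallow and a deeper tree.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 2 * np.pi, 200))
y = np.sin(x) + rng.normal(0.0, 0.1, 200)
shallow = build_tree(x, y, max_depth=1)
deep = build_tree(x, y, max_depth=4)
print(count_leaves(shallow), count_leaves(deep))
```

A tree of depth d holds at most 2^d lines, so raising the maximum depth increases flexibility quickly. Comparing the error on a held-out test set for several depths is one way to see where overfitting starts.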

Creating Datasets

Instead of using datasets downloaded from the internet, I decided to create my own datasets for this project. I generated 4 datasets to test my algorithm: Noisy Sinusoidal Signal, Noisy Second Degree Polynomial, Noisy Linear Line and Noisy Upside Down Triangle Signal. The program generates these datasets when it is run and saves them so that the results can be reproduced. To generate new datasets, simply delete the first dataset file, dataset0.csv. You can also use your own datasets by placing them in the same directory as the Python project.
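The generate-once, reuse-afterwards behaviour described above might look roughly like the sketch below. Only the file name dataset0.csv comes from this README. The names dataset1.csv to dataset3.csv, the function names, the signal formulas, the noise level, and the reading of "upside down triangle" as a V shape are assumptions made for illustration.

```python
import os
import numpy as np

def make_datasets(n=200, noise=0.1, seed=None):
    """Return four noisy (x, y) arrays of shape (n, 2)."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n)
    shapes = [
        np.sin(2 * np.pi * x),  # sinusoidal signal
        (2 * x - 1) ** 2,       # second degree polynomial
        2 * x - 1,              # straight line
        np.abs(2 * x - 1),      # upside down triangle (V shape)
    ]
    return [np.column_stack([x, s + rng.normal(0.0, noise, n)]) for s in shapes]

def load_or_create(directory=".", **kwargs):
    """Load the saved datasets, generating them first if dataset0.csv is missing."""
    paths = [os.path.join(directory, f"dataset{i}.csv") for i in range(4)]
    if not os.path.exists(paths[0]):
        for path, data in zip(paths, make_datasets(**kwargs)):
            np.savetxt(path, data, delimiter=",")
    return [np.loadtxt(path, delimiter=",") for path in paths]
```

Checking only the first file mirrors the instruction to delete dataset0.csv when new data is wanted: the remaining files are simply overwritten on the next run.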

Plotting Results

The plots show the results for the sinusoidal signal and the upside down triangle under various hyperparameter settings. Colored points mark the splits of the training dataset, black lines show the regression line fitted to each split, and the larger gray points are the test dataset.

For these datasets, the best value observed for the maximum depth is 4.

