Overview

Kaggle Competition: Forest Cover Type Prediction

In this project, we predict the forest cover type (the predominant kind of tree cover) from the cartographic variables given in the training/test datasets. You can find more about this project on the Forest Cover Type Prediction competition page.

This project and its detailed notebooks were created and published on Kaggle.

Project Objective

  • We are given raw unscaled data with both numerical and categorical variables.
  • First, we performed exploratory data analysis to visualize the characteristics of the given variables.
  • We constructed various models to train on our data, utilizing Optuna hyperparameter tuning to find parameters that maximize model accuracy.
  • Using feature engineering techniques, we built new variables to help improve the accuracy of our models.
  • Using the strategies above, we built our final model and generated forest cover type predictions for the test dataset.

Links to Detailed Notebooks

EDA Summary

The purpose of the EDA is to provide an overview of how Python visualization tools can be used to understand this large, complex dataset. EDA is the first step in this workflow, where the decision-making process is initiated for feature selection. Valuable insights can be obtained by looking at the distribution of the target, each feature's relationship to the target, and the links between features.

Visualize Numerical Variables

  • Using histograms, we can visualize the spread and values of the 10 numeric variables.
  • Slope, Vertical Distance to Hydrology, Horizontal Distance to Hydrology, Horizontal Distance to Roadways, and Horizontal Distance to Fire Points are all skewed right.
  • Hillshade 9am, Hillshade Noon, and Hillshade 3pm are all skewed left.

(Figure: histograms of the ten numerical variables)
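As a minimal sketch of this step (assuming the competition's train.csv and its standard column names), the histograms can be reproduced with pandas and matplotlib:

```python
import pandas as pd
import matplotlib.pyplot as plt

train = pd.read_csv("train.csv")  # Kaggle training file (assumed path)

# The ten continuous cartographic variables in the competition data.
numeric_cols = [
    "Elevation", "Aspect", "Slope",
    "Horizontal_Distance_To_Hydrology", "Vertical_Distance_To_Hydrology",
    "Horizontal_Distance_To_Roadways",
    "Hillshade_9am", "Hillshade_Noon", "Hillshade_3pm",
    "Horizontal_Distance_To_Fire_Points",
]

# One histogram per variable to inspect spread and skew.
train[numeric_cols].hist(bins=50, figsize=(14, 10), layout=(4, 3))
plt.tight_layout()
plt.show()
```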

Visualize Categorical Variables

  • The plots below show the number of observations of the different Wilderness Areas and Soil Types.
  • Wilderness Areas 3 and 4 have the most presence.
  • Wilderness Area 2 has the fewest observations.
  • Soil Type 10 has the most observations, followed by Soil Type 29.
  • The Soil Types with the fewest observations are Soil Types 7 and 15.

(Figures: number of observations per Wilderness Area and per Soil Type)
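Since Wilderness_Area and Soil_Type arrive one-hot encoded, the counts can be read off by summing the dummy columns; a sketch under the same assumptions as above:

```python
import pandas as pd
import matplotlib.pyplot as plt

train = pd.read_csv("train.csv")  # assumed path

# The one-hot encoded categorical columns from the competition data.
wilderness_cols = [f"Wilderness_Area{i}" for i in range(1, 5)]
soil_cols = [f"Soil_Type{i}" for i in range(1, 41)]

# Summing each dummy column gives the number of observations per category.
fig, axes = plt.subplots(1, 2, figsize=(16, 4))
train[wilderness_cols].sum().plot.bar(ax=axes[0], title="Wilderness Area counts")
train[soil_cols].sum().plot.bar(ax=axes[1], title="Soil Type counts")
plt.tight_layout()
plt.show()
```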

Feature Correlation

The heatmap, which excludes the binary variables, helps us visualize the correlations between the features. We also provide scatterplots for four pairs of features that had a positive correlation greater than 0.5. These are some of the many visualizations that helped us understand the characteristics of the features for later feature engineering and model selection.

(Figures: correlation heatmap and scatterplots of highly correlated feature pairs)
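A sketch of how such a heatmap and the high-correlation scatterplots can be produced (seaborn assumed; column names as in the sketches above):

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

train = pd.read_csv("train.csv")  # assumed path
numeric_cols = [
    "Elevation", "Aspect", "Slope",
    "Horizontal_Distance_To_Hydrology", "Vertical_Distance_To_Hydrology",
    "Horizontal_Distance_To_Roadways",
    "Hillshade_9am", "Hillshade_Noon", "Hillshade_3pm",
    "Horizontal_Distance_To_Fire_Points",
]

# Correlation heatmap over the continuous variables only (binary dummies excluded).
corr = train[numeric_cols].corr()
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.show()

# Scatterplots for every pair with positive correlation above 0.5.
pairs = [
    (a, b)
    for i, a in enumerate(numeric_cols)
    for b in numeric_cols[i + 1:]
    if corr.loc[a, b] > 0.5
]
for a, b in pairs:
    plt.figure(figsize=(5, 4))
    plt.scatter(train[a], train[b], s=2, alpha=0.3)
    plt.xlabel(a)
    plt.ylabel(b)
    plt.show()
```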

Summary of Challenges

EDA Challenges

  • This project involves a large amount of data, with countless patterns and details to examine.
  • The training data was not a simple random sample of the entire dataset but a stratified sample of the seven forest cover type classes, which may not represent the test data well.
  • Creating a "story" that could be easily incorporated into the corresponding notebooks, such as Feature Engineering, Models, etc.
  • Manipulating the one-hot encoded Wilderness_Area and Soil_Type variables to visualize their distributions against Cover_Type (see the sketch below).
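One way to handle that last challenge is to collapse the dummy columns back into a single categorical column; a minimal sketch, assuming the same train.csv as above:

```python
import pandas as pd
import matplotlib.pyplot as plt

train = pd.read_csv("train.csv")  # assumed path

# Collapse the one-hot Wilderness_Area dummies into one categorical column.
wilderness_cols = [f"Wilderness_Area{i}" for i in range(1, 5)]
train["Wilderness"] = train[wilderness_cols].idxmax(axis=1)

# Cross-tabulate wilderness area against cover type and plot the distribution.
pd.crosstab(train["Wilderness"], train["Cover_Type"]).plot.bar(
    stacked=True, figsize=(10, 5)
)
plt.ylabel("Observations")
plt.show()
```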

Feature Engineering Challenges

  • Adding new variables during feature engineering often produced lower accuracy.
  • Automated feature engineering, using entities and transformations among existing columns of a single dataset, created many new columns that did not positively contribute to the model's accuracy, even after feature selection.
  • Testing the new features was very time consuming, even with the GPU accelerator.
  • After experimenting with several different sets of new features, we found that including only the manually created features yielded the best results (illustrative examples follow this list).
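The exact engineered features live in the Feature_Engineering notebook; the following are hypothetical examples of the kind of manual features commonly built for this dataset, not necessarily the ones used:

```python
import numpy as np
import pandas as pd

train = pd.read_csv("train.csv")  # assumed path

# Straight-line distance to water, combining the two hydrology distances.
train["Euclidean_Distance_To_Hydrology"] = np.sqrt(
    train["Horizontal_Distance_To_Hydrology"] ** 2
    + train["Vertical_Distance_To_Hydrology"] ** 2
)

# Average hillshade across the three times of day.
train["Mean_Hillshade"] = train[
    ["Hillshade_9am", "Hillshade_Noon", "Hillshade_3pm"]
].mean(axis=1)

# Elevation adjusted by vertical distance to water.
train["Elevation_Minus_VDist"] = (
    train["Elevation"] - train["Vertical_Distance_To_Hydrology"]
)
```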

Modeling Challenges

  • Ensemble and stacking methods initially yielded higher accuracy on the test set, but as we added features and refined the parameters of each individual model, a single model eventually achieved a better test-set score.
  • Hyperparameter tuning and training for several of the models was computationally expensive. While we were able to enable GPU acceleration for the XGBoost model, activating the GPU accelerator seemed to increase the tuning and training time for the other models in the training notebook.
  • Optuna reduced the time needed to process hyperparameter trials, but some of the hyperparameters it identified yielded weaker models than those found through GridSearchCV; a balance between the two was needed.
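For context, a minimal sketch of how an Optuna study and a GridSearchCV run can sit side by side (the model choice, search space, and X_train/y_train are illustrative assumptions, not the project's exact configuration):

```python
import optuna
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

def objective(trial):
    # Illustrative search space; the notebooks tune more parameters than this.
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
        "max_depth": trial.suggest_int("max_depth", 5, 50),
        "min_samples_split": trial.suggest_int("min_samples_split", 2, 10),
    }
    model = ExtraTreesClassifier(random_state=42, **params)
    return cross_val_score(model, X_train, y_train, cv=5).mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print("Optuna best params:", study.best_params)

# GridSearchCV over a small explicit grid, for comparison.
grid = GridSearchCV(
    ExtraTreesClassifier(random_state=42),
    {"n_estimators": [300, 500], "max_depth": [20, 40]},
    cv=5,
)
grid.fit(X_train, y_train)
print("GridSearchCV best params:", grid.best_params_)
```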

Summary of Modeling Techniques

We used several modeling techniques for this project. We began by training simple, standard models and applying the predictions to the test set. This resulted in models with only 50%-60% accuracy, necessitating more complex methods. The following process was used to develop the final model:

  • Scaling the training data to perform PCA and identify the most important features (see the Feature_Engineering Notebook for more detail).
  • Preprocessing the training data to add in new features.
  • Performing GridSearchCV and using the Optuna approach (see the ModelParams Notebook for more detail) for identifying optimal parameters for the following models with corresponding training set accuracy scores:
    • Logistic Regression (.7126)
    • Decision Tree (.9808)
    • Random Forest (1.0)
    • Extra Tree Classifier (1.0)
    • Gradient Boosting Classifier (1.0)
    • Extreme Gradient Boosting Classifier (using GPU acceleration; 1.0)
    • AdaBoost Classifier (.5123)
    • Light Gradient Boosting Classifier (.8923)
    • Ensemble/Voting Classifiers (assorted combinations of the above models; 1.0)
  • Saving and exporting the preprocessor/scaler and each version of the model with the highest accuracy on the training set and highest cross-validation score (see the Training notebook for more detail; a minimal sketch of this pipeline follows the list).
  • Calculating each model's predictions for the test set and submitting to determine accuracy on the test set:
    • Logistic Regression (.6020)
    • Decision Tree (.7102)
    • Random Forest (.7465)
    • Extra Tree Classifier (.7962)
    • Gradient Boosting Classifier (.7905)
    • Extreme Gradient Boosting Classifier (using GPU acceleration; .7803)
    • AdaBoost Classifier (.1583)
    • Light Gradient Boosting Classifier (.6891)
    • Ensemble/Voting Classifier (assorted combinations of the above models; .7952)
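As a minimal sketch of the scale-train-export-predict pipeline above (assuming X_train/y_train with engineered features already added; the estimators and file names are illustrative, not the project's exact configuration):

```python
import joblib
import pandas as pd
from sklearn.ensemble import (
    ExtraTreesClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.preprocessing import StandardScaler

# Scale the training data (also a prerequisite for the PCA step).
scaler = StandardScaler().fit(X_train)
X_scaled = scaler.transform(X_train)

# One example combination: a soft-voting ensemble of two tree models.
ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=42)),
        ("et", ExtraTreesClassifier(random_state=42)),
    ],
    voting="soft",
).fit(X_scaled, y_train)

# Export the scaler and fitted model for reuse in the submission notebook.
joblib.dump(scaler, "scaler.joblib")
joblib.dump(ensemble, "ensemble.joblib")

# Predict the test set and write Kaggle's submission format
# (assumes the test features were engineered the same way as X_train).
test = pd.read_csv("test.csv")  # assumed path
preds = ensemble.predict(scaler.transform(test.drop(columns=["Id"])))
pd.DataFrame({"Id": test["Id"], "Cover_Type": preds}).to_csv(
    "submission.csv", index=False
)
```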

Summary of Final Results

The model with the highest accuracy on the out-of-sample (test set) data was selected as our final model. Notably, the model with the highest accuracy according to 10-fold cross-validation was not the most accurate model on the out-of-sample data (although it was close). The best model was the Extra Tree Classifier, with an accuracy of .7962 on the test set. The Extra Trees model outperformed our Ensemble model (.7952), which had been our best model for several weeks. See the Submission Notebook and FinalModelEvaluation Notebook for additional detail.

Owner
Marianne Joy Leano
A recent graduate with a Master's in Data Science. Excited to explore data and create projects!