Mining the Stack Overflow Developer Survey

Last update: Nov 16, 2021

Overview

A prototype data mining application that compares the accuracy of decision tree and random forest regression models for predicting the annual compensation of tech workers in the US and Europe.

Objectives

Usage

To run, download the repository and execute main.py in the src directory with the Python interpreter on your path. For example, python3 main.py.

Dependencies

Python 3.8.1 and up
pandas 1.3.4 and up
matplotlib 3.4.3 and up
numpy 1.21.0 and up
scikit-learn (sklearn) 1.0.1 and up

Methodology

Preprocessing

The original data set provided by Stack Overflow contained 48 attribute columns and 83439 data records. Because the data set is large, we narrowed our focus to a subset of it. We discarded any record whose respondent was not employed full-time in the technology industry, and any record that did not contain country, converted annual salary, or years coded, as this data is vital to our model. We also discarded some open-ended columns from the original data set.

We exported the records that fit these requirements to two output CSV files: one for United States records and one for records from European countries. Data from all other countries was discarded.

Once we had the two cleaned files, we applied additional preprocessing. Any remaining missing values in nominal attributes were replaced with 'NA'. Two special cases existed in the columns for years coded and years coded professionally: most entries contained a numerical value, but some contained the string 'Less than one year' or 'More than 50 years'. These strings were replaced with 0 and 50, respectively, to keep the columns numerical. With these steps complete, the data files are ready to be used to generate the models.
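The cleaning steps above could be sketched with pandas roughly as follows. This is a minimal illustration, not the project's actual code: the column names (Country, ConvertedCompYearly, YearsCode, YearsCodePro), the exact label strings, the US country name, and the preprocess helper are all assumptions, and the full-time employment filter and column pruning are omitted for brevity.

```python
import pandas as pd

# Assumed column names and labels; adjust them to match the survey file.
REQUIRED = ["Country", "ConvertedCompYearly", "YearsCode"]
YEAR_COLS = ["YearsCode", "YearsCodePro"]
YEAR_TEXT = {"Less than 1 year": 0, "Less than one year": 0, "More than 50 years": 50}
US_NAME = "United States of America"


def preprocess(df, european_countries):
    """Return (us, europe) frames cleaned along the lines described above."""
    # Drop records that are missing any of the required fields.
    out = df.dropna(subset=REQUIRED).copy()
    # Map the two text answers to numbers, then coerce the columns to numeric.
    for col in YEAR_COLS:
        mapped = out[col].map(lambda v: YEAR_TEXT.get(v, v))
        out[col] = pd.to_numeric(mapped, errors="coerce")
    # Fill remaining gaps in non-numeric (nominal) columns with the string 'NA'.
    nominal = out.select_dtypes(exclude="number").columns
    out[nominal] = out[nominal].fillna("NA")
    # Split into US and European records; everything else is dropped.
    us = out[out["Country"] == US_NAME]
    europe = out[out["Country"].isin(european_countries)]
    return us, europe


# Tiny made-up frame for illustration only.
raw = pd.DataFrame({
    "Country": ["United States of America", "Germany", "Brazil", None],
    "ConvertedCompYearly": [100000.0, 60000.0, 30000.0, 50000.0],
    "YearsCode": ["Less than 1 year", "More than 50 years", "5", "3"],
    "YearsCodePro": ["2", None, "1", "1"],
    "EdLevel": [None, "Bachelor", "Master", "PhD"],
})
us, europe = preprocess(raw, {"Germany"})
```

Each returned frame could then be written out with DataFrame.to_csv to produce the two files described above.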

Models

We evaluated a variety of data mining models and algorithms to find the ones that made the most sense for our data set and objectives. Because our goal is to predict a numerical value for annual salary, we needed a regression model. We chose the regression variants of decision trees and random forests so that we could compare the accuracy of a single decision tree with that of a random forest, which combines a number of trees. The results are detailed in the results and analysis section. Below are the implementation details of each model.

Decision tree model

We selected the DecisionTreeRegressor model from the scikit-learn machine learning package. To get the most accurate model, we trained several models with different parameters and selected the one with the highest accuracy to validate. The parameter we varied was the maximum depth of the tree. Additional factors that affect the model are the testing split percentage and the number of cross-validation folds. For our models, we held out 20% of the data for testing, trained on the remaining 80%, and used 10-fold cross-validation. Out of every combination we tried, we found that a maximum depth of ADD RES HERE resulted in the most accurate decision tree model. The accuracy of the model was ADD RES HERE. This model will output the tree itself, several statistics of the model such as R-squared, mean absolute error, and mean squared error, and the ten attributes that have the largest weight in determining the result. With the best model selected, we then validated it against the testing data set. These steps of model generation were done for both the US data and the European data.
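The procedure described above (80/20 split, 10-fold cross-validation over candidate depths, then validation on the held-out data) might look roughly like the sketch below. It runs on synthetic data from make_regression rather than the survey files, and the candidate depths, the R-squared scoring choice, and the random seeds are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the encoded survey features and salaries.
X, y = make_regression(n_samples=300, n_features=12, noise=10.0, random_state=0)

# Hold out 20% for testing; train on the remaining 80%.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Try several maximum depths with 10-fold cross-validation on the training set.
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": [2, 4, 6, 8]},  # assumed candidate depths
    cv=10,
    scoring="r2",
)
search.fit(X_train, y_train)

# Validate the best tree against the held-out test data.
best_tree = search.best_estimator_
pred = best_tree.predict(X_test)
metrics = {
    "r2": r2_score(y_test, pred),
    "mae": mean_absolute_error(y_test, pred),
    "mse": mean_squared_error(y_test, pred),
}

# Indices of the ten features with the largest importance, highest first.
top10 = np.argsort(best_tree.feature_importances_)[::-1][:10]
print(search.best_params_, metrics, top10)
```

With real survey data, nominal columns would first need to be encoded numerically, since DecisionTreeRegressor expects numeric input.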

Random forest model

We selected the RandomForestRegressor model from the scikit-learn machine learning package. To get the most accurate model, we trained several models with different parameters and selected the one with the highest accuracy to validate. The parameters we varied were the number of trees in the forest and the maximum depth of each tree. Additional factors that affect the model are the testing split percentage and the number of cross-validation folds. For our models, we held out 20% of the data for testing, trained on the remaining 80%, and used 10-fold cross-validation. Out of every combination we tried, we found that ADD RES HERE trees in the forest with a maximum depth of ADD RES HERE resulted in the most accurate random forest model. The accuracy of the model was ADD RES HERE. This model will output the tree itself, several statistics of the model such as R-squared, mean absolute error, and mean squared error, and the ten attributes that have the largest weight in determining the result. With the best model selected, we then validated it against the testing data set. These steps of model generation were done for both the US data and the European data.
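A two-parameter search of this kind could be sketched as below. Again this uses synthetic data, and the grid values, scoring metric, and seeds are illustrative assumptions rather than the settings used in the project.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the encoded survey features and salaries.
X, y = make_regression(n_samples=200, n_features=12, noise=10.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# Every combination of tree count and maximum depth is scored with 10-fold CV.
param_grid = {"n_estimators": [10, 25], "max_depth": [3, 6]}  # assumed values
search = GridSearchCV(
    RandomForestRegressor(random_state=1),
    param_grid,
    cv=10,
    scoring="r2",
)
search.fit(X_train, y_train)

# Validate the best forest on the held-out 20%.
forest = search.best_estimator_
test_r2 = forest.score(X_test, y_test)  # R-squared on the test set
print(search.best_params_, round(test_r2, 3))
```

The fitted forest exposes its individual trees through forest.estimators_ and its per-feature weights through forest.feature_importances_, which is one way the top ten attributes could be ranked.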

Results and Analysis

Authors

Andrew Kraynak (LinkedIn, Github)
Samuel Kaczynski (LinkedIn, Github)
