Driver Analysis with Factors and Forests: An Automated Data Science Tool using Python 📊

Last updated on January 30, 2022 by Thomas J. Nicoletti

I would like to preface this document by stating this is my second major project using Python. From my first project to now, I have certainly improved my understanding of and proficiency with Python, though I still have a long journey ahead of me. I aim to keep learning more every day, and I hope this project provides some benefit to the greater applied social science community.

The purpose of this data mining script is to use random forest classification, in conjunction with factor analysis and other analytic techniques, to automatically yield feature importance metrics and related output for a driver analysis. Driver analysis quantifies the importance of independent variables (i.e., drivers) in predicting some outcome variable. Within this repository is a basic, simulated dataset created by me, containing five independent variables and one outcome variable. I am by no means an expert in simulating datasets, so I encourage everyone to use real-world data as a stress test for this statistical tool.

This tool communicates with users through simple inputs via the Command Prompt. Once all mandatory and optional inputs are received, the analysis runs and writes relevant output to the source folder; this potentially includes text files, images, and data files useful for model comprehension and validation, as well as statistically- and conceptually-informed decision-making. The most useful outputs are the automatically generated feature importance plot and feature quadrant chart.

💻 Installation and Preparation

Please note that excerpts of code provided below are examples based on the driver.py script. As a self-taught programmer, I suggest reading through my insights, mixing them with a quick Google search and your own experiences, and then delving into the script itself.

For this project, I used Python 3.9, the Microsoft Windows operating system, and Microsoft Excel. As such, these act as the prerequisites for utilizing this repository successfully without any additional troubleshooting. Going forward, please ensure everything you download or install for this project ends up in the correct location (e.g., the same source folder).

Use pip to install relevant packages to the proper source folder using the Command Prompt and correct PATH. For example:

pip install numpy
pip install pandas

Please be sure to install each of the following packages: easygui, matplotlib, numpy, pandas, seaborn, factor_analyzer, scipy, scikit-learn (imported as sklearn), and statsmodels; the string module ships with Python's standard library and needs no installation. If required, use the first section of the script to identify any missing dependencies, and proceed accordingly.
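
As a rough sketch of that kind of dependency check (the package list below simply mirrors the list above and is an assumption, not the exact code in driver.py):

import importlib.util

# Sketch only: report anything from the assumed dependency list that is not importable.
required = ['easygui', 'matplotlib', 'numpy', 'pandas', 'seaborn',
            'factor_analyzer', 'scipy', 'sklearn', 'statsmodels']
missing = [pkg for pkg in required if importlib.util.find_spec(pkg) is None]
if missing:
    print('Please install the following packages before running the script: ' + ', '.join(missing))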

📑 Script Breakdown

The script begins by importing the relevant Python libraries and defining a Mahalanobis distance function, which is used to identify multivariate outliers in a later step. Additionally, the Command Prompt will display a simple set of instructions for the user, including important information regarding categorical features, the location of the outcome variable within the dataset, and the required handling of missing data. Furthermore, the script allows the user to specify a random seed so the driver analysis can be replicated at a later date:

import easygui
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
...
def mahalanobis(x = None, data = None, cov = None):
    # distance of each observation from the multivariate centroid of the data
    mu = x - np.mean(data)
    ...
    return mah.diagonal()
...
seed = int(input('Please enter your numerical random seed for replication purposes: '))
np.random.seed(seed)
text = open('random_seed.txt', 'w')
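
For readers curious about the elided body of that function, a common way to compute the distances looks like the sketch below; this is a standard formulation and may not match driver.py line for line.

def mahalanobis_sketch(x, data):
    # Sketch only: standard inverse-covariance computation, assumed rather than copied from driver.py.
    mu = x - np.mean(data)              # center each observation on the column means
    cov = np.cov(data.values.T)         # covariance matrix of the features
    inv_cov = np.linalg.inv(cov)        # inverse covariance matrix
    left = np.dot(mu, inv_cov)
    mah = np.dot(left, mu.T)            # squared distances sit on the diagonal
    return mah.diagonal()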

The script has an entire section dedicated to understanding your dataset, including a quick process for uploading your data file, removing missing data, adding an outlier status variable, determining the final sample size, classifying variables, and so on:

df = pd.read_csv(easygui.fileopenbox())    # prompt the user to select the data file
df.dropna(inplace = True)                  # listwise deletion of missing data
df['Mahalanobis'] = mahalanobis(x = df, data = df.iloc[:, :len(df.columns)], cov = None)
df['PValue'] = 1 - chi2.cdf(df['Mahalanobis'], len(df.columns) - 1)
...
n = df.shape[0]                            # final sample size after cleaning
text = open('sample_size.txt', 'w')
...
x = df.iloc[:, :-1]                        # independent variables (drivers)
y = np.ravel(df.iloc[:, -1])               # outcome variable (last column)
feat = df.columns[:-1]
mean = np.array(df.describe().loc['mean'][:-1])    # observed means used later in the quadrant chart
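
The outlier status variable mentioned above is elided from the excerpt; one way to derive it from the chi-square p-values is sketched below, with the 0.001 cutoff chosen for illustration rather than taken from driver.py.

# Sketch only: flag multivariate outliers; the alpha threshold here is an assumption.
alpha = 0.001
df['Outlier'] = np.where(df['PValue'] < alpha, 'Outlier', 'Typical')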

The script then checks the statistical assumptions used to judge whether your dataset is appropriate for factor analysis, namely Bartlett's Test of Sphericity and the Kaiser-Meyer-Olkin (KMO) Test. Additionally, a scree plot is produced using principal component analysis to assist in factor analysis decision-making. Once all of this is reviewed, the user provides the relevant inputs for their driver analysis model:

bart = calculate_bartlett_sphericity(x)
bart = (str(round(bart[1], 2)))            # p-value from Bartlett's test
text = open('sphericity.txt', 'w')
...
kmo = calculate_kmo(x)
kmo = (str(round(kmo[1], 2)))              # overall KMO statistic
text = open('factorability.txt', 'w')
...
pca = PCA()
pca.fit(x)
comp = np.arange(pca.n_components_)
plt.figure()
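
The plotting calls for the scree plot are elided above; a minimal sketch of how the explained variance might be graphed from the fitted pca object (the styling and file name here are assumptions, not taken from driver.py) is:

# Sketch only: one point per principal component.
plt.plot(comp + 1, pca.explained_variance_ratio_, marker = 'o')
plt.title('Scree Plot')
plt.xlabel('Component Number')
plt.ylabel('Proportion of Variance Explained')
plt.savefig('scree_plot.png', dpi = 300, bbox_inches = 'tight')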

When it comes to choosing whether to run random forest classification on the original variables or on transformed factors, the above information is critical. The user decides A) whether or not to use factor analysis, and B) how many factors should be extracted if applicable. Additionally, if the user opts for the factor analysis route, they can choose whether all of the factors or only the highest-loading variable per factor should be used (please see lines 139-150 in the script; a sketch of that selection appears after the excerpt below). The following optional factor analysis and mandatory core analyses will run based on user specifications from the previous step:

fa = FactorAnalysis(n_components = factor, max_iter = 3000, rotation = 'varimax')
...
x = fa.transform(x)                        # replace the original drivers with factor scores
...
load = pd.DataFrame(fa.components_.T.round(2), columns = cols, index = feat)
load.to_csv('factor_loadings.csv')
...
vif = pd.Series(variance_inflation_factor(x.values, i) for i in range(x.shape[1]))    # multicollinearity check
vif = pd.DataFrame(np.array(vif.round(2)), columns = ['Variable Inflation Factor'], index = feat)
vif.T.to_csv('variable_inflation_factors.csv')
clf = RandomForestClassifier(n_estimators = 100, criterion = 'gini', max_features = 'auto', bootstrap = True, oob_score = True, class_weight = 'balanced').fit(x, y)
oob = str(round(clf.oob_score_, 2)).ljust(4, '0')      # out-of-bag accuracy
pred = clf.predict_proba(x)
loss = str(round(log_loss(y, pred), 2)).ljust(4, '0')  # cross-entropy of the in-sample predictions
perf = pd.DataFrame({'Out-of-Bag Score': oob, 'Log Loss': loss}, index = ['Estimate'])
perf.to_csv('model_performance.csv')
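
Regarding the highest-loading-variable option mentioned above, a rough sketch of how that selection could work on the loadings table is shown here; the exact logic in lines 139-150 of driver.py may differ.

# Sketch only: keep the variable with the largest absolute loading on each factor.
top_vars = load.abs().idxmax(axis = 0)     # best-loading variable per factor column
x = df[top_vars.unique()]                  # retain only those drivers for the forest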

Please note that rotation options for factor analysis in Python are limited: as of this writing, scikit-learn's FactorAnalysis supports only varimax and quartimax rotation, while the factor_analyzer package offers a few additional methods (e.g., promax and oblimin). If another rotation method is preferred, you could adapt the script accordingly or opt out of the factor analysis route. From these results, the feature importance plot and its respective feature quadrant chart are graphed and saved automatically to the source folder. This is an especially useful and efficient data visualization step for showing which variable(s) are most important in predicting your outcome, and it saves quite a bit of time compared to graphing everything yourself!

imp = clf.feature_importances_
sort = np.argsort(imp)                     # order features from least to most important
plt.figure()
plt.barh(range(len(sort)), imp[sort], color = 'mediumaquamarine', align = 'center')
plt.title('Feature Importance Plot')
plt.xlabel('Derived Importance →')
...
imps = []
score = []
for i, feat in enumerate(imp[sort]):
  imps.append(round(feat / imp[sort].mean() * 100, 0))     # importance rescaled to a mean of 100
for i, feat in enumerate(mean[sort]):
  score.append(round(feat / mean[sort].mean() * 100, 0))   # observed score rescaled to a mean of 100
quad = pd.DataFrame({'Rescaled Observed Score →': score, 'Rescaled Derived Importance →': imps,
  'Feature': x.columns[sort]})
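
The plotting calls for the quadrant chart are elided above; a minimal sketch of one way to graph the quad table (the reference lines at the rescaled mean of 100 and the styling are assumptions, not necessarily what driver.py does) is:

# Sketch only: assumes seaborn was imported as sns alongside the other libraries.
plt.figure()
sns.scatterplot(x = 'Rescaled Observed Score →', y = 'Rescaled Derived Importance →', data = quad)
plt.axvline(100, linestyle = '--', color = 'grey')    # splits lower vs. higher observed scores
plt.axhline(100, linestyle = '--', color = 'grey')    # splits lower vs. higher derived importance
plt.title('Feature Quadrant Chart')
plt.savefig('feature_quadrant_chart.png', dpi = 300, bbox_inches = 'tight')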

To run the script, I suggest using a batch file located in the source folder as follows:

python driver.py
PAUSE

Although the entire script is not reflected in the breakdown above, this information should prove helpful in getting users accustomed to what the script aims to achieve. If any additional information and/or explanations are desired, please do not hesitate to reach out!

📋 Next Steps

Although I feel this project is solid in its current state, one area of improvement would be optimizing the script and making it more Pythonic. I am also quite interested in hearing feedback from users, including their field of practice, which variables they used for their analyses, and how satisfied they were with this statistical tool overall.

💡 Community Contribution

I am always happy to receive feedback, recommendations, and/or requests from anyone, especially new learners. Please click here for information about the license for this project.

Project Support

Please let me know if you plan to make changes to this project, or adapt the script to a project of your own interest. We can certainly collaborate to make this process as painless as possible!

📚 Additional Resources

  • My current work in market research introduced me to the idea of driver analysis and its usefulness; this statistical tool was created with that space in mind, though it is certainly applicable to all applied areas of business and social science
  • To learn more about calculating random forest classification in Python, click here to access scikit-learn
  • To learn more about calculating factor analysis in Python, click here to access scikit-learn
  • For easy-to-use text editing software, check out Sublime Text for Python and Atom for Markdown