Explorative Data Analysis Guidelines

Overview

Explorative Data Analysis

Get data into a usable format!
Find out if the following predictive modeling phase will be successful!

  • Combine everything into a single big table
    • Convert files to .csv
    • Merge files
    • Fix encoding issues
    • Clean column names (english, no whitespace, no special chars)
    • Are there duplicate columns?
    • Fix datatypes (datetime, int, float, string)
  • Look at the raw data
    • Sort data
    • Filter data by various criteria
  • Investigation
    • Non-sensical observations/artifacts?
    • Coding of categorical features?
    • Missing values?
    • Outliers?
    • Constant values (=Zero Importance)?
    • Low importance features?
    • Collinear, correlated or otherwise dependent features?
    • Highly skewed features?
    • Irrelevant features?
  • Univariate Analysis
    • Look at mean, median, min, max, std, iqr, quantiles (1%, 5%, 25%, 50%, 75%, 95%, 99%)
    • Draw boxplots, histograms
  • Multivariate Analysis
    • Draw scatter plots
    • Create correlation matrix
  • Time Series? -> Plot variables over time
  • Fixing issues
    • Impute missing values (mode, median, mean)
    • Remove variables that have too many missings
    • Remove observations that have too many missings
    • Select appropriate time slice
  • Preparation
    • Clip values that are too small/too large
    • Scale to [0,1] or normalize (mean=0, std=1) or Robust / Quantile Scaling
    • One-hot encoding, Label Encoding (0,1,2,3)
    • Create log-transformed versions for highly skewed variables
    • Create binned versions for variables
    • Combine categories for highly skewed categorical variables
    • Create sum/difference/product/quotient of variables
    • Create polynomial features
Owner
Florian Rohrer
Florian Rohrer
EasyMultiClipboard - Python script written to handle more than 1 string in clipboard

EasyMultiClipboard - Python script written to handle more than 1 string in clipboard

WVlab 1 Jun 18, 2022
A web app builds using streamlit API with python backend to analyze and pick insides from multiple data formats.

Data-Analysis-Web-App Data Analysis Web App can analysis data in multiple formates(csv, txt, xls, xlsx, ods, odt) and gives shows you the analysis in

Kumar Saksham 19 Dec 09, 2022
PyPresent - create slide presentations from notes

PyPresent Create slide presentations from notes Add some formatting to text file

1 Jan 06, 2022
Yet Another MkDocs Parser

yamp Motivation You want to document your project. You make an effort and write docstrings. You try Sphinx. You think it sucks and it's slow -- I did.

Max Halford 10 May 20, 2022
YAML metadata extension for Python-Markdown

YAML metadata extension for Python-Markdown This extension adds YAML meta data handling to markdown with all YAML features. As in the original, metada

Nikita Sivakov 14 Dec 30, 2022
🐱‍🏍 A curated list of awesome things related to Hugo themes.

awesome-hugo-themes Automated deployment @ 2021-10-12 06:24:07 Asia/Shanghai &sorted=updated Theme Author License GitHub Stars Updated Blonde wamo MIT

13 Dec 12, 2022
Code and pre-trained models for "ReasonBert: Pre-trained to Reason with Distant Supervision", EMNLP'2021

ReasonBERT Code and pre-trained models for ReasonBert: Pre-trained to Reason with Distant Supervision, EMNLP'2021 Pretrained Models The pretrained mod

SunLab-OSU 29 Dec 19, 2022
ReStructuredText and Sphinx bridge to Doxygen

Breathe Packagers: PGP signing key changes for Breathe = v4.23.0. https://github.com/michaeljones/breathe/issues/591 This is an extension to reStruct

Michael Jones 643 Dec 31, 2022
A `:github:` role for Sphinx

sphinx-github-role A github role for Sphinx. Usage Basic usage MyST: :caption: index.md See {github}`astrojuanlu/sphinx-github-role#1`. reStructuredT

Juan Luis Cano Rodríguez 4 Nov 22, 2022
Modified fork of CPython's ast module that parses `# type:` comments

Typed AST typed_ast is a Python 3 package that provides a Python 2.7 and Python 3 parser similar to the standard ast library. Unlike ast up to Python

Python 217 Dec 06, 2022
A Python package develop for transportation spatio-temporal big data processing, analysis and visualization.

English 中文版 TransBigData Introduction TransBigData is a Python package developed for transportation spatio-temporal big data processing, analysis and

Qing Yu 251 Jan 03, 2023
A Python library for setting up projects using tabular data.

A Python library for setting up projects using tabular data. It can create project folders, standardize delimiters, and convert files to CSV from either individual files or a directory.

0 Dec 13, 2022
Explain yourself! Interrogate a codebase for docstring coverage.

interrogate: explain yourself Interrogate a codebase for docstring coverage. Why Do I Need This? interrogate checks your code base for missing docstri

Lynn Root 435 Dec 29, 2022
SCTYMN is a GitHub repository that includes some simple scripts(currently only python scripts) that can be useful.

Simple Codes That You Might Need SCTYMN is a GitHub repository that includes some simple scripts(currently only python scripts) that can be useful. In

CodeWriter21 2 Jan 21, 2022
A document format conversion service based on Pandoc.

reformed Document format conversion service based on Pandoc. Usage The API specification for the Reformed server is as follows: GET /api/v1/formats: L

David Lougheed 3 Jul 18, 2022
The mitosheet package, trymito.io, and other public Mito code.

Mito Monorepo Mito is a spreadsheet that lives inside your JupyterLab notebooks. It allows you to edit Pandas dataframes like an Excel file, and gener

Mito 1.4k Dec 31, 2022
Read write method - Read files in various types of formats

一个关于所有格式文件读取的方法 1。 问题描述: 各种各样的文件格式,读写操作非常的麻烦,能够有一种方法,可以整合所有格式的文件,方便用户进行读取和写入。 2

2 Jan 26, 2022
Seamlessly integrate pydantic models in your Sphinx documentation.

Seamlessly integrate pydantic models in your Sphinx documentation.

Franz Wöllert 71 Dec 26, 2022
Version bêta d'un système pour suivre les prix des livres chez Books to Scrape,

Version bêta d'un système pour suivre les prix des livres chez Books to Scrape, un revendeur de livres en ligne. En pratique, dans cette version bêta, le programme n'effectuera pas une véritable surv

Mouhamed Dia 1 Jan 06, 2022
xeuledoc - Fetch information about a public Google document.

xeuledoc - Fetch information about a public Google document.

Malfrats Industries 651 Dec 27, 2022