Tools for exploratory data analysis in Python

Overview

Dora

Exploratory data analysis toolkit for Python.

Contents

Summary

Dora is a Python library designed to automate the painful parts of exploratory data analysis.

The library contains convenience functions for data cleaning, feature selection & extraction, visualization, partitioning data for model validation, and versioning transformations of data.

The library uses and is intended to be a helpful addition to common Python data analysis tools such as pandas, scikit-learn, and matplotlib.

Setup

$ pip3 install Dora
$ python3
>>> from Dora import Dora

Usage

Reading Data & Configuration

# without initial config
>>> dora = Dora()
>>> dora.configure(output = 'A', data = 'path/to/data.csv')

# is the same as
>>> import pandas as pd
>>> dataframe = pd.read_csv('path/to/data.csv')
>>> dora = Dora(output = 'A', data = dataframe)

>>> dora.data
   A   B  C      D  useless_feature
0  1   2  0   left                1
1  4 NaN  1  right                1
2  7   8  2   left                1

Cleaning

# read data with missing and poorly scaled values
>>> import pandas as pd
>>> df = pd.DataFrame([
...   [1, 2, 100],
...   [2, None, 200],
...   [1, 6, None]
... ])
>>> dora = Dora(output = 0, data = df)
>>> dora.data
   0   1    2
0  1   2  100
1  2 NaN  200
2  1   6  NaN

# impute the missing values (using the average of each column)
>>> dora.impute_missing_values()
>>> dora.data
   0  1    2
0  1  2  100
1  2  4  200
2  1  6  150

# scale the values of the input variables (center to mean and scale to unit variance)
>>> dora.scale_input_values()
>>> dora.data
   0         1         2
0  1 -1.224745 -1.224745
1  2  0.000000  1.224745
2  1  1.224745  0.000000

Feature Selection & Extraction

# feature selection / removing a feature
>>> dora.data
   A   B  C      D  useless_feature
0  1   2  0   left                1
1  4 NaN  1  right                1
2  7   8  2   left                1

>>> dora.remove_feature('useless_feature')
>>> dora.data
   A   B  C      D
0  1   2  0   left
1  4 NaN  1  right
2  7   8  2   left

# extract an ordinal feature through one-hot encoding
>>> dora.extract_ordinal_feature('D')
>>> dora.data
   A   B  C  D=left  D=right
0  1   2  0       1        0
1  4 NaN  1       0        1
2  7   8  2       1        0

# extract a transformation of another feature
>>> dora.extract_feature('C', 'twoC', lambda x: x * 2)
>>> dora.data
   A   B  C  D=left  D=right  twoC
0  1   2  0       1        0     0
1  4 NaN  1       0        1     2
2  7   8  2       1        0     4

Visualization

# plot a single feature against the output variable
dora.plot_feature('column-name')

# render plots of each feature against the output variable
dora.explore()

Model Validation

# create random partition of training / validation data (~ 80/20 split)
dora.set_training_and_validation()

# train a model on the data
X = dora.training_data[dora.input_columns()]
y = dora.training_data[dora.output]

some_model.fit(X, y)

# validate the model
X = dora.validation_data[dora.input_columns()]
y = dora.validation_data[dora.output]

some_model.score(X, y)

Data Versioning

# save a version of your data
>>> dora.data
   A   B  C      D  useless_feature
0  1   2  0   left                1
1  4 NaN  1  right                1
2  7   8  2   left                1
>>> dora.snapshot('initial_data')

# keep track of changes to data
>>> dora.remove_feature('useless_feature')
>>> dora.extract_ordinal_feature('D')
>>> dora.impute_missing_values()
>>> dora.scale_input_values()
>>> dora.data
   A         B         C    D=left   D=right
0  1 -1.224745 -1.224745  0.707107 -0.707107
1  4  0.000000  0.000000 -1.414214  1.414214
2  7  1.224745  1.224745  0.707107 -0.707107

>>> dora.logs
["self.remove_feature('useless_feature')", "self.extract_ordinal_feature('D')", 'self.impute_missing_values()', 'self.scale_input_values()']

# use a previous version of the data
>>> dora.snapshot('transform1')
>>> dora.use_snapshot('initial_data')
>>> dora.data
   A   B  C      D  useless_feature
0  1   2  0   left                1
1  4 NaN  1  right                1
2  7   8  2   left                1
>>> dora.logs
[]

# switch back to your transformation
>>> dora.use_snapshot('transform1')
>>> dora.data
   A         B         C    D=left   D=right
0  1 -1.224745 -1.224745  0.707107 -0.707107
1  4  0.000000  0.000000 -1.414214  1.414214
2  7  1.224745  1.224745  0.707107 -0.707107
>>> dora.logs
["self.remove_feature('useless_feature')", "self.extract_ordinal_feature('D')", 'self.impute_missing_values()', 'self.scale_input_values()']

Testing

To run the test suite, simply run python3 spec.py from the Dora directory.

Contribute

Pull requests welcome! Feature requests / bugs will be addressed through issues on this repository. While not every feature request will necessarily be handled by me, maintaining a record for interested contributors is useful.

Additionally, feel free to submit pull requests which add features or address bugs yourself.

License

The MIT License (MIT)

Copyright (c) 2016 Nathan Epstein

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Owner
Nathan Epstein
Nathan Epstein
mysql relation charts

sqlcharts 自动生成数据库关联关系图 复制settings.py.example 重命名为settings.py 将数据库配置信息填入settings.DATABASE,目前支持mysql和postgresql 执行 python build.py -b,-b是读取数据库表结构,如果只更新匹

6 Aug 22, 2022
kyle's vision of how datadog's python client should look

kyle's datadog python vision/proposal not for production use See examples/comprehensive.py for a mostly working example of the proposed API. 📈 🐶 ❤️

Kyle Verhoog 2 Nov 21, 2021
Create matplotlib visualizations from the command-line

MatplotCLI Create matplotlib visualizations from the command-line MatplotCLI is a simple utility to quickly create plots from the command-line, levera

Daniel Moura 46 Dec 16, 2022
A tool for automatically generating 3D printable STLs from freely available lidar scan data.

mini-map-maker A tool for automatically generating 3D printable STLs from freely available lidar scan data. Screenshots Tutorial To use this script, g

Mike Abbott 51 Nov 06, 2022
A curated list of awesome Dash (plotly) resources

Awesome Dash A curated list of awesome Dash (plotly) resources Dash is a productive Python framework for building web applications. Written on top of

Luke Singham 1.7k Jan 07, 2023
JupyterHub extension for ContainDS Dashboards

ContainDS Dashboards for JupyterHub A Dashboard publishing solution for Data Science teams to share results with decision makers. Run a private on-pre

Ideonate 179 Nov 29, 2022
A tool for creating Toontown-style nametags in Panda3D

Toontown-Nametag Toontown-Nametag is a tool for creating Toontown Online/Toontown Rewritten-style nametags in Panda3D. It contains a function, createN

BoggoTV 2 Dec 23, 2021
Browse Dash docsets inside emacs

Helm Dash What's it This package uses Dash docsets inside emacs to browse documentation. Here's an article explaining the basic usage of it. It doesn'

504 Dec 15, 2022
The repository is my code for various types of data visualization cases based on the Matplotlib library.

ScienceGallery The repository is my code for various types of data visualization cases based on the Matplotlib library. It summarizes the code and cas

Warrick Xu 2 Apr 20, 2022
Lightweight, extensible data validation library for Python

Cerberus Cerberus is a lightweight and extensible data validation library for Python. v = Validator({'name': {'type': 'string'}}) v.validate({

eve 2.9k Dec 27, 2022
Automatically Visualize any dataset, any size with a single line of code. Created by Ram Seshadri. Collaborators Welcome. Permission Granted upon Request.

AutoViz Automatically Visualize any dataset, any size with a single line of code. AutoViz performs automatic visualization of any dataset with one lin

AutoViz and Auto_ViML 1k Jan 02, 2023
script to generate HeN ipfs app exports of GLSL shaders

HeNerator A simple script to generate HeN ipfs app exports from any frag shader created with: GlslViewer GlslEditor The Book of Shaders glslCanvas VS

Patricio Gonzalez Vivo 22 Dec 21, 2022
:bowtie: Create a dashboard with python!

Installation | Documentation | Gitter Chat | Google Group Bowtie Introduction Bowtie is a library for writing dashboards in Python. No need to know we

Jacques Kvam 753 Dec 22, 2022
Color scales in Python for humans

colorlover Color scales for humans IPython notebook: https://plot.ly/ipython-notebooks/color-scales/ import colorlover as cl from IPython.display impo

Plotly 146 Sep 25, 2022
Open-source demos hosted on Dash Gallery

Dash Sample Apps This repository hosts the code for over 100 open-source Dash apps written in Python or R. They can serve as a starting point for your

Plotly 2.7k Jan 07, 2023
Blender addon that creates a temporary window of any type from the 3D View.

CreateTempWindow2.8 Blender addon that creates a temporary window of any type from the 3D View. Features Can the following window types: 3D View Graph

3 Nov 27, 2022
GD-UltraHack - A Mod Menu for Geometry Dash. Specifically a MegahackV5 clone in Python. Only for Windows

GD UltraHack: The Mod Menu that Nobody asked for. This is a mod menu for the gam

zeo 1 Jan 05, 2022
Pydrawer: The Python package for visualizing curves and linear transformations in a super simple way

pydrawer 📐 The Python package for visualizing curves and linear transformations in a super simple way. ✏️ Installation Install pydrawer package with

Dylan Tintenfich 56 Dec 30, 2022
Jupyter notebook and datasets from the pandas Q&A video series

Python pandas Q&A video series Read about the series, and view all of the videos on one page: Easier data analysis in Python with pandas. Jupyter Note

Kevin Markham 2k Jan 05, 2023
Create charts with Python in a very similar way to creating charts using Chart.js

Create charts with Python in a very similar way to creating charts using Chart.js. The charts created are fully configurable, interactive and modular and are displayed directly in the output of the t

Nicolas H 68 Dec 08, 2022