An extension to pandas dataframes describe function.

Last update: Dec 30, 2022

Overview

pandas_summary

An extension to pandas dataframes describe function.

The module contains DataFrameSummary object that extend describe() with:

properties
- dfs.columns_stats: counts, uniques, missing, missing_perc, and type per column
- dsf.columns_types: a count of the types of columns
- dfs[column]: more in depth summary of the column
function
- summary(): extends the describe() function with the values with columns_stats

Installation

The module can be easily installed with pip:

> pip install pandas-summary

This module depends on numpy and pandas. Optionally you can get also some nice visualisations if you have matplotlib installed.

Tests

To run the tests, execute the command python setup.py test

Usage

The module contains one class:

DataFrameSummary

The DataFrameSummary expect a pandas DataFrame to summarise.

from pandas_summary import DataFrameSummary

dfs = DataFrameSummary(df)

getting the columns types

dfs.columns_types


numeric     9
bool        3
categorical 2
unique      1
date        1
constant    1
dtype: int64

getting the columns stats

dfs.columns_stats


                      A            B        C              D              E 
counts             5802         5794     5781           5781           4617   
uniques            5802            3     5771            128            121   
missing               0            8       21             21           1185   
missing_perc         0%        0.14%    0.36%          0.36%         20.42%   
types            unique  categorical  numeric        numeric        numeric

getting a single column summary, e.g. numerical column

# we can also access the column using numbers A[1]
dfs['A']

std                                                                 0.2827146
max                                                                  1.072792
min                                                                         0
variance                                                           0.07992753
mean                                                                0.5548516
5%                                                                  0.1603367
25%                                                                 0.3199776
50%                                                                 0.4968588
75%                                                                 0.8274732
95%                                                                  1.011255
iqr                                                                 0.5074956
kurtosis                                                            -1.208469
skewness                                                            0.2679559
sum                                                                  3207.597
mad                                                                 0.2459508
cv                                                                  0.5095319
zeros_num                                                                  11
zeros_perc                                                               0,1%
deviating_of_mean                                                          21
deviating_of_mean_perc                                                  0.36%
deviating_of_median                                                        21
deviating_of_median_perc                                                0.36%
top_correlations                         {u'D': 0.702240243124, u'E': -0.663}
counts                                                                   5781
uniques                                                                  5771
missing                                                                    21
missing_perc                                                            0.36%
types                                                                 numeric
Name: A, dtype: object

Future development

Summary analysis between columns, i.e. dfs[[1, 2]]

An extension to pandas dataframes describe function.

Related tags

Overview

pandas_summary

Installation

Tests

Usage

DataFrameSummary

Future development

Owner

Mourad

Lale is a Python library for semi-automated data science.

Python Practicum - prepare for your Data Science interview or get a refresher.

Investigating EV charging data

A Numba-based two-point correlation function calculator using a grid decomposition

Project: Netflix Data Analysis and Visualization with Python

Python reader for Linked Data in HDF5 files

This repo is dedicated to the data extraction and manipulation of the World Bank's database called STEP.

Flexible HDF5 saving/loading and other data science tools from the University of Chicago

The micro-framework to create dataframes from functions.

Data Analytics on Genomes and Genetics

MeSH2Matrix - A set of Python codes for the generation of biomedical ontologies from the MeSH keywords of the PubMed scholarly publications

PyPDC is a Python package for calculating asymptotic Partial Directed Coherence estimations for brain connectivity analysis.

PLStream: A Framework for Fast Polarity Labelling of Massive Data Streams

OpenDrift is a software for modeling the trajectories and fate of objects or substances drifting in the ocean, or even in the atmosphere.

Datashredder is a simple data corruption engine written in python. You can corrupt anything text, images and video.

Driver Analysis with Factors and Forests: An Automated Data Science Tool using Python

Unsub is a collection analysis tool that assists libraries in analyzing their journal subscriptions.

MidTerm Project for the Data Analysis FT Bootcamp, Adam Tycner and Florent ZAHOUI

Developed for analyzing the covariance for OrcVIO

My first Python project is a simple Mad Libs program.