Missing data visualization module for Python.

Overview

Messy datasets? Missing values? missingno provides a small toolset of flexible and easy-to-use missing data visualizations and utilities that allows you to get a quick visual summary of the completeness (or lack thereof) of your dataset. Just pip install missingno to get started.

Quickstart

This quickstart uses a sample of the NYPD Motor Vehicle Collisions Dataset. To get the data yourself, run the following on your command line:

$ pip install quilt
$ quilt install ResidentMario/missingno_data

Then to load the data into memory:

>>> import numpy as np
>>> from quilt.data.ResidentMario import missingno_data
>>> collisions = missingno_data.nyc_collision_factors()
>>> collisions = collisions.replace("nan", np.nan)

The rest of this walkthrough will draw from this collisions dataset. I additionally define nullity to mean whether a particular variable is filled in or not.
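
To make that definition concrete, here is a minimal sketch in plain pandas (not missingno itself). The `raw` frame and its values are made up for illustration. It also shows why the string "nan" has to be replaced first: pandas does not treat it as a missing value until it becomes `np.nan`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration.
raw = pd.DataFrame({
    "BOROUGH": ["QUEENS", "nan", "BRONX"],
    "ZIP CODE": [11101.0, np.nan, np.nan],
})

# Before cleaning, the literal string "nan" counts as a filled-in value.
before = raw.isnull()

# After replacing the string with a real NaN, it counts as missing.
clean = raw.replace("nan", np.nan)
after = clean.isnull()

print(before["BOROUGH"].tolist())
print(after["BOROUGH"].tolist())
```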

Matrix

The msno.matrix nullity matrix is a data-dense display which lets you quickly visually pick out patterns in data completion.

>>> import missingno as msno
>>> %matplotlib inline
>>> msno.matrix(collisions.sample(250))

At a glance, date, time, the distribution of injuries, and the contribution factor of the first vehicle appear to be completely populated, while geographic information seems mostly complete, but spottier.

The sparkline at right summarizes the general shape of the data completeness and points out the rows with the maximum and minimum nullity in the dataset.
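
The numbers the sparkline summarizes can be approximated by counting filled-in values per row. The sketch below assumes a small hypothetical frame (`toy`) and uses only pandas. It illustrates the idea and is not a description of how msno.matrix computes its sparkline.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with different amounts of missing data per row.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, 2.0, np.nan, np.nan],
    "c": [1.0, 2.0, 3.0, np.nan],
})

# Number of filled-in values in each row.
row_completeness = toy.notnull().sum(axis=1)

# The most and least complete rows.
most_complete = row_completeness.max()
least_complete = row_completeness.min()

print(row_completeness.tolist(), most_complete, least_complete)
```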

This visualization will comfortably accommodate up to 50 labelled variables. Past that range labels begin to overlap or become unreadable, and by default large displays omit them.

If you are working with time-series data, you can specify a periodicity using the freq keyword parameter:

>>> import numpy as np
>>> import pandas as pd
>>> null_pattern = (np.random.random(1000).reshape((50, 20)) > 0.5).astype(bool)
>>> null_pattern = pd.DataFrame(null_pattern).replace({False: None})
>>> msno.matrix(null_pattern.set_index(pd.period_range('1/1/2011', '2/1/2015', freq='M')), freq='BQ')

Bar Chart

msno.bar is a simple visualization of nullity by column:

>>> msno.bar(collisions.sample(1000))

You can switch to a logarithmic scale by specifying log=True. bar provides the same information as matrix, but in a simpler format.
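
The quantities behind such a chart can be approximated with plain pandas. The sketch below uses a hypothetical toy frame rather than the collisions data and computes per-column non-null counts and proportions. The `log10` line only illustrates why a logarithmic axis can keep sparsely filled columns visible next to nearly full ones. It is an assumption for illustration, not how msno.bar is implemented.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one full, one half-filled, one sparse column.
toy = pd.DataFrame({
    "full": [1, 2, 3, 4],
    "half": [1.0, np.nan, 3.0, np.nan],
    "sparse": [np.nan, np.nan, np.nan, 4.0],
})

# Non-null count and share of filled-in values for each column.
counts = toy.notnull().sum()
proportions = counts / len(toy)

# Counts on a log10 scale (every column here has at least one value).
log_counts = np.log10(counts)

print(counts.tolist(), proportions.tolist())
```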

Heatmap

The missingno correlation heatmap measures nullity correlation: how strongly the presence or absence of one variable affects the presence of another:

>>> msno.heatmap(collisions)

In this example, it seems that reports which are filed with an OFF STREET NAME variable are less likely to have complete geographic data.

Nullity correlation ranges from -1 (if one variable appears the other definitely does not) to 0 (variables appearing or not appearing have no effect on one another) to 1 (if one variable appears the other definitely also does).

Variables that are always full or always empty have no meaningful correlation, and so are silently removed from the visualization—in this case for instance the datetime and injury number columns, which are completely filled, are not included.
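
As a rough sketch of the idea, nullity correlation can be reproduced in plain pandas by correlating the 0/1 missingness indicators of each column, after dropping columns whose nullity never varies. The `toy` frame and its column names are hypothetical, and this is an approximation of the concept rather than a statement of the library's exact implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names describe their nullity pattern.
toy = pd.DataFrame({
    "always_full": [1, 2, 3, 4],
    "x": [1.0, np.nan, 3.0, np.nan],
    "same_as_x": ["a", None, "c", None],
    "opposite_of_x": [np.nan, 2.0, np.nan, 4.0],
})

# True where a value is missing.
nullity = toy.isnull()

# Columns that are always full or always empty have no variance,
# so their correlation is undefined; drop them.
varying = nullity.loc[:, nullity.nunique() > 1]

# Pearson correlation of the 0/1 missingness indicators.
corr = varying.astype(int).corr()
print(corr)
```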

Entries marked <1 or >-1 have a correlation that is close to being exactly positive or negative, but is still not quite perfectly so. This points to a small number of records in the dataset which are erroneous. For example, in this dataset the correlation between VEHICLE TYPE CODE 3 and CONTRIBUTING FACTOR VEHICLE 3 is <1, indicating that, contrary to our expectation, there are a few records which have one or the other, but not both. These cases will require special attention.
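
One way to find the records behind such a near-perfect correlation is to compare the nullity of the two columns row by row. This is a minimal sketch on invented data. The column names `vehicle_type` and `contributing_factor` are hypothetical stand-ins for a pair of columns that are expected to be filled in together.

```python
import pandas as pd

# Hypothetical toy records; row 2 has a factor but no vehicle type.
toy = pd.DataFrame({
    "vehicle_type": ["SEDAN", None, None, "TAXI", None],
    "contributing_factor": ["Unspecified", None, "Unspecified", "Unspecified", None],
})

type_missing = toy["vehicle_type"].isnull()
factor_missing = toy["contributing_factor"].isnull()

# Rows where exactly one of the two columns is filled in.
mismatched = toy[type_missing != factor_missing]
print(mismatched)
```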

The heatmap works great for picking out data completeness relationships between variable pairs, but its explanatory power is limited when it comes to larger relationships and it has no particular support for extremely large datasets.

Dendrogram

The dendrogram allows you to more fully correlate variable completion, revealing trends deeper than the pairwise ones visible in the correlation heatmap:

>>> msno.dendrogram(collisions)

The dendrogram uses a hierarchical clustering algorithm (courtesy of scipy) to bin variables against one another by their nullity correlation (measured in terms of binary distance). At each step of the tree the variables are split up based on which combination minimizes the distance of the remaining clusters. The more monotone the set of variables, the closer their total distance is to zero, and the closer their average distance (the y-axis) is to zero.
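
The clustering step can be sketched directly with scipy. The example below treats each column's 0/1 nullity pattern as one observation and applies average linkage. The Hamming metric (the fraction of records in which two columns differ in nullity) is an assumption chosen to match the "binary distance" description, and the `toy` frame is hypothetical. The metric and linkage method missingno uses may differ.

```python
import numpy as np
import pandas as pd
from scipy.cluster import hierarchy

# Hypothetical toy frame: "a" and "a_twin" are missing in the same rows.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "a_twin": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "b": [np.nan, 2.0, 3.0, 4.0, np.nan, 6.0],
})

# One row per variable, one 0/1 entry per record (1 = missing).
nullity = toy.isnull().astype(int).to_numpy().T

# Average-linkage clustering; Hamming distance is an assumed metric.
links = hierarchy.linkage(nullity, method="average", metric="hamming")

# Each row of `links` is one merge: (cluster i, cluster j, height, size).
first_merge_height = links[0, 2]
second_merge_height = links[1, 2]
print(first_merge_height, second_merge_height)
```

Here the two identical columns merge at a height of zero, and the third column joins at a height equal to the fraction of records in which its nullity differs from theirs.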

To interpret this graph, read it from a top-down perspective. Cluster leaves which are linked together at a distance of zero fully predict one another's presence: one variable might always be empty when another is filled, or they might always both be filled or both empty, and so on. In this specific example the dendrogram glues together the variables which are required and therefore present in every record.

Cluster leaves which split close to zero, but not at it, predict one another very well, but still imperfectly. If your own interpretation of the dataset is that these columns actually do, or ought to, match each other in nullity (for example, as CONTRIBUTING FACTOR VEHICLE 2 and VEHICLE TYPE CODE 2 ought to), then the height of the cluster leaf tells you, in absolute terms, how often the records are "mismatched" or incorrectly filed, that is, how many values you would have to fill in or drop, if you are so inclined.

As with matrix, only up to 50 labeled columns will comfortably display in this configuration. However, the dendrogram handles extremely large datasets more elegantly by simply flipping to a horizontal configuration.

Configuration

For more advanced configuration details for your plots, refer to the CONFIGURATION.md file in this repository.

Contributing

For thoughts on features or bug reports see Issues. If you're interested in contributing to this library, see details on doing so in the CONTRIBUTING.md file in this repository.

Citation

You may cite this package using the following format (via this paper):

Bilogur, (2018). Missingno: a missing data visualization suite. Journal of Open Source Software, 3(22), 547, https://doi.org/10.21105/joss.00547

Comments
  • Histogram of data completeness by column

    First, great package!

    The data completeness shows the completeness of the data over rows, I'm requesting a way to show the data completeness over the columns. Maybe a sparkline/histogram below the bottom row?

    enhancement 
    opened by jbwhit 9
  • Introduce orientation to bar plot

    The dendrogram has an option to change the orientation from top-down to left-right. This is convenient when there is a large number of columns. The bar plot lacks this option. To make the bar plot effective for large numbers of columns, it should have similar behaviour.

    This pull request adds the basic functionality. The figsize is not yet adjusted; this should be added.

    opened by sbrugman 8
  • option for grouping the columns by similarity?

    Hi, I like the idea of this package very much. Would it be much work to implement automatic grouping of the features (and maybe subjects) based on similarity? That way one could see whether the missing values are random or follow some pattern...

    enhancement 
    opened by EnricoGiampieri 7
  • ValueError: The number of FixedLocator locations does not match the number of ticklabels

    When I plot msno.bar(df_raw.sample(10)), it raises a ValueError: "ValueError: The number of FixedLocator locations (0), usually from a call to set_ticks, does not match the number of ticklabels (186)"

    How can I fix this? Do I need to fix the FixedLocator locations, and if so, how?

    bug 
    opened by ywsyws 6
  • Legend in heatmap

    I did a heatmap on the titanic dataset and it is a little hard to interpret. Some of the cells have numbers in them while others don't. I have no idea what the colors represent as there is no legend to interpret them.

    opened by mattharrison 6
  • subplotting

    What is the easiest way to draw subplots of msno.matrix() for df1, df2, and df3? I already checked this issue, but it didn't work due to TypeError: 'AxesSubplot' object does not support indexing, and I updated pandas and missingno based on this issue. Please share a general solution that works simply by swapping in different DataFrames. Have a nice evening.

    opened by clevilll 5
  • Module not found error

    I keep getting this error message in a Jupyter notebook while running Python 3.6. I stopped and restarted the kernel to check, but the error persists. Any suggestions?


    ModuleNotFoundError                       Traceback (most recent call last)
    in ()
    ----> 1 import missingno as msno
          2 get_ipython().magic('matplotlib inline')
          3 msno.matrix(collisions.sample(250))

    ModuleNotFoundError: No module named 'missingno'

    opened by vhkrish 5
  • Include smaller example data for users to follow along (and for future tests)

    This package is meant to tackle the visualization tasks of large data sets, and the provided examples are fantastic for demonstrating the utter complexity that users may face. I'm especially glad to see that you have posted examples of how you munged the data. This is quite valuable to fair-weather Python users such as myself. 👍

    However, in order to follow along, users must start by downloading all 1M+ rows (and growing!) of the NYPDMVC data set. 😿 My suggestion would be to include a small subset of these data in the package (I believe you can specify the location with package_data in your setup file).

    opened by zkamvar 4
  • Returning matplotlib.figure/axes?

    Hi, for users who want to fiddle around with the produced plot, it would be helpful to return the matplotlib.figure/axis. My use case: I want to give a ylabel to the rows for use in a publication.

    enhancement 
    opened by nipunbatra 4
  • source-file for pypi?

    Hi, the GitHub source file does not compile with `python3 setup.py install`.

    Could you please publish a source file on pypi.org that does? (I am not using the whl file)

    opened by brobr 3
  • error in matrix function at line 117 - caused by line 185

    If the dataset has timestamps starting from the middle of the day or ending in the middle of the day, then trying to divide by Day or Hour causes an error. Sending through a fix.

    opened by armando-fandango 3
  • MatplotlibDeprecationWarning - grid(b=False) -> grid(visible=False)

    Running pytest results in:

    MatplotlibDeprecationWarning: The 'b' parameter of grid() has been renamed 'visible' since Matplotlib 3.5; support for the old name will be dropped two minor releases later. ax0.grid(b=False)

    opened by r-leyshon 0
  • Installation issues in M1 Mac

    I tried installing with both pip and conda; both give the error. Here's the conda installation log (after uninstalling the previous installation with pip):

    (base) [email protected] ~ % conda install -c conda-forge missingno
    Collecting package metadata (current_repodata.json): done
    Solving environment: done
    
    ## Package Plan ##
    
      environment location: /Users/Admin/miniforge3
    
      added / updated specs:
        - missingno
    
    
    The following packages will be UPDATED:
    
      conda              pkgs/main::conda-22.9.0-py310hca03da5~ --> conda-forge::conda-22.9.0-py310hbe9552e_2 None
      missingno          conda-forge/label/cf201901::missingno~ --> conda-forge::missingno-0.4.2-py_1 None
    
    The following packages will be SUPERSEDED by a higher-priority channel:
    
      ca-certificates    pkgs/main::ca-certificates-2022.10.11~ --> conda-forge::ca-certificates-2022.9.24-h4653dfc_0 None
      certifi            pkgs/main/osx-arm64::certifi-2022.9.2~ --> conda-forge/noarch::certifi-2022.9.24-pyhd8ed1ab_0 None
      openssl              pkgs/main::openssl-1.1.1s-h1a28f6b_0 --> conda-forge::openssl-1.1.1s-h03a7124_0 None
    
    
    Proceed ([y]/n)? y
    
    Preparing transaction: done
    Verifying transaction: done
    Executing transaction: done
    Retrieving notices: ...working... done
    

    After successful installation, I tried to import it in Python and got the following error:

    (base) [email protected] ~ % python
    Python 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:41:22) [Clang 13.0.1 ] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import missingno
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Users/Admin/miniforge3/lib/python3.10/site-packages/missingno/__init__.py", line 1, in <module>
        from .missingno import matrix
      File "/Users/Admin/miniforge3/lib/python3.10/site-packages/missingno/missingno.py", line 5, in <module>
        from scipy.cluster import hierarchy
      File "/Users/Admin/miniforge3/lib/python3.10/site-packages/scipy/cluster/__init__.py", line 25, in <module>
        from . import vq, hierarchy
      File "/Users/Admin/miniforge3/lib/python3.10/site-packages/scipy/cluster/vq.py", line 72, in <module>
        from scipy.spatial.distance import cdist
      File "/Users/Admin/miniforge3/lib/python3.10/site-packages/scipy/spatial/__init__.py", line 105, in <module>
        from ._kdtree import *
      File "/Users/Admin/miniforge3/lib/python3.10/site-packages/scipy/spatial/_kdtree.py", line 5, in <module>
        from ._ckdtree import cKDTree, cKDTreeNode
      File "_ckdtree.pyx", line 10, in init scipy.spatial._ckdtree
      File "/Users/Admin/miniforge3/lib/python3.10/site-packages/scipy/sparse/__init__.py", line 283, in <module>
        from . import csgraph
      File "/Users/Admin/miniforge3/lib/python3.10/site-packages/scipy/sparse/csgraph/__init__.py", line 182, in <module>
        from ._laplacian import laplacian
      File "/Users/Admin/miniforge3/lib/python3.10/site-packages/scipy/sparse/csgraph/_laplacian.py", line 7, in <module>
        from scipy.sparse.linalg import LinearOperator
      File "/Users/Admin/miniforge3/lib/python3.10/site-packages/scipy/sparse/linalg/__init__.py", line 120, in <module>
        from ._isolve import *
      File "/Users/Admin/miniforge3/lib/python3.10/site-packages/scipy/sparse/linalg/_isolve/__init__.py", line 4, in <module>
        from .iterative import *
      File "/Users/Admin/miniforge3/lib/python3.10/site-packages/scipy/sparse/linalg/_isolve/iterative.py", line 9, in <module>
        from . import _iterative
    ImportError: dlopen(/Users/Admin/miniforge3/lib/python3.10/site-packages/scipy/sparse/linalg/_isolve/_iterative.cpython-310-darwin.so, 0x0002): Library not loaded: @rpath/liblapack.3.dylib
      Referenced from: <493DBB2C-B84A-3E4F-972C-B015A509EDE6> /Users/Admin/miniforge3/lib/python3.10/site-packages/scipy/sparse/linalg/_isolve/_iterative.cpython-310-darwin.so
      Reason: tried: '/Users/Admin/miniforge3/lib/python3.10/site-packages/scipy/sparse/linalg/_isolve/liblapack.3.dylib' (no such file), '/Users/Admin/miniforge3/lib/python3.10/site-packages/scipy/sparse/linalg/_isolve/../../../../../../liblapack.3.dylib' (no such file), '/Users/Admin/miniforge3/lib/python3.10/site-packages/scipy/sparse/linalg/_isolve/liblapack.3.dylib' (no such file), '/Users/Admin/miniforge3/lib/python3.10/site-packages/scipy/sparse/linalg/_isolve/../../../../../../liblapack.3.dylib' (no such file), '/Users/Admin/miniforge3/bin/../lib/liblapack.3.dylib' (no such file), '/Users/Admin/miniforge3/bin/../lib/liblapack.3.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/[email protected]/liblapack.3.dylib' (no such file), '/Users/Admin/miniforge3/lib/python3.10/site-packages/scipy/sparse/linalg/_isolve/liblapack.3.dylib' (no such file), '/Users/Admin/miniforge3/lib/python3.10/site-packages/scipy/sparse/linalg/_isolve/../../../../../../liblapack.3.dylib' (no such file), '/Users/Admin/miniforge3/lib/python3.10/site-packages/scipy/sparse/linalg/_isolve/liblapack.3.dylib' (no such file), '/Users/Admin/miniforge3/lib/python3.10/site-packages/scipy/sparse/linalg/_isolve/../../../../../../liblapack.3.dylib' (no such file), '/Users/Admin/miniforge3/bin/../lib/liblapack.3.dylib' (no such file), '/Users/Admin/miniforge3/bin/../lib/liblapack.3.dylib' (no such file), '/usr/local/lib/liblapack.3.dylib' (no such file), '/usr/lib/liblapack.3.dylib' (no such file, not in dyld cache)
    >>> 
    
    opened by PrashantSaikia 0
  • Add a title to graphs

    It would be useful to have a parameter to add a title to the graphs. This need comes from using missingno in a for...in loop, where the graphs are shown at the end of the loop, separated from the output they belong to.

    opened by hydrastarmaster 0
  • remove freq option in matrix

    I'm seeing all the issues related to this option. I'm having issues with KeyError('Could not divide time index into desired frequency.'). The solution is to just remove this option. The index is the user's responsibility and the library should take it as is.

    There should be a separate data structuring step, which would be followed by the visualization.

    opened by majidaldo 1
  • Choose strftime format argument in msno.matrix (when using freq argument)

    When I use the matrix function with a frequency finer than a day, the y-axis labels are all the same because their format is fixed to '%Y-%m-%d'. It would be better to expose the format as an optional function argument.

    ts_ticks = pd.date_range(df.index[0], df.index[-1],
                                         freq=freq).map(lambda t:
                                                        t.strftime('%Y-%m-%d'))
    

    would become

    ts_ticks = pd.date_range(df.index[0], df.index[-1],
                                         freq=freq).map(lambda t:
                                                        t.strftime(format))
    
    enhancement 
    opened by Sinnaeve 1
Releases(0.5.1)
  • 0.5.1(Feb 27, 2022)

  • 0.5.0(Jul 4, 2021)

    This is a maintenance release of missingno. The primary user-facing change is that the long-deprecated geoplot method and inline parameter have been removed.

  • 0.4.2(Jul 9, 2019)

    This incremental release adds a minor feature and deprecates certain outdated functionality.

    • An ax parameter has been added to all plot types. Pass a matplotlib Axes object to this parameter to add the plot to a subplots or gridspec object (see #83). Note that the matrix plot does not support the sparkline parameter in this configuration.
    • The behavior of the sort parameter has changed. It has been removed from dendrogram and geoplot, where it did not do anything. bar will now use the sort parameter to order its markers (previously it had no effect on the plot output). The behavior of the sort parameter on matrix is unchanged.
    • The inline parameter has been deprecated and will be removed in a future version of missingno.
    • The geoplot function has been deprecated and will be removed in a future version of missingno. To replicate this functionality, see this recipe in the geoplot package documentation.
  • 0.3.1(Oct 25, 2016)

Owner
Aleksey Bilogur
Building tools for doing data science @spellml. {📊, 💻, 🛠️}. Previously: @quiltdata, @recursecenter, @Kaggle, @MODA-NYC.