Pandas-based utility to calculate weighted means, medians, distributions, standard deviations, and more.

Overview

Version Build status Code coverage Support Python versions

weightedcalcs

weightedcalcs is a pandas-based Python library for calculating weighted means, medians, standard deviations, and more.

Features

  • Plays well with pandas.
  • Support for weighted means, medians, quantiles, standard deviations, and distributions.
  • Support for grouped calculations, using DataFrameGroupBy objects.
  • Raises an error when your data contains null-values.
  • Full test coverage.

Installation

pip install weightedcalcs

Usage

Getting started

Every weighted calculation in weightedcalcs begins with an instance of the weightedcalcs.Calculator class. Calculator takes one argument: the name of your weighting variable. So if you're analyzing a survey where the weighting variable is called "resp_weight", you'd do this:

import weightedcalcs as wc
calc = wc.Calculator("resp_weight")

Types of calculations

Currently, weightedcalcs.Calculator supports the following calculations:

  • calc.mean(my_data, value_var): The weighted arithmetic average of value_var.
  • calc.quantile(my_data, value_var, q): The weighted quantile of value_var, where q is between 0 and 1.
  • calc.median(my_data, value_var): The weighted median of value_var, equivalent to .quantile(...) where q=0.5.
  • calc.std(my_data, value_var): The weighted standard deviation of value_var.
  • calc.distribution(my_data, value_var): The weighted proportions of value_var, interpreting value_var as categories.
  • calc.count(my_data): The weighted count of all observations, i.e., the total weight.
  • calc.sum(my_data, value_var): The weighted sum of value_var.

The obj parameter above should one of the following:

  • A pandas DataFrame object
  • A pandas DataFrame.groupby object
  • A plain Python dictionary where the keys are column names and the values are equal-length lists.

Basic example

Below is a basic example of using weightedcalcs to find what percentage of Wyoming residents are married, divorced, et cetera:

import pandas as pd
import weightedcalcs as wc

# Load the 2015 American Community Survey person-level responses for Wyoming
responses = pd.read_csv("examples/data/acs-2015-pums-wy-simple.csv")

# `PWGTP` is the weighting variable used in the ACS's person-level data
calc = wc.Calculator("PWGTP")

# Get the distribution of marriage-status responses
calc.distribution(responses, "marriage_status").round(3).sort_values(ascending=False)

# -- Output --
# marriage_status
# Married                                0.425
# Never married or under 15 years old    0.421
# Divorced                               0.097
# Widowed                                0.046
# Separated                              0.012
# Name: PWGTP, dtype: float64

More examples

See this notebook to see examples of other calculations, including grouped calculations.

Max Ghenis has created a version of the example notebook that can be run directly in your browser, via Google Colab.

Weightedcalcs in the wild

Other Python weighted-calculation libraries

Owner
Jeremy Singer-Vine
Human @ Internet • Data Editor @ BuzzFeed News • Newsletter-er @ data-is-plural.com
Jeremy Singer-Vine
2019 Data Science Bowl

Kaggle-2019-Data-Science-Bowl-Solution - Here i present my solution to kaggle 2019 data science bowl and how i improved it to win a silver medal in that competition.

Deepak Nandwani 1 Jan 01, 2022
Employee Turnover Analysis

Employee Turnover Analysis Submission to the DataCamp competition "Can you help reduce employee turnover?"

Jannik Wiedenhaupt 1 Feb 13, 2022
scikit-survival is a Python module for survival analysis built on top of scikit-learn.

scikit-survival scikit-survival is a Python module for survival analysis built on top of scikit-learn. It allows doing survival analysis while utilizi

Sebastian Pölsterl 876 Jan 04, 2023
Powerful, efficient particle trajectory analysis in scientific Python.

freud Overview The freud Python library provides a simple, flexible, powerful set of tools for analyzing trajectories obtained from molecular dynamics

Glotzer Group 195 Dec 20, 2022
A Python adaption of Augur to prioritize cell types in perturbation analysis.

A Python adaption of Augur to prioritize cell types in perturbation analysis.

Theis Lab 2 Mar 29, 2022
International Space Station data with Python research 🌎

International Space Station data with Python research 🌎 Plotting ISS trajectory, calculating the velocity over the earth and more. Plotting trajector

Facundo Pedaccio 41 Jun 16, 2022
Performance analysis of predictive (alpha) stock factors

Alphalens Alphalens is a Python Library for performance analysis of predictive (alpha) stock factors. Alphalens works great with the Zipline open sour

Quantopian, Inc. 2.5k Jan 09, 2023
PyEmits, a python package for easy manipulation in time-series data.

PyEmits, a python package for easy manipulation in time-series data. Time-series data is very common in real life. Engineering FSI industry (Financial

Thompson 5 Sep 23, 2022
Dbt-core - dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.

Dbt-core - dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.

dbt Labs 6.3k Jan 08, 2023
Vaex library for Big Data Analytics of an Airline dataset

Vaex-Big-Data-Analytics-for-Airline-data A Python notebook (ipynb) created in Jupyter Notebook, which utilizes the Vaex library for Big Data Analytics

Nikolas Petrou 1 Feb 13, 2022
Pandas and Dask test helper methods with beautiful error messages.

beavis Pandas and Dask test helper methods with beautiful error messages. test helpers These test helper methods are meant to be used in test suites.

Matthew Powers 18 Nov 28, 2022
Hangar is version control for tensor data. Commit, branch, merge, revert, and collaborate in the data-defined software era.

Overview docs tests package Hangar is version control for tensor data. Commit, branch, merge, revert, and collaborate in the data-defined software era

Tensorwerk 193 Nov 29, 2022
Repositori untuk menyimpan material Long Course STMKGxHMGI tentang Geophysical Python for Seismic Data Analysis

Long Course "Geophysical Python for Seismic Data Analysis" Instruktur: Dr.rer.nat. Wiwit Suryanto, M.Si Dipersiapkan oleh: Anang Sahroni Waktu: Sesi 1

Anang Sahroni 0 Dec 04, 2021
🌍 Create 3d-printable STLs from satellite elevation data 🌏

mapa 🌍 Create 3d-printable STLs from satellite elevation data Installation pip install mapa Usage mapa uses numpy and numba under the hood to crunch

Fabian Gebhart 13 Dec 15, 2022
Renato 214 Jan 02, 2023
talkbox is a scikit for signal/speech processing, to extend scipy capabilities in that domain.

talkbox is a scikit for signal/speech processing, to extend scipy capabilities in that domain.

David Cournapeau 76 Nov 30, 2022
PySpark bindings for H3, a hierarchical hexagonal geospatial indexing system

h3-pyspark: Uber's H3 Hexagonal Hierarchical Geospatial Indexing System in PySpark PySpark bindings for the H3 core library. For available functions,

Kevin Schaich 12 Dec 24, 2022
Hidden Markov Models in Python, with scikit-learn like API

hmmlearn hmmlearn is a set of algorithms for unsupervised learning and inference of Hidden Markov Models. For supervised learning learning of HMMs and

2.7k Jan 03, 2023
Data exploration done quick.

Pandas Tab Implementation of Stata's tabulate command in Pandas for extremely easy to type one-way and two-way tabulations. Support: Python 3.7 and 3.

W.D. 20 Aug 27, 2022
PLStream: A Framework for Fast Polarity Labelling of Massive Data Streams

PLStream: A Framework for Fast Polarity Labelling of Massive Data Streams Motivation When dataset freshness is critical, the annotating of high speed

4 Aug 02, 2022