A Numba-based two-point correlation function calculator using a grid decomposition

Overview

numba-2pcf

A Numba-based two-point correlation function (2PCF) calculator using a grid decomposition. Like Corrfunc, but written in Numba, with simplicity and hackability in mind.

Aside from the 2PCF calculation, the particle_grid module is both simple and fast and may be useful on its own as a way to partition particle sets in 3D.
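
To illustrate the idea (this is only a sketch of the approach, not the particle_grid module's actual interface, which may differ), a minimal grid decomposition can be written as a counting sort of particles into cells:

import numpy as np

def simple_particle_grid(pos, box, ngrid):
    """Toy 3D grid decomposition (illustrative only).

    Returns the positions reordered so that particles in the same cell are
    contiguous, plus per-cell offsets: cell c owns sorted_pos[offsets[c]:offsets[c+1]].
    """
    # integer cell index along each axis, clipped to guard against pos == box
    idx = np.minimum((pos / box * ngrid).astype(np.int64), ngrid - 1)
    # flatten the 3D cell index to a single integer per particle
    cellid = (idx[:, 0] * ngrid + idx[:, 1]) * ngrid + idx[:, 2]
    order = np.argsort(cellid)
    # counting sort: offsets[c] = number of particles in cells 0..c-1
    offsets = np.zeros(ngrid**3 + 1, dtype=np.int64)
    np.add.at(offsets, cellid + 1, 1)
    offsets = np.cumsum(offsets)
    return pos[order], offsets

# for a 2PCF with maximum radius Rmax, choose ngrid = int(box // Rmax)
# so that every cell is at least Rmax on a side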

Installation

$ git clone https://github.com/lgarrison/numba-2pcf.git
$ cd numba-2pcf
$ python -m pip install -e .

Example

from numba_2pcf.cf import numba_2pcf
import numpy as np

rng = np.random.default_rng(123)
N = 10**6
box = 2.
pos = rng.random((N,3), dtype=np.float32)*box

res = numba_2pcf(pos, box, Rmax=0.05, nbin=10)
res.pprint_all()
        rmin                 rmax                 rmid                    xi            npairs 
-------------------- -------------------- -------------------- ----------------------- --------
                 0.0 0.005000000074505806 0.002500000037252903   -0.004519257448573177    65154
0.005000000074505806 0.010000000149011612  0.00750000011175871   0.0020113763064291135   459070
0.010000000149011612  0.01500000022351742 0.012500000186264515    0.000984359247434119  1244770
 0.01500000022351742 0.020000000298023225 0.017500000260770324  -6.616896085054336e-06  2421626
0.020000000298023225  0.02500000037252903 0.022500000335276125  0.00019365366488166558  3993210
 0.02500000037252903  0.03000000044703484 0.027500000409781934   5.769329601057471e-05  5956274
 0.03000000044703484  0.03500000052154064 0.032500000484287736   0.0006815801672250821  8317788
 0.03500000052154064  0.04000000059604645 0.037500000558793545    2.04711840243732e-05 11061240
 0.04000000059604645  0.04500000067055226 0.042500000633299354   9.313641918828885e-05 14203926
 0.04500000067055226  0.05000000074505806  0.04750000070780516 -0.00011690771042793813 17734818
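
Since the input catalog here is uniform random points, the measured xi(r) is consistent with zero in every bin, as expected for an unclustered sample.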

Performance

The goal of this project is not to squeeze the absolute best performance out of any given piece of hardware, but it is a goal to get as much performance as Numba allows while keeping the code readable. So we pay special attention to things like dtype (use float32 particle inputs when possible!), parallelization, and early-exit conditions (skipping a pair as soon as we know it can't fall in any bin).

As a demonstration that this code provides passably good performance, here's a dummy test of 10⁷ unclustered data points in a 2 Gpc/h box (so number density 1.2e-3), with Rmax = 150 Mpc/h and a bin width of 1 Mpc/h:

from numba_2pcf.cf import numba_2pcf
import numpy as np

rng = np.random.default_rng(123)
N = 10**7
box = 2000
pos = rng.random((N,3), dtype=np.float32)*box

%timeit numba_2pcf(pos, box, Rmax=150, nbin=150, corrfunc=False, nthread=24)  # 3.5 s
%timeit numba_2pcf(pos, box, Rmax=150, nbin=150, corrfunc=True, nthread=24)  # 1.3 s

So within a factor of 3 of Corrfunc, and we aren't even exploiting the symmetry of the autocorrelation (i.e. we count every pair twice). Not bad!

Testing Against Corrfunc

The code is tested against Corrfunc. In fact, the numba_2pcf() function accepts a corrfunc=True flag that calls Corrfunc instead of the Numba implementation, which makes such comparisons even easier.
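
For example, a quick cross-check of the two implementations might look like the following (a sketch; the repo's actual test suite may use different data and tolerances, and corrfunc=True requires Corrfunc to be installed):

from numba_2pcf.cf import numba_2pcf
import numpy as np

rng = np.random.default_rng(42)
box = 100.
pos = rng.random((10**5, 3), dtype=np.float32) * box

res_numba = numba_2pcf(pos, box, Rmax=10., nbin=20)               # Numba implementation
res_cf = numba_2pcf(pos, box, Rmax=10., nbin=20, corrfunc=True)   # dispatch to Corrfunc

# the two results should agree closely, bin by bin
assert np.allclose(res_numba['xi'], res_cf['xi'])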

Details

numba_2pcf works a lot like Corrfunc, or any other grid-based 2PCF code: the 3D volume is divided into a grid of cells at least Rmax in size, where Rmax is the maximum radius of the correlation function measurement. Then, we know all valid particle pairs must be in neighboring cells. So the task is simply to loop through each cell in the grid, pairing it with each of its 26 neighbors (plus itself). We parallelize over cell pairs, and add up all the pair counts across threads at the end.
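
In simplified form, the cell-pair loop looks roughly like the sketch below (serial and unoptimized, for illustration only; the actual implementation is Numba-compiled and parallelized over cell pairs). It assumes particles have already been sorted into cells, as in the particle_grid sketch above:

import numpy as np

def pair_counts_sketch(sorted_pos, offsets, ngrid, box, Rmax, nbin):
    """Count pairs by looping over each grid cell and its 27 neighbor cells.

    sorted_pos/offsets: particles sorted by cell, with cell c owning
    sorted_pos[offsets[c]:offsets[c+1]] (see the grid sketch above).
    Every pair is counted twice, matching the note in the Performance section.
    """
    counts = np.zeros(nbin, dtype=np.int64)
    for cx in range(ngrid):
        for cy in range(ngrid):
            for cz in range(ngrid):
                c = (cx * ngrid + cy) * ngrid + cz
                p1 = sorted_pos[offsets[c]:offsets[c + 1]]
                # the cell itself plus its 26 neighbors, with periodic wrapping
                for dx in (-1, 0, 1):
                    for dy in (-1, 0, 1):
                        for dz in (-1, 0, 1):
                            nx, ny, nz = (cx + dx) % ngrid, (cy + dy) % ngrid, (cz + dz) % ngrid
                            n = (nx * ngrid + ny) * ngrid + nz
                            p2 = sorted_pos[offsets[n]:offsets[n + 1]]
                            for a in p1:
                                d = np.abs(a - p2)
                                d = np.minimum(d, box - d)   # periodic (minimum-image) separation
                                r = np.sqrt((d * d).sum(axis=1))
                                # keep only separations that can land in a bin
                                # (the real code skips such pairs much earlier)
                                r = r[(r > 0) & (r < Rmax)]
                                counts += np.histogram(r, bins=nbin, range=(0., Rmax))[0]
    return counts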

This grid decomposition prunes distant pairwise comparisons, so even though the runtime still formally scales as O(N²), it makes the 2PCF tractable for many realistic problems in cosmology and large-scale structure.
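
As a rough illustration using the numbers from the Performance section: with Rmax = 150 Mpc/h in a 2000 Mpc/h box, the grid is 13 cells on a side (2000/150 rounded down), so each particle is only compared against particles in 27 of the 13³ = 2197 cells --- about 1% of the volume, or roughly 80× fewer distance evaluations than a brute-force double loop over all pairs.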

A Numba implementation isn't likely to beat Corrfunc on speed, but Numba can still be fast enough to be useful (especially when the computation parallelizes well). The idea is that this code provides a "fast enough" parallel implementation while still being highly readable --- the 2PCF implementation is about 150 lines of code, and the gridding scheme about 100.

Branches

The particle-jackknife branch contains an implementation of an idea for computing the xi(r) variance based on the variance of the per-particle xi(r) measurements. It doesn't seem to be measuring the right thing, but the code is left for posterity.

Acknowledgments

This repo was generated from @DFM's Cookiecutter Template. Thanks, DFM!
