Python module for performing linear regression for data with measurement errors and intrinsic scatter

Overview

Linear regression for data with measurement errors and intrinsic scatter (BCES)

Python module for performing robust linear regression on (X,Y) data points where both X and Y have measurement errors.

The fitting method is bivariate correlated errors and intrinsic scatter (BCES), following the description given in Akritas & Bershady 1996, ApJ. Some of the advantages of BCES regression compared to ordinary least squares fitting (quoted from Akritas & Bershady 1996):

  • it allows for measurement errors on both variables
  • it permits the measurement errors for the two variables to be dependent
  • it permits the magnitudes of the measurement errors to depend on the measurements
  • other "symmetric" lines such as the bisector and the orthogonal regression can be constructed.
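
To make the first point concrete, here is a minimal sketch of the BCES(y|x) point estimate for y = ax + b as described in Akritas & Bershady 1996: the sample covariance and variance are corrected by the mean error covariance and the mean squared x error. The function name `bces_yx` is hypothetical and this is not the module's own implementation; it only returns the slope and intercept, not their uncertainties.

```python
from statistics import fmean


def bces_yx(x, xerr, y, yerr, cov):
    """Illustrative BCES(y|x) slope a and intercept b for y = a*x + b.

    Hypothetical helper, not part of the bces package. yerr is accepted
    only to mirror the argument order used in this README; it does not
    enter the y|x point estimate.
    """
    xm, ym = fmean(x), fmean(y)
    # Sample covariance of (x, y) and sample variance of x (1/n normalisation).
    sxy = fmean([(xi - xm) * (yi - ym) for xi, yi in zip(x, y)])
    sxx = fmean([(xi - xm) ** 2 for xi in x])
    # Subtract the average contribution of the measurement errors.
    a = (sxy - fmean(cov)) / (sxx - fmean([e ** 2 for e in xerr]))
    b = ym - a * xm
    return a, b


x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]
zeros = [0.0] * len(x)
a, b = bces_yx(x, zeros, y, zeros, zeros)
```

With all errors set to zero the estimate reduces to ordinary least squares. A nonzero xerr shrinks the denominator and therefore steepens the slope, which is the correction for the bias that measurement errors in x introduce.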

In order to understand how to perform and interpret the regression results, please read the paper.

Installation

Using pip:

pip install bces

If that does not work, you can install it using the setup.py script:

python setup.py install

You may need to run the last command with sudo.

Alternatively, if you plan to modify the source then install the package with a symlink, so that changes to the source files will be immediately available:

python setup.py develop

Usage

import bces.bces as BCES
a,b,aerr,berr,covab=BCES.bcesp(x,xerr,y,yerr,cov)

Arguments:

  • x,y : 1D data arrays
  • xerr,yerr: measurement errors affecting x and y, 1D arrays
  • cov : covariance between the measurement errors, 1D array

If you have no reason to believe that your measurement errors are correlated (which is usually the case), you can provide an array of zeroes as input for cov:

import numpy
cov = numpy.zeros_like(x)
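
If the errors are correlated, one way to build cov is from a correlation coefficient, since the covariance of two errors equals the correlation coefficient times the product of their standard deviations. The sketch below assumes a single coefficient `rho` shared by all points; the helper name `error_covariance` and the `rho` parameter are hypothetical and not part of this package.

```python
def error_covariance(xerr, yerr, rho=0.0):
    """Per-point covariance of the x and y measurement errors.

    Hypothetical helper: rho is an assumed correlation coefficient
    between the errors, applied to every data point.
    """
    if not -1.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [-1, 1]")
    return [rho * ex * ey for ex, ey in zip(xerr, yerr)]


xerr = [0.1, 0.2]
yerr = [0.3, 0.4]
cov_uncorrelated = error_covariance(xerr, yerr)          # all zeros
cov_correlated = error_covariance(xerr, yerr, rho=0.5)
```

The result is a plain list; converting it with numpy.asarray before passing it as cov keeps it consistent with the other 1D arrays.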

Output:

  • a,b : best-fit parameters a,b of the linear regression such that y = ax + b.
  • aerr,berr : the standard deviations in a,b
  • covab : the covariance between a and b (e.g. for plotting confidence bands)

Each element of the arrays a, b, aerr, berr and covab corresponds to the result of one of the different BCES lines: y|x, x|y, bisector and orthogonal, as detailed in the table below. Please read the original BCES paper to understand what these different lines mean.

Element  Method      Description
0        y|x         Assumes x as the independent variable
1        x|y         Assumes y as the independent variable
2        bisector    Line that bisects the y|x and x|y lines. This approach is self-inconsistent; do not use it (cf. Hogg, D. et al. 2010, arXiv:1008.4686).
3        orthogonal  Orthogonal least squares: line that minimizes orthogonal distances. Should be used when it is not clear which variable should be treated as the independent one.
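
Because every output is indexed by method, a small lookup can make the calling code easier to read. The sketch below is only a convenience wrapper: `METHOD_INDEX` and `select_fit` are hypothetical names, and the numbers in the usage lines are placeholders standing in for real bcesp output.

```python
# Element index of each BCES line in the output arrays (see the table above).
METHOD_INDEX = {"y|x": 0, "x|y": 1, "bisector": 2, "orthogonal": 3}


def select_fit(a, b, aerr, berr, covab, method="orthogonal"):
    """Return the fit parameters of one BCES line as a dictionary.

    Hypothetical helper; works on any indexable sequences of length 4.
    """
    i = METHOD_INDEX[method]
    return {"a": a[i], "b": b[i], "aerr": aerr[i],
            "berr": berr[i], "covab": covab[i]}


# Placeholder values, not real fit results.
a = [2.0, 2.1, 2.05, 2.07]
b = [1.0, 0.9, 0.95, 0.93]
aerr = [0.10, 0.12, 0.11, 0.11]
berr = [0.20, 0.22, 0.21, 0.21]
covab = [-0.010, -0.012, -0.011, -0.011]
fit = select_fit(a, b, aerr, berr, covab, method="y|x")
```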

By default, bcesp runs in parallel with bootstrapping.
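
For readers unfamiliar with bootstrapping, the general idea is to resample the data points with replacement, refit each resample, and use the spread of the refitted parameters as their uncertainty. The sketch below illustrates this serially, with ordinary least squares standing in for the BCES fit; `ols_fit`, `bootstrap_fit`, `nsim` and `seed` are names chosen for this example, and it does not claim to reproduce how bcesp is implemented.

```python
import random
from statistics import fmean, stdev


def ols_fit(x, y):
    """Ordinary least squares slope and intercept (stand-in for a BCES fit)."""
    xm, ym = fmean(x), fmean(y)
    sxx = sum((xi - xm) ** 2 for xi in x)
    sxy = sum((xi - xm) * (yi - ym) for xi, yi in zip(x, y))
    a = sxy / sxx
    return a, ym - a * xm


def bootstrap_fit(x, y, nsim=1000, seed=0):
    """Mean and standard deviation of slope and intercept over resamples."""
    if len(set(x)) < 2:
        raise ValueError("need at least two distinct x values")
    rng = random.Random(seed)
    n = len(x)
    slopes, intercepts = [], []
    while len(slopes) < nsim:
        # Draw n indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in idx]
        if len(set(xs)) < 2:
            continue  # degenerate resample: slope is undefined
        a, b = ols_fit(xs, [y[i] for i in idx])
        slopes.append(a)
        intercepts.append(b)
    return fmean(slopes), fmean(intercepts), stdev(slopes), stdev(intercepts)


x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y_exact = [2.0 * xi + 1.0 for xi in x]
a, b, aerr, berr = bootstrap_fit(x, y_exact, nsim=200)
```

Each resample is independent of the others, which is why this kind of calculation parallelizes well.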

Examples

bces-example.ipynb is a jupyter notebook including a practical, step-by-step example of how to use BCES to perform regression on data with uncertainties on x and y. It also illustrates how to plot the confidence band for a fit.

If you have suggestions for more examples, feel free to add them.

Running Tests

To test your installation, run the following command inside the BCES directory:

pytest -v

Requirements

See requirements.txt.

Citation

If you end up using this code in your paper, you are morally obliged to cite the following works

I spent considerable time writing this code, making sure it is correct and user-friendly, so I would appreciate your citation of the second paper in the above list as a token of gratitude.

If you are really happy with the code, you can buy me a beer.

Misc.

This Python module is inspired by the (much faster) Fortran routine originally written by Akritas et al. I wrote it because I wanted something more portable and easier to use, trading off speed.

For a general tutorial on how to (and how not to) perform linear regression, please read this paper: Hogg, D. et al. 2010, arXiv:1008.4686. In particular, please refrain from using the bisector method.

If you want to plot confidence bands for your fits, have a look at the nmmn package (in particular, the modules nmmn.plots.fitconf and stats).
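
As a rough, package-free alternative, a band around the best-fit line can be estimated by first-order error propagation for y = ax + b, where the variance of the predicted y at a given x is x² aerr² + berr² + 2 x covab. The sketch below assumes this simple propagation formula and is not necessarily what nmmn.plots.fitconf computes; the function name `fit_band` and the `nsigma` parameter are hypothetical.

```python
import math


def fit_band(xgrid, a, b, aerr, berr, covab, nsigma=1.0):
    """Lower and upper band around y = a*x + b from error propagation.

    Hypothetical helper: uses var(y) = x^2 aerr^2 + berr^2 + 2 x covab.
    """
    lower, upper = [], []
    for x in xgrid:
        var = (x * aerr) ** 2 + berr ** 2 + 2.0 * x * covab
        sigma = math.sqrt(max(var, 0.0))  # guard against round-off below zero
        y = a * x + b
        lower.append(y - nsigma * sigma)
        upper.append(y + nsigma * sigma)
    return lower, upper


# Placeholder fit values, not real results.
lower, upper = fit_band([0.0, 1.0], a=2.0, b=1.0, aerr=0.1, berr=0.2, covab=0.0)
```

The two returned lists can be passed to a plotting routine that fills the region between them. A negative covab, which is common for slope and intercept, narrows the band at positive x.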

Bayesian linear regression

There are a couple of Bayesian approaches to perform linear regression which can be more powerful than BCES, some of which are described below.

A Gibbs Sampler for Multivariate Linear Regression: R code, arXiv:1509.00908. Linear regression in the fairly general case of errors in X and Y, which may be correlated, plus intrinsic scatter. The prior distribution of covariates is modeled by a flexible mixture of Gaussians. This is an extension of the very nice work by Brandon Kelly (Kelly, B. 2007, ApJ).

LIRA: A Bayesian approach to linear regression in astronomy: R code, arXiv:1509.05778. Bayesian hierarchical modelling of data with heteroscedastic and possibly correlated measurement errors and intrinsic scatter. The method fully accounts for time evolution. The slope, the normalization, and the intrinsic scatter of the relation can evolve with the redshift. The intrinsic distribution of the independent variable is approximated using a mixture of Gaussian distributions whose means and standard deviations depend on time. The method can address scatter in the measured independent variable (a kind of Eddington bias), selection effects in the response variable (Malmquist bias), and departure from linearity in the form of a knee.

AstroML: Machine Learning and Data Mining for Astronomy. Python example of a linear fit to data with correlated errors in x and y using AstroML. In the literature, this is often referred to as total least squares or errors-in-variables fitting.

Todo

If you have improvements to the code, suggestions for examples, ways to speed up the code, etc., feel free to submit a pull request.

  • implement weighted least squares (WLS)
  • implement unit testing: bces
  • unit testing: bootstrap

Visit the author's web page and/or follow him on twitter (@nemmen).


Copyright (c) 2021, Rodrigo Nemmen. All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  • Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
  • Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
