CausalNLP is a practical toolkit for causal inference with text as treatment, outcome, or "controlled-for" variable.

Overview

CausalNLP

Install

  1. pip install -U pip
  2. pip install causalnlp

Usage

Example: What is the causal impact of a positive review on a product click?

import pandas as pd
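# on_bad_lines='skip' replaces error_bad_lines=False, which was deprecated in pandas 1.3 and removed in 2.0
df = pd.read_csv('sample_data/music_seed50.tsv', sep='\t', on_bad_lines='skip')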

The file music_seed50.tsv is a semi-simulated dataset. Relevant columns include:

  • Y_sim: outcome, where 1 means product was clicked and 0 means not.
  • text: raw text of review
  • rating: rating associated with review (1 through 5)
  • T_true: 0 means rating less than 3, 1 means rating of 5, where T_true affects the outcome Y_sim.
  • T_ac: an approximation of true review sentiment (T_true) created with Autocoder from raw review text
  • C_true: confounding categorical variable (1=audio CD, 0=other)

We'll pretend the true sentiment (i.e., review rating and T_true) is hidden and only use T_ac as the treatment variable.
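Because T_ac is only an approximation of T_true, it can be worth checking how often the two agree before treating T_true as hidden. The following is a minimal sketch in plain Python, not part of CausalNLP: the helper name `agreement_table` is hypothetical and the labels are made-up toy values, not rows from music_seed50.tsv.

```python
from collections import Counter

def agreement_table(true_labels, proxy_labels):
    """Count (true, proxy) label pairs and return the counts plus the agreement rate."""
    if not true_labels or len(true_labels) != len(proxy_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    counts = Counter(zip(true_labels, proxy_labels))
    agree = sum(n for (t, p), n in counts.items() if t == p)
    return counts, agree / len(true_labels)

# Made-up toy labels for illustration only
t_true = [1, 1, 0, 0, 1, 0, 1, 0]
t_ac   = [1, 0, 0, 0, 1, 1, 1, 0]

counts, rate = agreement_table(t_true, t_ac)
print("both positive:", counts[(1, 1)])
print("both negative:", counts[(0, 0)])
print("agreement rate:", rate)
```

With the real dataset, the same helper could be called on `df['T_true'].tolist()` and `df['T_ac'].tolist()`.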

Using the text_col parameter, we include the raw review text as another "controlled-for" variable.

from causalnlp.causalinference import CausalInferenceModel
from lightgbm import LGBMClassifier
# `method` replaced the older `metalearner_type` parameter name in v0.4.0
cm = CausalInferenceModel(df,
                         method='t-learner', learner=LGBMClassifier(num_leaves=500),
                         treatment_col='T_ac', outcome_col='Y_sim', text_col='text',
                         include_cols=['C_true'])
cm.fit()
outcome column (categorical): Y_sim
treatment column: T_ac
numerical/categorical covariates: ['C_true']
text covariate: text
preprocess time:  1.1179866790771484  sec
start fitting causal inference model
time to fit causal inference model:  10.361494302749634  sec

Estimating Treatment Effects

CausalNLP supports estimation of heterogeneous treatment effects (i.e., how causal impacts vary across observations, which could be documents, emails, posts, individuals, or organizations).

We will first calculate the overall average treatment effect (or ATE), which shows that a positive review increases the probability of a click by 13 percentage points in this dataset.

Average Treatment Effect (or ATE):

print( cm.estimate_ate() )
{'ate': 0.1309311542209525}
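For intuition about what a T-learner computes, here is a rough sketch in plain Python. It is not CausalNLP's implementation: the base learner is simply the mean outcome per value of one categorical covariate (a deliberately simple stand-in for LGBMClassifier), the function names are hypothetical, and the rows are made-up toy data. One model is fit per treatment arm, both potential outcomes are predicted for every row, and the differences are averaged.

```python
from collections import defaultdict

def fit_arm(rows, arm):
    """Base learner for one treatment arm: mean outcome per covariate value."""
    sums, counts = defaultdict(float), defaultdict(int)
    for t, c, y in rows:
        if t == arm:
            sums[c] += y
            counts[c] += 1
    return {c: sums[c] / counts[c] for c in counts}

def t_learner_effects(rows):
    """Per-row effect estimates: mu1(c) - mu0(c) for each row's covariate value c.

    Assumes every covariate value appears in both arms (overlap);
    otherwise the lookup raises KeyError.
    """
    mu1 = fit_arm(rows, 1)  # outcome model fit on treated rows only
    mu0 = fit_arm(rows, 0)  # outcome model fit on control rows only
    return [mu1[c] - mu0[c] for _, c, _ in rows]

# Made-up rows of (treatment, covariate, outcome)
rows = [
    (1, 1, 1), (1, 1, 1), (0, 1, 1), (0, 1, 0),
    (1, 0, 1), (1, 0, 0), (1, 0, 0), (1, 0, 0), (0, 0, 0), (0, 0, 0),
]

effects = t_learner_effects(rows)
ate = sum(effects) / len(effects)
print("per-row effects:", effects)
print("ATE:", ate)
```

In this toy data the estimated effect is 0.5 for rows with covariate 1 and 0.25 for rows with covariate 0, so the average over all ten rows is a weighted mix of the two.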

Conditional Average Treatment Effect (or CATE): reviews that mention the word "toddler":

print( cm.estimate_ate(df['text'].str.contains('toddler')) )
{'ate': 0.15559234254638685}
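Conceptually, a CATE for a subgroup is the average of the per-row effect estimates restricted to the rows in that subgroup. The sketch below illustrates this with made-up texts and made-up effect values; `subgroup_effect` is a hypothetical helper, not a CausalNLP function.

```python
def subgroup_effect(effects, texts, keyword):
    """Average the per-row effects over rows whose text contains `keyword`."""
    selected = [e for e, txt in zip(effects, texts) if keyword in txt]
    if not selected:
        raise ValueError(f"no rows mention {keyword!r}")
    return sum(selected) / len(selected)

texts = [
    "my toddler loves this album",
    "great songs for a road trip",
    "bought it for our toddler",
    "not my favorite record",
]
effects = [0.25, 0.125, 0.5, 0.0]  # made-up per-row effect estimates

overall = sum(effects) / len(effects)
cate = subgroup_effect(effects, texts, "toddler")
print("overall average:", overall)
print("toddler subgroup:", cate)
```

Note that the helper uses a plain substring test, whereas pandas `str.contains` interprets its pattern as a regular expression by default.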

Individualized Treatment Effects (or ITE):

test_df = pd.DataFrame({'T_ac' : [1], 'C_true' : [1], 
                        'text' : ['I never bought this album, but I love his music and will soon!']})
effect = cm.predict(test_df)
print(effect)
[[0.80538201]]

Model Interpretability:

print( cm.interpret(plot=False)[1][:10] )
v_music    0.079042
v_cd       0.066838
v_album    0.055168
v_like     0.040784
v_love     0.040635
C_true     0.039949
v_just     0.035671
v_song     0.035362
v_great    0.029918
v_heard    0.028373
dtype: float64

Features with the v_ prefix are word features. C_true is the categorical variable indicating whether or not the product is a CD.
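To make the naming concrete, the sketch below shows one way a review could be turned into v_-prefixed word features sitting alongside C_true. This is an illustration only: the binary word-presence encoding, the tokenizer, the vocabulary, and the helper name `word_features` are all assumptions and may differ from what CausalNLP does internally.

```python
import re

def word_features(text, vocabulary):
    """Map a document to binary word-presence features named with a v_ prefix."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return {f"v_{word}": int(word in tokens) for word in vocabulary}

vocabulary = ["music", "cd", "album", "love"]  # hypothetical vocabulary
row = word_features("I love his music and will buy the album soon!", vocabulary)
row["C_true"] = 1  # non-text covariates keep their original column name
print(row)
```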

Text is Optional in CausalNLP

Despite the "NLP" in CausalNLP, the library can be used for causal inference on data without text (e.g., only numerical and categorical variables). See the examples for more info.

Documentation

API documentation and additional usage examples are available at: https://amaiya.github.io/causalnlp/

How to Cite

Please cite the following paper when using CausalNLP in your work:

@article{maiya2021causalnlp,
    title={CausalNLP: A Practical Toolkit for Causal Inference with Text},
    author={Arun S. Maiya},
    year={2021},
    eprint={2106.08043},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    journal={arXiv preprint arXiv:2106.08043},
}
Comments
  • Does your model support other languages than English?

    Hi Amaiya, Thanks for your great package. Would you kindly let me know if your package supports languages other than English when using CausalBert?

    I'm also interested in knowing whether I can exploit other Transformers models from the Huggingface hub?

    question 
    opened by behroozazarkhalili 1
  • Error while fitting the model

    Hi,

    I ran into this bug while fitting the model. I checked the data and everything looks good. I can't find the root cause of this error.

    File /opt/conda/lib/python3.8/site-packages/causalnlp/meta/slearner.py:80, in BaseSLearner.fit(self, X, treatment, y, p)
         78 mask = (treatment == group) | (treatment == self.control_name)
         79 treatment_filt = treatment[mask]
    ---> 80 X_filt = X[mask]
         81 y_filt = y[mask]
         83 w = (treatment_filt == group).astype(int)
    
    IndexError: boolean index did not match indexed array along dimension 0
    
    opened by hfarhidzadeh 1
Releases(v0.7.0)
  • v0.7.0(Aug 2, 2022)

  • v0.6.0(Oct 20, 2021)

    0.6.0 (2021-10-20)

    New:

    • Added model_name parameter to CausalBertModel to support other DistilBert models (e.g., multilingual)

    Changed:

    • N/A

    Fixed:

    • N/A
  • v0.5.0(Sep 3, 2021)

    0.5.0 (2021-09-03)

    New:

    • Added support for CausalBert

    Changed:

    • Added p parameter to CausalInferenceModel.fit and CausalInferenceModel.predict for user-supplied propensity scores in X-Learner and R-Learner.
    • Removed CV from propensity score computations in X-Learner and R-Learner and increased default max_iter to 10000

    Fixed:

    • Resolved problem with CausalInferenceModel.tune_and_use_default_learner when outcome is continuous
    • Changed to max_iter=10000 for default LogisticRegression base learner
  • v0.4.0(Sep 3, 2021)

    0.4.0 (2021-07-20)

    New:

    • N/A

    Changed:

    • Use LinearRegression and LogisticRegression as default base learners for s-learner.
    • Renamed the metalearner_type parameter to method in CausalInferenceModel.

    Fixed:

    • Resolved mis-references in _balance method (renamed from _minimize_bias).
    • Fixed convergence issues and factored out propensity score computations to CausalInferenceModel.compute_propensity_scores.
  • v0.3.1(Jul 19, 2021)

  • v0.3.0(Jul 15, 2021)

    0.3.0 (2021-07-15)

    New:

    • Added CausalInferenceModel.evaluate_robustness method to assess robustness of causal estimates using sensitivity analysis

    Changed:

    • Reduced dependencies with local metalearner implementations

    Fixed:

    • N/A
  • v0.2.0(Jun 21, 2021)

  • v0.1.3(Jun 17, 2021)

  • v0.1.2(Jun 17, 2021)

    0.1.2 (2021-06-17)

    New:

    • N/A

    Changed:

    • Better interpretability and explainability of treatment effects

    Fixed:

    • Fixes to some bugs in preprocessing
  • v0.1.1(Jun 17, 2021)

  • v0.1.0(Jun 16, 2021)

Owner
Arun S. Maiya
computer scientist