Entity-Based Knowledge Conflicts in Question Answering.

Overview

Entity-Based Knowledge Conflicts in Question Answering

Run Instructions | Paper | Citation | License

This repository provides the Substitution Framework described in Section 2 of our paper Entity-Based Knowledge Conflicts in Question Answering. Given a quesion answering dataset, we derive a new dataset where the context passages have been modified to have new answers to their question. By training on the original examples and evaluating on the derived examples, we simulate a parametric-contextual knowledge conflict --- useful for understanding how model's employ sources of knowledge to arrive at a decision.

Our dataset derivation follows two steps: (1) identifying named entity answers, and (2) replacing all occurrences of the answer in the context with a substituted entity, effectively changing the answer. The answer substitutions depend on the chosen substitution policy.

Run Instructions

1. Setup

Setup requirements and download SpaCy and WikiData dependencies.

bash setup.sh

2. (Optional) Download and Process Wikidata

This optional stage reproduces wikidata/entity_info.json.gz, downloaded during Setup.

Download the Wikidata dump from October 2020 here and the Wikipedia pageviews from June 2, 2020 here.

NOTE: We don't use the newest Wikidata dump because Wikidata doesn't keep old dumps so reproducibility is an issue. If you'd like to use the newest dump, it is available here. Wikipedia pageviews, on the other hand, are kept around and can be found here. Be sure to download the *-user.bz2 file and not the *-automatic.bz2 or the *-spider.bz2 files.

To extract out Wikidata information, run the following (takes ~8 hours)

python extract_wikidata_info.py --wikidata_dump wikidata-20201026-all.json.bz2 --popularity_dump pageviews-20210602-user.bz2 --output_file entity_info.json.gz

The output file of this step is available here.

3. Load and Preprocess Dataset

PYTHONPATH=. python src/load_dataset.py -d MRQANaturalQuestionsTrain -w wikidata/entity_info.json.gz
PYTHONPATH=. python src/load_dataset.py -d MRQANaturalQuestionsDev -w wikidata/entity_info.json.gz

4. Generate Substitutions

PYTHONPATH=. python src/generate_substitutions.py --inpath datasets/normalized/MRQANaturalQuestionsTrain.jsonl --outpath datasets/substitution-sets/MRQANaturalQuestionsTrain
   
    .jsonl 
    
      -n 1 ...
PYTHONPATH=. python src/generate_substitutions.py --inpath datasets/normalized/MRQANaturalQuestionsDev.jsonl --outpath datasets/substitution-sets/MRQANaturalQuestionsDev
     
      .jsonl 
      
        -n 1 ...

      
     
    
   

See descriptions of the substitution policies (substitution-commands) we provide here. Inspect the argparse and substitution-specific subparsers in generate_substitutions.py to see additional arguments.

Our Substitution Functions

Here we define the the substitution functions we provide. These functions ingests a QADataset, and modifies the context passage, according to defined rules, such that there is now a new answer to the question, according to the context. Greater detail is provided in our paper.

  • Alias Substitution (sub-command: alias-substitution) --- Here we replace an answer with one of it's wikidata aliases. Since the substituted answer is always semantically equivalent, answer type preservation is naturally maintained.
  • Popularity Substitution (sub-command: popularity-substitution) --- Here we replace answers with a WikiData answer of the same type, with a specified popularity bracket (according to monthly page views).
  • Corpus Substitution (sub-command: corpus-substitution) --- Here we replace answers with other answers of the same type, sampled from the same corpus.
  • Type Swap Substitution (sub-command: type-swap-substitution) --- Here we replace answers with other answers of different type, sampled from the same corpus.

How to Add Your own Dataset / Substitution Fn / NER Models

Use your own Dataset

To add your own dataset, create your own subclass of QADataset (in src/classes/qadataset.py).

  1. Overwrite the read_original_dataset function, to read your dataset, creating a List of QAExample objects.
  2. Add your class and the url/filepath to the DATASETS variable in src/load_dataset.py.

See MRQANaturalQuetsionsDataset in src/classes/qadataset.py as an example.

Use your own Substitution Function

We define 5 different substitution functions in src/generate_substitutions.py. These are described here. Inspect their docstrings and feel free to add your own, leveraging any of the wikidata, derived answer type, or other info we populate for examples and answers. Here are the steps to create your own:

  1. Add a subparser in src/generate_substitutions.py for your new function, with any relevant parameters. See alias_sub_parser as an example.
  2. Add your own substitution function to src/substitution_fns.py, ensuring the signature arguments match those specified in the subparser. See alias_substitution_fn as an example.
  3. Add a reference to your new function to SUBSTITUTION_FNS in src/generate_substitutions.py. Ensure the dictionary key matches the subparser name.

Use your own Named Entity Recognition and/or Entity Linking Model

Our SpaCy NER model is trained and used mainly to categorize answer text into answer types. Only substitutions that preserve answer type are likely to be coherent.

The functions which need to be changed are:

  1. run_ner_linking in utils.py, which loads the NER model and populates info for each answer (see function docstring).
  2. Answer._select_answer_type() in src/classes/answer.py, which uses the NER answer type label and wikidata type labels to cateogrize the answer into a type category.

Citation

Please cite the following if you found this resource or our paper useful.

@misc{longpre2021entitybased,
      title={Entity-Based Knowledge Conflicts in Question Answering}, 
      author={Shayne Longpre and Kartik Perisetla and Anthony Chen and Nikhil Ramesh and Chris DuBois and Sameer Singh},
      year={2021},
      eprint={2109.05052},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

License

The Knowledge Conflicts repository, and entity-based substitution framework are licensed according to the LICENSE file.

Contact Us

To contact us feel free to email the authors in the paper or create an issue in this repository.

Owner
Apple
Apple
A Neural Net Training Interface on TensorFlow, with focus on speed + flexibility

Tensorpack is a neural network training interface based on TensorFlow. Features: It's Yet Another TF high-level API, with speed, and flexibility built

Tensorpack 6.2k Jan 09, 2023
This is the repo for the paper "Improving the Accuracy-Memory Trade-Off of Random Forests Via Leaf-Refinement".

Improving the Accuracy-Memory Trade-Off of Random Forests Via Leaf-Refinement This is the repository for the paper "Improving the Accuracy-Memory Trad

3 Dec 29, 2022
Code, pre-trained models and saliency results for the paper "Boosting RGB-D Saliency Detection by Leveraging Unlabeled RGB Images".

Boosting RGB-D Saliency Detection by Leveraging Unlabeled RGB This repository is the official implementation of the paper. Our results comming soon in

Xiaoqiang Wang 8 May 22, 2022
Robotics with GPU computing

Robotics with GPU computing Cupoch is a library that implements rapid 3D data processing for robotics using CUDA. The goal of this library is to imple

Shirokuma 625 Jan 07, 2023
Explore extreme compression for pre-trained language models

Code for paper "Exploring extreme parameter compression for pre-trained language models ICLR2022"

twinkle 16 Nov 14, 2022
Source code for CIKM 2021 paper for Relation-aware Heterogeneous Graph for User Profiling

RHGN Source code for CIKM 2021 paper for Relation-aware Heterogeneous Graph for User Profiling Dependencies torch==1.6.0 torchvision==0.7.0 dgl==0.7.1

Big Data and Multi-modal Computing Group, CRIPAC 6 Nov 29, 2022
This repository contains code for the paper "Disentangling Label Distribution for Long-tailed Visual Recognition", published at CVPR' 2021

Disentangling Label Distribution for Long-tailed Visual Recognition (CVPR 2021) Arxiv link Blog post This codebase is built on Causal Norm. Install co

Hyperconnect 85 Oct 18, 2022
Aws-machine-learning-university-accelerated-tab - Machine Learning University: Accelerated Tabular Data Class

Machine Learning University: Accelerated Tabular Data Class This repository contains slides, notebooks, and datasets for the Machine Learning Universi

AWS Samples 916 Dec 23, 2022
Neural HMMs are all you need (for high-quality attention-free TTS)

Neural HMMs are all you need (for high-quality attention-free TTS) Shivam Mehta, Éva Székely, Jonas Beskow, and Gustav Eje Henter This is the official

Shivam Mehta 0 Oct 28, 2022
Locally cache assets that are normally streamed in POPULATION: ONE

Population One Localizer This is no longer needed as of the build shipped on 03/03/22, thank you bigbox :) Locally cache assets that are normally stre

Ahman Woods 2 Mar 04, 2022
Repository for tackling Kaggle Ultrasound Nerve Segmentation challenge using Torchnet.

Ultrasound Nerve Segmentation Challenge using Torchnet This repository acts as a starting point for someone who wants to start with the kaggle ultraso

Qure.ai 46 Jul 18, 2022
Neural network chess engine trained on Gary Kasparov's games.

Neural Chess It's not the best chess engine, but it is a chess engine. Proof of concept neural network chess engine (feed-forward multi-layer perceptr

3 Jun 22, 2022
[CVPR 2022] TransEditor: Transformer-Based Dual-Space GAN for Highly Controllable Facial Editing

TransEditor: Transformer-Based Dual-Space GAN for Highly Controllable Facial Editing (CVPR 2022) This repository provides the official PyTorch impleme

Billy XU 128 Jan 03, 2023
The official repo for CVPR2021——ViPNAS: Efficient Video Pose Estimation via Neural Architecture Search.

ViPNAS: Efficient Video Pose Estimation via Neural Architecture Search [paper] Introduction This is the official implementation of ViPNAS: Efficient V

Lumin 42 Sep 26, 2022
Self-supervised Multi-modal Hybrid Fusion Network for Brain Tumor Segmentation

JBHI-Pytorch This repository contains a reference implementation of the algorithms described in our paper "Self-supervised Multi-modal Hybrid Fusion N

FeiyiFANG 5 Dec 13, 2021
Source code of the paper Meta-learning with an Adaptive Task Scheduler.

ATS About Source code of the paper Meta-learning with an Adaptive Task Scheduler. If you find this repository useful in your research, please cite the

Huaxiu Yao 16 Dec 26, 2022
Implementation of the Point Transformer layer, in Pytorch

Point Transformer - Pytorch Implementation of the Point Transformer self-attention layer, in Pytorch. The simple circuit above seemed to have allowed

Phil Wang 501 Jan 03, 2023
Official implementation of "StyleCariGAN: Caricature Generation via StyleGAN Feature Map Modulation" (SIGGRAPH 2021)

StyleCariGAN in PyTorch Official implementation of StyleCariGAN:Caricature Generation via StyleGAN Feature Map Modulation in PyTorch Requirements PyTo

PeterZhouSZ 49 Oct 31, 2022
IRON Kaggle project done while doing IRONHACK Bootcamp where we had to analyze and use a Machine Learning Project to predict future sales

IRON Kaggle project done while doing IRONHACK Bootcamp where we had to analyze and use a Machine Learning Project to predict future sales. In this case, we ended up using XGBoost because it was the o

1 Jan 04, 2022
To build a regression model to predict the concrete compressive strength based on the different features in the training data.

Cement-Strength-Prediction Problem Statement To build a regression model to predict the concrete compressive strength based on the different features

Ashish Kumar 4 Jun 11, 2022