EMNLP'2021: Simple Entity-centric Questions Challenge Dense Retrievers

Overview

EntityQuestions

This repository contains the EntityQuestions dataset as well as code to evaluate retrieval results from the the paper Simple Entity-centric Questions Challenge Dense Retrievers by Chris Sciavolino*, Zexuan Zhong*, Jinhyuk Lee, and Danqi Chen (* equal contribution).

[9/16/21] This repo is not yet set in stone, we're still putting finishing touches on the tooling and documentation :) Thanks for your patience!

Quick Links

Installation

You can download a .zip file of the dataset here, or using wget with the command:

$ wget https://nlp.cs.princeton.edu/projects/entity-questions/dataset.zip

We include the dependencies needed to run the code in this repository. We recommend having a separate miniconda environment for running DPR code. You can create the environment using the following commands:

$ conda create -n EntityQ python=3.6
$ conda activate EntityQ
$ pip install -r requirements.txt

Dataset Overview

The unzipped dataset directory should have the following structure:

dataset/
    | train/
        | P*.train.json     // all randomly sampled training files 
    | dev/
        | P*.dev.json       // all randomly sampled development files
    | test/
        | P*.test.json      // all randomly sampled testing files
    | one-off/
        | common-random-buckets/
            | P*/
                | bucket*.test.json
        | no-overlap/
            | P*/
                | P*_no_overlap.{train,dev,test}.json
        | nq-seen-buckets/
            | P*/
                bucket*.test.json
        | similar/
            | P*
                | P*_similar.{train,dev,test}.json

The main dataset is included in dataset/ under train/, dev/, and test/, each containing the randomly sampled training, development, and testing subsets, respectively. For example, the evaluation set for place-of-birth (P19) can be found in the dataset/test/P19.test.json file.

We also include all of the one-off datasets we used to generate the tables/figures presented in the paper under dataset/one-off/, explained below:

  • one-off/common-random-buckets/ contains buckets of 1,000 randomly sampled examples, used to produce Fig. 1 of the paper (specifically for rand-ent).
  • one-off/no-overlap/ contains the training/development splits for our analyses in Section 4.1 of the paper (we do not use the testing split in our analysis). These training/development sets have subject entities with no token overlap with subject entities of the randomly sampled test set (specifically for all fine-tuning in Table 2).
  • one-off/nq-seen-buckets/ contains buckets of questions with subject entities that overlap with subject entities seen in the NQ training set, used to produce Fig. 1 of the paper (specifically for train-ent).
  • one-off/similar contains the training/development splits for the syntactically different but symantically equal question sets, used for our analyses in Section 4.1 (specifically the similar rows). Again, we do not use the testing split in our analysis. These questions are identical to one-off/no-overlap/ but use a different question template.

Retrieving DPR Results

Our analysis is based on a previous version of the DPR repository (specifically the Oct. 5 version w. hash 27a8436b070861e2fff481e37244009b48c29c09), so our commands may not be up-to-date with the March 2021 release. That said, most of the commands should be clearly transferable.

First, we recommend following the setup guide from the official DPR repository. Once set up, you can download the relevant pre-trained models/indices using their download_data.py script. For our analysis, we used the DPR-NQ model and the DPR-Multi model. To run retrieval using a pre-trained model, you'll minimally need to download:

  1. The pre-trained model
  2. The Wikipedia passage splits
  3. The encoded Wikipedia passage FAISS index
  4. A question/answer dataset

With this, you can use the following python command:

python dense_retriever.py \
    --batch_size 512 \
    --model_file "path/to/pretrained/model/file.cp" \
    --qa_file "path/to/qa/dataset/to/evaluate.json" \
    --ctx_file "path/to/wikipedia/passage/splits.tsv" \
    --encoded_ctx_file "path/to/encoded/wikipedia/passage/index/" \
    --save_or_load_index \
    --n-docs 100 \
    --validation_workers 1 \
    --out_file "path/to/desired/output/location.json"

We had access to a single 11Gb Nvidia RTX 2080Ti GPU w. 128G of RAM when running DPR retrieval.

Retrieving BM25 Results

We use the Pyserini implementation of BM25 for our analysis. We use the default settings and index on the same passage splits downloaded from the DPR repository. We include steps to re-create our BM25 results below.

First, we need to pre-process the DPR passage splits into the proper format for BM25 indexing. We include this file in bm25/build_bm25_ctx_passages.py. Rather than writing all passages into a single file, you can optionally shard the passages into multiple files (specified by the n_shards argument). It also creates a mapping from the passage ID to the title of the article the passage is from. You can use this file as follows:

python bm25/build_bm25_ctx_passages.py \
    --wiki_passages_file "path/to/wikipedia/passage/splits.tsv" \
    --outdir "path/to/desired/output/directory/" \
    --title_index_path "path/to/desired/output/directory/.json" \
    --n_shards number_of_shards_of_passages_to_write

Now that you have all the passages in files, you can build the BM25 index using the following command:

python -m pyserini.index -collection JsonCollection \
    -generator DefaultLuceneDocumentGenerator \
    -threads 4 \
    -input "path/to/generated/passages/folder/" \
    -index "path/to/desired/index/folder/" \
    -storePositions -storeDocvectors -storeRaw

Once the index is built, you can use it in the bm25/bm25_retriever.py script to get retrieval results for an input file:

python bm25/bm25_retriever.py \
    --index_path "path/to/built/bm25/index/directory/" \
    --passage_id_to_title_path "path/to/title_index_path/from_step_1.json" \
    --input "path/to/input/qa/file.json" \
    --output_dir "path/to/output/directory/"

By default, the script will retrieve 100 passages (--n_docs), use string matching to determine answer presence (--answer_type), and take in .json files (--input_file_type). You can optionally provide a glob using the --glob flag. The script writes the results to the file with the same name as the input file, but in the output directory.

Evaluating Retriever Results

We provide an evaluation script in utils/accuracy.py. The expected format is equivalent to DPR's output format. It either accepts a single file to evaluate, or a glob of multiple files if the --glob option is set. To evaluate a single file, you can use the following command:

python utils/accuracy.py \
    --results "path/to/retrieval/results.json" \
    --k_values 1,5,20,100

or with a glob with:

python utils/accuracy.py \
    --results="path/to/glob*.test.json" \
    --glob \
    --k_values 1,5,20,100

Bugs or Questions?

Feel free to open an issue on this GitHub repository and we'd be happy to answer your questions as best we can!

Citation

If you use our dataset or code in your research, please cite our work:

@inproceedings{sciavolino2021simple,
   title={Simple Entity-centric Questions Challenge Dense Retrievers},
   author={Sciavolino, Christopher and Zhong, Zexuan and Lee, Jinhyuk and Chen, Danqi},
   booktitle={Empirical Methods in Natural Language Processing (EMNLP)},
   year={2021}
}
Owner
Princeton Natural Language Processing
Princeton Natural Language Processing
Allele-specific pipeline for unbiased read mapping(WIP), QTL discovery(WIP), and allelic-imbalance analysis

WASP2 (Currently in pre-development): Allele-specific pipeline for unbiased read mapping(WIP), QTL discovery(WIP), and allelic-imbalance analysis Requ

McVicker Lab 2 Aug 11, 2022
A list of awesome PyTorch scholarship articles, guides, blogs, courses and other resources.

Awesome PyTorch Scholarship Resources A collection of awesome PyTorch and Python learning resources. Contributions are always welcome! Course Informat

Arnas Gečas 302 Dec 03, 2022
Finite difference solution of 2D Poisson equation. Can handle Dirichlet, Neumann and mixed boundary conditions.

Poisson-solver-2D Finite difference solution of 2D Poisson equation Current version can handle Dirichlet, Neumann, and mixed (combination of Dirichlet

Mohammad Asif Zaman 34 Dec 23, 2022
Graph Self-Attention Network for Learning Spatial-Temporal Interaction Representation in Autonomous Driving

GSAN Introduction Code for paper GSAN: Graph Self-Attention Network for Learning Spatial-Temporal Interaction Representation in Autonomous Driving, wh

YE Luyao 6 Oct 27, 2022
SphereFace: Deep Hypersphere Embedding for Face Recognition

SphereFace: Deep Hypersphere Embedding for Face Recognition By Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj and Le Song License SphereFa

Weiyang Liu 1.5k Dec 29, 2022
CTF Challenge for CSAW Finals 2021

Terminal Velocity Misc CTF Challenge for CSAW Finals 2021 This is a challenge I've had in mind for almost 15 years and never got around to building un

Jordan 6 Jul 30, 2022
Taking A Closer Look at Domain Shift: Category-level Adversaries for Semantics Consistent Domain Adaptation

Taking A Closer Look at Domain Shift: Category-level Adversaries for Semantics Consistent Domain Adaptation (CVPR2019) This is a pytorch implementatio

Yawei Luo 280 Jan 01, 2023
Spam your friends and famly and when you do your famly will disown you and you will have no friends.

SpamBot9000 Spam your friends and family and when you do your family will disown you and you will have no friends. Terms of Use Disclaimer: Please onl

DJ15 0 Jun 09, 2022
Official repository for ABC-GAN

ABC-GAN The work represented in this repository is the result of a 14 week semesterthesis on photo-realistic image generation using generative adversa

IgorSusmelj 10 Jun 23, 2022
A GOOD REPRESENTATION DETECTS NOISY LABELS

A GOOD REPRESENTATION DETECTS NOISY LABELS This code is a PyTorch implementation of the paper: Prerequisites Python 3.6.9 PyTorch 1.7.1 Torchvision 0.

<a href=[email protected]"> 64 Jan 04, 2023
Rule-based Customer Segmentation

Rule-based Customer Segmentation Business Problem A game company wants to create level-based new customer definitions (personas) by using some feature

Cem Çaluk 2 Jan 03, 2022
Brain tumor detection using Convolution-Neural Network (CNN)

Detect and Classify Brain Tumor using CNN. A system performing detection and classification by using Deep Learning Algorithms using Convolution-Neural Network (CNN).

assia 1 Feb 07, 2022
Towards Flexible Blind JPEG Artifacts Removal (FBCNN, ICCV 2021)

Towards Flexible Blind JPEG Artifacts Removal (FBCNN, ICCV 2021)

Jiaxi Jiang 282 Jan 02, 2023
Yolo Traffic Light Detection With Python

Yolo-Traffic-Light-Detection This project is based on detecting the Traffic light. Pretained data is used. This application entertained both real time

Ananta Raj Pant 2 Aug 08, 2022
🤖 Project template for your next awesome AI project. 🦾

🤖 AI Awesome Project Template 👋 Template author You may want to adjust badge links in a README.md file. 💎 Installation with pip Installation is as

Wiktor Łazarski 18 Nov 23, 2022
Codebase for INVASE: Instance-wise Variable Selection - 2019 ICLR

Codebase for "INVASE: Instance-wise Variable Selection" Authors: Jinsung Yoon, James Jordon, Mihaela van der Schaar Paper: Jinsung Yoon, James Jordon,

Jinsung Yoon 50 Nov 11, 2022
ViDT: An Efficient and Effective Fully Transformer-based Object Detector

ViDT: An Efficient and Effective Fully Transformer-based Object Detector by Hwanjun Song1, Deqing Sun2, Sanghyuk Chun1, Varun Jampani2, Dongyoon Han1,

NAVER AI 262 Dec 27, 2022
Özlem Taşkın 0 Feb 23, 2022
Breaching - Breaching privacy in federated learning scenarios for vision and text

Breaching - A Framework for Attacks against Privacy in Federated Learning This P

Jonas Geiping 139 Jan 03, 2023
Implementation of "DeepOrder: Deep Learning for Test Case Prioritization in Continuous Integration Testing".

DeepOrder Implementation of DeepOrder for the paper "DeepOrder: Deep Learning for Test Case Prioritization in Continuous Integration Testing". Project

6 Nov 07, 2022