Binary Passage Retriever (BPR) - an efficient passage retriever for open-domain question answering

Last update: Dec 07, 2022

Related tags

Overview

BPR

Binary Passage Retriever (BPR) is an efficient neural retrieval model for open-domain question answering. BPR integrates a learning-to-hash technique into Dense Passage Retriever (DPR) to represent the passage embeddings using compact binary codes rather than continuous vectors. It substantially reduces the memory size without a loss of accuracy tested on Natural Questions and TriviaQA datasets.

BPR was originally developed to improve the computational efficiency of the Sōseki question answering system submitted to the Systems under 6GB track in the NeurIPS 2020 EfficientQA competition. Please refer to our ACL 2021 paper for further technical details.

Installation

BPR can be installed using Poetry:

poetry install

The virtual environment automatically created by Poetry can be activated by poetry shell.

Alternatively, you can install required libraries using pip:

pip install -r requirements.txt

Trained Models

(coming soon)

Reproducing Experiments

Before you start, you need to download the datasets available on the DPR website into <DPR_DATASET_DIR>.

The experimental results on the Natural Questions dataset can be reproduced by running the commands provided in this section. We used a server with 8 NVIDIA Tesla V100 GPUs with 16GB memory in the experiments. The results on the TriviaQA dataset can be reproduced by changing the file names of the input dataset to the corresponding ones (e.g., nq-train.json -> trivia-train.json).

1. Building passage database

python build_passage_db.py \
    --passage_file=<DPR_DATASET_DIR>/wikipedia_split/psgs_w100.tsv \
    --output_file=<PASSAGE_DB_FILE>

2. Training BPR

python train_biencoder.py \
   --gpus=8 \
   --distributed_backend=ddp \
   --train_file=<DPR_DATASET_DIR>/retriever/nq-train.json \
   --eval_file=<DPR_DATASET_DIR>/retriever/nq-dev.json \
   --gradient_clip_val=2.0 \
   --max_epochs=40 \
   --binary

3. Building passage embeddings

python generate_embeddings.py \
   --biencoder_file=<BPR_CHECKPOINT_FILE> \
   --output_file=<EMBEDDING_FILE> \
   --passage_db_file=<PASSAGE_DB_FILE> \
   --batch_size=4096 \
   --parallel

4. Evaluating BPR

python evaluate_retriever.py \
    --binary_k=1000 \
    --biencoder_file=<BPR_CHECKPOINT_FILE> \
    --embedding_file=<EMBEDDING_FILE> \
    --passage_db_file=<PASSAGE_DB_FILE> \
    --qa_file=<DPR_DATASET_DIR>/retriever/qas/nq-test.csv \
    --parallel

5. Creating dataset for reader

python evaluate_retriever.py \
    --binary_k=1000 \
    --biencoder_file=<BPR_CHECKPOINT_FILE> \
    --embedding_file=<EMBEDDING_FILE> \
    --passage_db_file=<PASSAGE_DB_FILE> \
    --qa_file=<DPR_DATASET_DIR>/retriever/qas/nq-train.csv \
    --output_file=<READER_TRAIN_FILE> \
    --top_k=200 \
    --parallel

python evaluate_retriever.py \
    --binary_k=1000 \
    --biencoder_file=<BPR_CHECKPOINT_FILE> \
    --embedding_file=<EMBEDDING_FILE> \
    --passage_db_file=<PASSAGE_DB_FILE> \
    --qa_file=<DPR_DATASET_DIR>/retriever/qas/nq-dev.csv \
    --output_file=<READER_DEV_FILE> \
    --top_k=200 \
    --parallel

python evaluate_retriever.py \
    --binary_k=1000 \
    --biencoder_file=<BPR_CHECKPOINT_FILE> \
    --embedding_file=<EMBEDDING_FILE> \
    --passage_db_file=<PASSAGE_DB_FILE> \
    --qa_file==<DPR_DATASET_DIR>/retriever/qas/nq-test.csv \
    --output_file=<READER_TEST_FILE> \
    --top_k=200 \
    --parallel

6. Training reader

python train_reader.py \
   --gpus=8 \
   --distributed_backend=ddp \
   --train_file=<READER_TRAIN_FILE> \
   --validation_file=<READER_DEV_FILE> \
   --test_file=<READER_TEST_FILE> \
   --learning_rate=2e-5 \
   --max_epochs=20 \
   --accumulate_grad_batches=4 \
   --nq_gold_train_file=<DPR_DATASET_DIR>/gold_passages_info/nq_train.json \
   --nq_gold_validation_file=<DPR_DATASET_DIR>/gold_passages_info/nq_dev.json \
   --nq_gold_test_file=<DPR_DATASET_DIR>/gold_passages_info/nq_test.json \
   --train_batch_size=1 \
   --eval_batch_size=2 \
   --gradient_clip_val=2.0

7. Evaluating reader

python evaluate_reader.py \
    --gpus=8 \
    --distributed_backend=ddp \
    --checkpoint_file=<READER_CHECKPOINT_FILE> \
    --eval_batch_size=1

License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Citation

If you find this work useful, please cite the following paper:

@inproceedings{yamada2021bpr,
  title={Efficient Passage Retrieval with Hashing for Open-domain Question Answering},
  author={Ikuya Yamada and Akari Asai and Hannaneh Hajishirzi},
  booktitle={ACL},
  year={2021}
}

Binary Passage Retriever (BPR) - an efficient passage retriever for open-domain question answering

Related tags

Overview

BPR

Installation

Trained Models

Reproducing Experiments

License

Citation

Owner

Studio Ousia

OREO: Object-Aware Regularization for Addressing Causal Confusion in Imitation Learning (NeurIPS 2021)

Exploring Classification Equilibrium in Long-Tailed Object Detection, ICCV2021

Code for classifying international patents based on the text of their titles/abstracts

This is the pytorch implementation for the paper: Learning Accurate Performance Predictors for Ultrafast Automated Model Compression, which is in submission to TPAMI

This repository is all about spending some time the with the original problem posed by Minsky and Papert

Optimizing DR with hard negatives and achieving SOTA first-stage retrieval performance on TREC DL Track (SIGIR 2021 Full Paper).

Repository for XLM-T, a framework for evaluating multilingual language models on Twitter data

A quantum game modeling of pandemic (QHack 2022)

Code for Iso-Points: Optimizing Neural Implicit Surfaces with Hybrid Representations

Simple tutorials on Pytorch DDP training

Code repository for the paper Computer Vision User Entity Behavior Analytics

Notepy is a full-featured Notepad Python app

Reimplementation of the paper "Attention, Learn to Solve Routing Problems!" in jax/flax.

Trading environnement for RL agents, backtesting and training.

Compositional Sketch Search

Unsupervised Representation Learning by Invariance Propagation

[SIGGRAPH 2021 Asia] DeepVecFont: Synthesizing High-quality Vector Fonts via Dual-modality Learning

JDet is Object Detection Framework based on Jittor.

A Pytorch implement of paper "Anomaly detection in dynamic graphs via transformer" (TADDY).

Implementation for the "Surface Reconstruction from 3D Line Segments" paper.

Binary Passage Retriever (BPR) - an efficient passage retriever for open-domain question answering

Related tags

Overview

BPR

Installation

Trained Models

Reproducing Experiments

License

Citation

Owner

Studio Ousia

OREO: Object-Aware Regularization for Addressing Causal Confusion in Imitation Learning (NeurIPS 2021)

Exploring Classification Equilibrium in Long-Tailed Object Detection, ICCV2021

Code for classifying international patents based on the text of their titles/abstracts

This is the pytorch implementation for the paper: *Learning Accurate Performance Predictors for Ultrafast Automated Model Compression*, which is in submission to TPAMI

This repository is all about spending some time the with the original problem posed by Minsky and Papert

Optimizing DR with hard negatives and achieving SOTA first-stage retrieval performance on TREC DL Track (SIGIR 2021 Full Paper).

Repository for XLM-T, a framework for evaluating multilingual language models on Twitter data

A quantum game modeling of pandemic (QHack 2022)

Code for Iso-Points: Optimizing Neural Implicit Surfaces with Hybrid Representations

Simple tutorials on Pytorch DDP training

Code repository for the paper Computer Vision User Entity Behavior Analytics

Notepy is a full-featured Notepad Python app

Reimplementation of the paper "Attention, Learn to Solve Routing Problems!" in jax/flax.

Trading environnement for RL agents, backtesting and training.

Compositional Sketch Search

Unsupervised Representation Learning by Invariance Propagation

[SIGGRAPH 2021 Asia] DeepVecFont: Synthesizing High-quality Vector Fonts via Dual-modality Learning

JDet is Object Detection Framework based on Jittor.

A Pytorch implement of paper "Anomaly detection in dynamic graphs via transformer" (TADDY).

Implementation for the "Surface Reconstruction from 3D Line Segments" paper.

This is the pytorch implementation for the paper: Learning Accurate Performance Predictors for Ultrafast Automated Model Compression, which is in submission to TPAMI