Code for papers "Generation-Augmented Retrieval for Open-Domain Question Answering" and "Reader-Guided Passage Reranking for Open-Domain Question Answering", ACL 2021

GAR
Overview

This repo provides the code of the following papers:

(GAR) "Generation-Augmented Retrieval for Open-Domain Question Answering", ACL 2021.

(RIDER) "Reader-Guided Passage Reranking for Open-Domain Question Answering", Findings of ACL 2021.

GAR augments a question with relevant contexts generated by a seq2seq model. The model takes the question as input and is trained to produce targets such as the answer, the sentence containing the answer, and the title of a passage that contains the answer. With the generated contexts appended to the original question, GAR achieves state-of-the-art OpenQA performance with a simple BM25 retriever.
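The augmentation step described above can be sketched as follows. This is a minimal illustration, not the repository's implementation: the function name `augment_query`, the space separator, and the example strings are all assumptions made for this sketch.

```python
from typing import Sequence


def augment_query(question: str, generated_contexts: Sequence[str], sep: str = " ") -> str:
    """Append generated contexts to the question to form the retrieval query.

    Hypothetical helper: the separator and the handling of empty contexts
    are assumptions, not taken from the GAR codebase.
    """
    parts = [question.strip()]
    # Skip empty generations so the query does not accumulate stray separators.
    parts.extend(c.strip() for c in generated_contexts if c and c.strip())
    return sep.join(parts)


question = "who wrote the novel frankenstein"
contexts = [
    "Mary Shelley",                                      # hypothetical generated answer
    "Frankenstein is a novel written by Mary Shelley.",  # hypothetical generated sentence
    "Frankenstein",                                      # hypothetical generated passage title
]
query = augment_query(question, contexts)
print(query)
```

The resulting string is what would be passed to the BM25 retriever in place of the bare question.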

RIDER is a simple and effective passage reranker that reranks retrieved passages according to reader predictions, without any training. RIDER achieves gains of 10 to 20 points in top-1 retrieval accuracy and 1 to 4 points in Exact Match (EM), and even outperforms supervised transformer-based rerankers.
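One way to picture reader-guided reranking is the sketch below: passages that contain any of the reader's top predicted answers are moved to the front, and the original order is otherwise preserved. The function names, the `top_n` default, and the normalization and matching rules are assumptions for illustration; see rider/rider.py for the actual procedure.

```python
import re
import string
from typing import List

_PUNCT = set(string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in _PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def rider_rerank(passages: List[str], predictions: List[str], top_n: int = 10) -> List[str]:
    """Move passages containing a top-N predicted answer to the front.

    Hypothetical sketch: matching is done on whole normalized tokens,
    which is an assumption rather than the repository's exact rule.
    """
    answers = [normalize(p) for p in predictions[:top_n]]
    answers = [a for a in answers if a]
    hits, misses = [], []
    for passage in passages:
        padded = f" {normalize(passage)} "
        if any(f" {a} " in padded for a in answers):
            hits.append(passage)
        else:
            misses.append(passage)
    # Relative order within each group is unchanged, so no scores are needed.
    return hits + misses


passages = [
    "Paris is the capital of France.",
    "Mary Shelley wrote Frankenstein in 1818.",
    "Percy Shelley was a poet.",
]
predictions = ["Mary Shelley", "Percy Bysshe Shelley"]  # hypothetical reader outputs
reranked = rider_rerank(passages, predictions, top_n=2)
print(reranked[0])
```

Because the sketch only reorders the list, it needs no training and no extra model beyond the reader that produced the predictions.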

Code

Generation

The seq2seq codebase is based on the examples from an older release of huggingface/transformers (version==2.11.0).

See train_gen.yml for the package requirements and example commands to run the models.

train_generator.py: training of seq2seq models.

conf.py: configurations for train_generator.py. Default values are provided, but it may be easier to set options such as --data_dir and --output_dir directly.

test_generator.py: testing of seq2seq models (if not already done in train_generator.py).

Retrieval

We use pyserini for BM25 retrieval. Please refer to its documentation for indexing and searching wiki passages (wiki passages can be downloaded here). Alternatively, you may take a look at its effort to reproduce DPR results, which gives more detailed instructions and incorporates the passage-level span voting in GAR.

Reranking

Please see the instructions in rider/rider.py.

Reading

We experiment with one extractive reader and one generative reader.

For the extractive reader, we take the one used by dense passage retrieval. Please refer to DPR for more details.

For the generative reader, we reuse the codebase in the generation stage above, with [question; top-retrieved passages] as the source input and one ground-truth answer as the target output. An example script is provided in train_gen.yml.
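As a rough sketch of how a [question; top-retrieved passages] source input could be assembled, consider the helper below. The name `build_reader_source`, the `top_k` default, and the space separator are assumptions for illustration and may differ from what the repository uses.

```python
from typing import Sequence


def build_reader_source(question: str, passages: Sequence[str], top_k: int = 10, sep: str = " ") -> str:
    """Concatenate the question with the top-k retrieved passages.

    Hypothetical helper: whitespace (including newlines) is collapsed so the
    result fits on a single line of a .source file.
    """
    kept = [" ".join(p.split()) for p in passages[:top_k]]
    return sep.join([" ".join(question.split())] + kept)


source = build_reader_source(
    "who wrote the novel frankenstein",
    ["Frankenstein is a novel\nby Mary Shelley.", "Percy Shelley was a poet."],
    top_k=1,
)
print(source)
```

Truncating to `top_k` passages keeps the input within whatever length limit the seq2seq model imposes; the right value depends on the model and is not specified here.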

Data

Please refer to DPR for dataset downloading.

For seq2seq learning, use {train/val/test}.source as the input and {train/val/test}.target as the output, where each line is one example.

In the same folder, save the list of ground-truth answers as {val/test}.target.json if you want to evaluate EM during training.
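The file layout described above could be produced with a helper like the one below. The function name `write_split` and the example data are hypothetical, and the JSON layout (one list of acceptable answers per example, in line order) is an assumption; check the training code for the exact format it expects.

```python
import json
import tempfile
from pathlib import Path
from typing import Sequence, Tuple

Example = Tuple[str, str, Sequence[str]]  # (source, target, all ground-truth answers)


def write_split(data_dir: str, split: str, examples: Sequence[Example]) -> None:
    """Write {split}.source / {split}.target with one example per line.

    For val and test, also write {split}.target.json. The JSON structure
    used here is an assumption, not confirmed by the repository.
    """
    out = Path(data_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(out / f"{split}.source", "w", encoding="utf-8") as src, \
            open(out / f"{split}.target", "w", encoding="utf-8") as tgt:
        for source, target, _ in examples:
            # Collapse whitespace so each example occupies exactly one line.
            src.write(" ".join(source.split()) + "\n")
            tgt.write(" ".join(target.split()) + "\n")
    if split in ("val", "test"):
        with open(out / f"{split}.target.json", "w", encoding="utf-8") as f:
            json.dump([list(answers) for _, _, answers in examples], f, ensure_ascii=False)


examples = [
    ("who wrote the novel frankenstein", "Mary Shelley",
     ["Mary Shelley", "Mary Wollstonecraft Shelley"]),  # hypothetical example
]
with tempfile.TemporaryDirectory() as tmp:
    write_split(tmp, "val", examples)
    source_lines = (Path(tmp) / "val.source").read_text(encoding="utf-8").splitlines()
    target_lines = (Path(tmp) / "val.target").read_text(encoding="utf-8").splitlines()
    answers = json.loads((Path(tmp) / "val.target.json").read_text(encoding="utf-8"))
print(source_lines, target_lines, answers)
```

The directory passed as `data_dir` is the one that would then be given to --data_dir.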

Cite

Please use the following bibtex to cite our papers.

@article{mao2020generation,
  title={Generation-augmented retrieval for open-domain question answering},
  author={Mao, Yuning and He, Pengcheng and Liu, Xiaodong and Shen, Yelong and Gao, Jianfeng and Han, Jiawei and Chen, Weizhu},
  journal={arXiv preprint arXiv:2009.08553},
  year={2020}
}

@article{mao2021reader,
  title={Reader-Guided Passage Reranking for Open-Domain Question Answering},
  author={Mao, Yuning and He, Pengcheng and Liu, Xiaodong and Shen, Yelong and Gao, Jianfeng and Han, Jiawei and Chen, Weizhu},
  journal={arXiv preprint arXiv:2101.00294},
  year={2021}
}
