Code for papers "Generation-Augmented Retrieval for Open-Domain Question Answering" and "Reader-Guided Passage Reranking for Open-Domain Question Answering", ACL 2021

GAR
Overview

This repo provides the code for the following papers:

(GAR) "Generation-Augmented Retrieval for Open-Domain Question Answering", ACL 2021.

(RIDER) "Reader-Guided Passage Reranking for Open-Domain Question Answering", Findings of ACL 2021.

GAR augments a question with relevant contexts generated by a seq2seq model. The model takes the question as input and is trained to produce target outputs such as the answer, the sentence that contains the answer, and the title of a passage that contains the answer. With the generated contexts appended to the original question, GAR achieves state-of-the-art OpenQA performance with a simple BM25 retriever.
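The query expansion step described above can be sketched as follows. This is only an illustration, not the code in this repo: the function name `augment_query`, the space separator, and the de-duplication step are assumptions made for the example.

```python
def augment_query(question, generated_contexts, sep=" "):
    """Append generated contexts to the question to form the retrieval query."""
    # Drop empty generations and exact duplicates while preserving order
    # (an assumption of this sketch, not a documented GAR step).
    seen = set()
    kept = []
    for ctx in generated_contexts:
        ctx = ctx.strip()
        if ctx and ctx not in seen:
            seen.add(ctx)
            kept.append(ctx)
    return sep.join([question.strip()] + kept)


# Hypothetical generations: an answer, a sentence containing it, and a title.
query = augment_query(
    "who wrote the novel dracula",
    ["Bram Stoker", "Dracula is an 1897 novel by Bram Stoker.", "Dracula"],
)
print(query)
```

The resulting string is what would be passed to the BM25 retriever in place of the bare question.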

RIDER is a simple and effective passage reranker that reranks retrieved passages using reader predictions, without any training. It improves top-1 retrieval accuracy by 10 to 20 points and Exact Match (EM) by 1 to 4 points, and even outperforms supervised transformer-based rerankers.
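One way to picture training-free reranking by reader predictions is the sketch below: passages that contain one of the reader's top predicted answers are moved to the front, and the original retrieval order is kept within each group. The function names, the `top_n` parameter, and the normalization are assumptions for illustration; see rider/rider.py for the actual procedure.

```python
import re
import string


def normalize(text):
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def rerank_by_reader_predictions(passages, predictions, top_n=1):
    """Move passages containing a top-n predicted answer to the front.

    The relative order inside each group is kept, so the original
    retrieval ranking still decides the order among matching passages.
    """
    answers = [normalize(p) for p in predictions[:top_n]]
    answers = [a for a in answers if a]
    hits, misses = [], []
    for passage in passages:
        text = normalize(passage)
        # Simple substring match on normalized text (an assumption of this sketch).
        if any(a in text for a in answers):
            hits.append(passage)
        else:
            misses.append(passage)
    return hits + misses


passages = [
    "Paris is in France.",
    "Dracula was written by Bram Stoker.",
    "Stoker was Irish.",
    "Bram Stoker, the author.",
]
reranked = rerank_by_reader_predictions(passages, ["Bram Stoker", "Stoker"], top_n=1)
print(reranked[0])
```

The reranked list can then be fed back to the same reader, which is why no extra training is involved.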

Code

Generation

The seq2seq codebase is based on the examples from an older release of huggingface/transformers (version 2.11.0).

See train_gen.yml for the package requirements and example commands to run the models.

train_generator.py: trains the seq2seq models.

conf.py: configuration for train_generator.py. It defines some default parameters, but it may be easier to set options such as --data_dir and --output_dir directly on the command line.

test_generator.py: evaluates the seq2seq models (if evaluation was not already done in train_generator.py).

Retrieval

We use pyserini for BM25 retrieval. Please refer to its documentation for indexing and searching Wikipedia passages (the passages can be downloaded here). Alternatively, take a look at pyserini's effort to reproduce the DPR results, which gives more detailed instructions and incorporates the passage-level span voting used in GAR.

Reranking

Please see the instructions in rider/rider.py.

Reading

We experiment with one extractive reader and one generative reader.

For the extractive reader, we use the one from Dense Passage Retrieval (DPR). Please refer to DPR for more details.

For the generative reader, we reuse the codebase from the generation stage above, with [question; top-retrieved passages] as the source input and one ground-truth answer as the target output. An example script is provided in train_gen.yml.
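As a rough sketch of how a [question; top-retrieved passages] source line could be assembled: the function name `build_reader_source`, the default `top_k`, and the plain space separator are assumptions for this example, and the script in train_gen.yml remains the reference for the actual setup.

```python
def build_reader_source(question, passages, top_k=10, sep=" "):
    """Concatenate the question and the top-k retrieved passages into one line."""
    parts = [question] + list(passages[:top_k])
    # Each example must fit on a single line of the .source file,
    # so collapse newlines and repeated whitespace inside every part.
    return sep.join(" ".join(part.split()) for part in parts)


source_line = build_reader_source(
    "who wrote dracula",
    ["Bram\nStoker wrote it.", "Second passage.", "Third passage."],
    top_k=2,
)
print(source_line)
```

The matching target line would be one ground-truth answer for the same question.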

Data

Please refer to DPR for dataset downloading.

For seq2seq learning, use {train/val/test}.source as the input and {train/val/test}.target as the output, where each line is one example.

In the same folder, save the list of ground-truth answers as {val/test}.target.json if you want to evaluate EM during training.
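The file layout above could be produced with a small helper like the one below. It is a sketch under stated assumptions: the helper name `write_split` is hypothetical, and the JSON schema (one list of acceptable answers per example, in the same order as the lines) is a guess at what "the list of ground-truth answers" means, so check it against the training code before relying on it.

```python
import json
import os
import tempfile


def one_line(text):
    """Collapse whitespace so that every example occupies exactly one line."""
    return " ".join(text.split())


def write_split(data_dir, split, examples, save_answers=False):
    """Write one split as line-aligned .source and .target files.

    `examples` is a list of (source, target, answers) tuples, where
    `answers` holds all ground-truth answers for that example.
    """
    os.makedirs(data_dir, exist_ok=True)
    with open(os.path.join(data_dir, f"{split}.source"), "w", encoding="utf-8") as f:
        f.writelines(one_line(source) + "\n" for source, _, _ in examples)
    with open(os.path.join(data_dir, f"{split}.target"), "w", encoding="utf-8") as f:
        f.writelines(one_line(target) + "\n" for _, target, _ in examples)
    if save_answers:
        # Assumed schema: a JSON list with one list of answers per example.
        path = os.path.join(data_dir, f"{split}.target.json")
        with open(path, "w", encoding="utf-8") as f:
            json.dump([answers for _, _, answers in examples], f)


examples = [
    ("who wrote dracula", "Bram Stoker", ["Bram Stoker", "Stoker"]),
    ("capital of\nfrance", "Paris", ["Paris"]),
]
with tempfile.TemporaryDirectory() as data_dir:
    write_split(data_dir, "val", examples, save_answers=True)
    written = sorted(os.listdir(data_dir))
print(written)
```

The directory written this way is what --data_dir would point to.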

Cite

Please use the following bibtex to cite our papers.

@article{mao2020generation,
  title={Generation-augmented retrieval for open-domain question answering},
  author={Mao, Yuning and He, Pengcheng and Liu, Xiaodong and Shen, Yelong and Gao, Jianfeng and Han, Jiawei and Chen, Weizhu},
  journal={arXiv preprint arXiv:2009.08553},
  year={2020}
}

@article{mao2021reader,
  title={Reader-Guided Passage Reranking for Open-Domain Question Answering},
  author={Mao, Yuning and He, Pengcheng and Liu, Xiaodong and Shen, Yelong and Gao, Jianfeng and Han, Jiawei and Chen, Weizhu},
  journal={arXiv preprint arXiv:2101.00294},
  year={2021}
}
