This repository contains the code, models and datasets discussed in our paper "Few-Shot Question Answering by Pretraining Span Selection"

Overview

Splinter

This repository contains the code, models and datasets discussed in our paper "Few-Shot Question Answering by Pretraining Span Selection", to appear at ACL 2021.

Our pretraining code is based on TensorFlow (checked on 1.15), while fine-tuning is based on PyTorch (1.7.1) and Transformers (2.9.0). Note each has its own requirement file: pretraining/requirements.txt and finetuning/requirements.txt.

Data

Downloading Few-Shot MRQA Splits

curl -L https://www.dropbox.com/sh/pfg8j6yfpjltwdx/AAC8Oky0w8ZS-S3S5zSSAuQma?dl=1 > mrqa-few-shot.zip
unzip mrqa-few-shot.zip -d mrqa-few-shot

Pretrained Model

Command for downloading Splinter
curl -L https://www.dropbox.com/sh/h63xx2l2fjq8bsz/AAC5_Z_F2zBkJgX87i3IlvGca?dl=1 > splinter.zip
unzip splinter.zip -d splinter 

Pretraining

Create a virtual environment and execute

cd pretraining
pip install -r requirements.txt  # or requirements-gpu.txt for a GPU version

Then download the raw data (our pretraining was based on Wikipedia and BookCorpus). We support two data formats:

  • For wiki, a tag starts a new article and a ends it.
  • For BookCorpus, we process an already-tokenized file where tokens are separated by whitespaces. Newlines stands for a new book.
Command for creating the pretraining data

This command takes as input a set of files ($INPUT_PATTERN) and creates a tensorized dataset for pretraining. It supports the following masking schemes:

Command for creating the data for Splinter (recurring span selection)
cd pretraining
python create_pretraining_data.py \
    --input_file=$INPUT_PATTERN \
    --output_dir=$OUTPUT_DIR \
    --vocab_file=vocabs/bert-cased-vocab.txt \
    --do_lower_case=False \
    --do_whole_word_mask=False \
    --max_seq_length=512 \
    --num_processes=63 \
    --dupe_factor=5 \
    --max_span_length=10 \
    --recurring_span_selection=True \
    --only_recurring_span_selection=True \
    --max_questions_per_seq=30

n-gram statistics are written to ngrams.txt in the output directory.

Command for pretraining Splinter
cd pretraining
python run_pretraining.py \
    --bert_config_file=configs/bert-base-cased-config.json \
    --input_file=$INPUT_FILE \
    --output_dir=$OUTPUT_DIR \
    --max_seq_length=512 \
    --recurring_span_selection=True \
    --only_recurring_span_selection=True \
    --max_questions_per_seq=30 \
    --do_train \
    --train_batch_size=256 \
    --learning_rate=1e-4 \
    --num_train_steps=2400000 \
    --num_warmup_steps=10000 \
    --save_checkpoints_steps=10000 \
    --keep_checkpoint_max=240 \
    --use_tpu \
    --num_tpu_cores=8 \
    --tpu_name=$TPU_NAME

This can be trained using GPUs by dropping the use_tpu flag (although it was tested mainly on TPUs).

Convert TensorFlow Model to PyTorch

In order to fine-tune the TF model you pretrained with run_pretraining.py, you will first need to convert it to PyTorch. You can do so by

cd model_conversion
pip install -r requirements.txt
python convert_tf_to_pytorch.py --tf_checkpoint_path $TF_MODEL_PATH --pytorch_dump_path $OUTPUT_PATH

Fine-tuning

Fine-tuning has different requirements than pretraining, as it uses HuggingFace's Transformers library. Create a virtual environment and execute

cd finetuning
pip install -r requirements.txt

Please Note: If you want to reproduce results from the paper or run with a QASS head in genral, questions need to be augmented with a [QUESTION] token. In order to do so, please run

cd finetuning
python qass_preprocess.py --path "../mrqa-few-shot/*/*.jsonl"

This will add a [MASK] token to each question in the training data, which will later be replaced by a [QUESTION] token automatically by the QASS layer implementation.

Then fine-tune Splinter by

cd finetuning
export MODEL="../splinter"
export OUTPUT_DIR="output"
python run_mrqa.py \
    --model_type=bert \
    --model_name_or_path=$MODEL \
    --qass_head=True \
    --tokenizer_name=$MODEL \
    --output_dir=$OUTPUT_DIR \
    --train_file="../mrqa-few-shot/squad/squad-train-seed-42-num-examples-16_qass.jsonl" \
    --predict_file="../mrqa-few-shot/squad/dev_qass.jsonl" \
    --do_train \
    --do_eval \
    --max_seq_length=384 \
    --doc_stride=128 \
    --threads=4 \
    --save_steps=50000 \
    --per_gpu_train_batch_size=12 \
    --per_gpu_eval_batch_size=16 \
    --learning_rate=3e-5 \
    --max_answer_length=10 \
    --warmup_ratio=0.1 \
    --min_steps=200 \
    --num_train_epochs=10 \
    --seed=42 \
    --use_cache=False \
    --evaluate_every_epoch=False 

In order to train with automatic mixed precision, install apex and add the --fp16 flag.

See an example script for fine-tuning SpanBERT (rather than Splinter) here.

Citation

If you find this work helpful, please cite us

@inproceedings{ram-etal-2021-shot,
    title = "Few-Shot Question Answering by Pretraining Span Selection",
    author = "Ram, Ori  and
      Kirstain, Yuval  and
      Berant, Jonathan  and
      Globerson, Amir  and
      Levy, Omer",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.acl-long.239",
    pages = "3066--3079",
}

Acknowledgements

We would like to thank the European Research Council (ERC) for funding the project, and to Google’s TPU Research Cloud (TRC) for their support in providing TPUs.

Owner
Ori Ram
PhD Candidate at Tel Aviv University, focusing on NLP and Machine Learning
Ori Ram
GAP-text2SQL: Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training

GAP-text2SQL: Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training Code and model from our AAAI 2021 paper

Amazon Web Services - Labs 83 Jan 09, 2023
CCF BDCI BERT系统调优赛题baseline(Pytorch版本)

CCF BDCI BERT系统调优赛题baseline(Pytorch版本) 此版本基于Pytorch后端的huggingface进行实现。由于此实现使用了Oneflow的dataloader作为数据读入的方式,因此也需要安装Oneflow。其它框架的数据读取可以参考OneflowDataloade

Ziqi Zhou 9 Oct 13, 2022
PyTorch implementation of Tacotron speech synthesis model.

tacotron_pytorch PyTorch implementation of Tacotron speech synthesis model. Inspired from keithito/tacotron. Currently not as much good speech quality

Ryuichi Yamamoto 279 Dec 09, 2022
This repository implements a brute-force spellchecker utilizing the Damerau-Levenshtein edit distance.

About spellchecker.py Implementing a highly-accurate, brute-force, and dynamically programmed spellchecking program that utilizes the Damerau-Levensht

Raihan Ahmed 1 Dec 11, 2021
GPT-Code-Clippy (GPT-CC) is an open source version of GitHub Copilot, a language model

GPT-Code-Clippy (GPT-CC) is an open source version of GitHub Copilot, a language model -- based on GPT-3, called GPT-Codex -- that is fine-tuned on publicly available code from GitHub.

Nathan Cooper 2.3k Jan 01, 2023
ETM - R package for Topic Modelling in Embedding Spaces

ETM - R package for Topic Modelling in Embedding Spaces This repository contains an R package called topicmodels.etm which is an implementation of ETM

bnosac 37 Nov 06, 2022
A linter to manage all your python exceptions and try/except blocks (limited only for those who like dinosaurs).

Manage your exceptions in Python like a PRO Currently in BETA. Inspired by this blog post. I shared the building process of this tool here. “For those

Guilherme Latrova 353 Dec 31, 2022
Conditional Transformer Language Model for Controllable Generation

CTRL - A Conditional Transformer Language Model for Controllable Generation Authors: Nitish Shirish Keskar, Bryan McCann, Lav Varshney, Caiming Xiong,

Salesforce 1.7k Dec 28, 2022
Code for the project carried out fulfilling the course requirements for Fall 2021 NLP at NYU

Introduction Fairseq(-py) is a sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization,

Sai Himal Allu 1 Apr 25, 2022
Extract rooms type, door, neibour rooms, rooms corners nad bounding boxes, and generate graph from rplan dataset

Housegan-data-reader House-GAN++ (data-reader) Code and instructions for converting rplan dataset (raster images) to housegan++ data format. House-GAN

Sepid Hosseini 13 Nov 24, 2022
🛸 Use pretrained transformers like BERT, XLNet and GPT-2 in spaCy

spacy-transformers: Use pretrained transformers like BERT, XLNet and GPT-2 in spaCy This package provides spaCy components and architectures to use tr

Explosion 1.2k Jan 08, 2023
ThinkTwice: A Two-Stage Method for Long-Text Machine Reading Comprehension

ThinkTwice ThinkTwice is a retriever-reader architecture for solving long-text machine reading comprehension. It is based on the paper: ThinkTwice: A

Walle 4 Aug 06, 2021
Code-autocomplete, a code completion plugin for Python

Code AutoComplete code-autocomplete, a code completion plugin for Python.

xuming 13 Jan 07, 2023
SAINT PyTorch implementation

SAINT-pytorch A Simple pyTorch implementation of "Towards an Appropriate Query, Key, and Value Computation for Knowledge Tracing" based on https://arx

Arshad Shaikh 63 Dec 25, 2022
Python library for Serbian Natural language processing (NLP)

SrbAI - Python biblioteka za procesiranje srpskog jezika SrbAI je projekat prikupljanja algoritama i modela za procesiranje srpskog jezika u jedinstve

Serbian AI Society 3 Nov 22, 2022
A demo of chinese asr

chinese_asr_demo 一个端到端的中文语音识别模型训练、测试框架 具备数据预处理、模型训练、解码、计算wer等等功能 训练数据 训练数据采用thchs_30,

4 Dec 09, 2021
This is the 25 + 1 year anniversary version of the 1995 Rachford-Rice contest

Rachford-Rice Contest This is the 25 + 1 year anniversary version of the 1995 Rachford-Rice contest. Can you solve the Rachford-Rice problem for all t

13 Sep 20, 2022
The implementation of Parameter Differentiation based Multilingual Neural Machine Translation

The implementation of Parameter Differentiation based Multilingual Neural Machine Translation .

Qian Wang 21 Dec 17, 2022
PyTorch implementation of the paper: Text is no more Enough! A Benchmark for Profile-based Spoken Language Understanding

Text is no more Enough! A Benchmark for Profile-based Spoken Language Understanding This repository contains the official PyTorch implementation of th

Xiao Xu 26 Dec 14, 2022