ThinkTwice: A Two-Stage Method for Long-Text Machine Reading Comprehension

Overview

ThinkTwice

ThinkTwice is a retriever-reader architecture for solving long-text machine reading comprehension. It is based on the paper: ThinkTwice: A Two-Stage Method for Long-Text Machine Reading Comprehension. Authors are Mengxing Dong, Bowei Zou, Jin Qian, Rongtao Huang and Yu Hong from Soochow University and Institute for Infocomm Research. The paper will be published in NLPCC 2021 soon.

Contents

Background

Our idea is mainly inspired by the way humans think: We first read a lengthy document and remain several slices which are important to our task in our mind; then we are gonna capture the final answer within this limited information.

The goals for this repository are:

  1. A complete code for NewsQA. This repo offers an implement for dealing with long text MRC dataset NewsQA; you can also try this method on other datsets like TriviaQA, Natural Questions yourself.
  2. A comparison description. The performance on ThinkTwice has been listed in the paper.
  3. A public space for advice. You are welcomed to propose an issue in this repo.

Requirements

Clone this repo at your local server. Install necessary libraries listed below.

git clone [email protected]:Walle1493/ThinkTwice.git
pip install requirements.txt

You may install several libraries on yourself.

Dataset

You need to prepare data in a squad2-like format. Since NewsQA (click here seeing more) is similar to SQuAD-2.0, we don't offer the script in this repo. The demo data format is showed below:

"version": "1",
"data": [
    {
        "type": "train",
        "title": "./cnn/stories/42d01e187213e86f5fe617fe32e716ff7fa3afc4.story",
        "paragraphs": [
            {
                "context": "NEW DELHI, India (CNN) -- A high court in northern India on Friday acquitted a wealthy...",
                "qas": [
                    {
                        "question": "What was the amount of children murdered?",
                        "id": "./cnn/stories/42d01e187213e86f5fe617fe32e716ff7fa3afc4.story01",
                        "answers": [
                            {
                                "answer_start": 294,
                                "text": "19"
                            }
                        ],
                        "is_impossible": false
                    },
                    {
                        "question": "When was Pandher sentenced to death?",
                        "id": "./cnn/stories/42d01e187213e86f5fe617fe32e716ff7fa3afc4.story02",
                        "answers": [
                            {
                                "answer_start": 261,
                                "text": "February"
                            }
                        ],
                        "is_impossible": false
                    }
                ]
            }
        ]
    }
]

P.S.: You are supposed to make a change when dealing with other datasets like TriviaQA or Natural Questions, because we split passages by '\n' character in NewsQA, while not all the same in other datasets.

Train

The training step (including test module) depends mainly on these parameters. We trained our two-stage model on 4 GPUs with 12G 1080Ti in 60 hours.

python code/main.py \
  --do_train \
  --do_eval \
  --eval_test \
  --model bert-base-uncased \
  --train_file ~/Data/newsqa/newsqa-squad2-dataset/squad-newsqa-train.json \
  --dev_file ~/Data/newsqa/newsqa-squad2-dataset/squad-newsqa-dev.json \
  --test_file ~/Data/newsqa/newsqa-squad2-dataset/squad-newsqa-test.json \
  --train_batch_size 256 \
  --train_batch_size_2 24 \
  --eval_batch_size 32  \
  --learning_rate 2e-5 \
  --num_train_epochs 1 \
  --num_train_epochs_2 3 \
  --max_seq_length 128 \
  --max_seq_length_2 512 \
  --doc_stride 128 \
  --eval_metric best_f1 \
  --output_dir outputs/newsqa/retr \
  --output_dir_2 outputs/newsqa/read \
  --data_binary_dir data_binary/retr \
  --data_binary_dir_2 data_binary/read \
  --version_2_with_negative \
  --do_lower_case \
  --top_k 5 \
  --do_preprocess \
  --do_preprocess_2 \
  --first_stage \

In order to improve efficiency, we store data and model generated during training in a binary format. Specifically, when you switch on do_preprocess, the converted data in the first stage will be stored in the directory data_binary, next time you can switch off this option to directly load data. As well, do_preprocess aims at the data in the second stage, and first_stage is for the retriever model. The model and metrics result can be found in the directory output/newsqa after training.

License

Soochow University © Mengxing Dong

Owner
Walle
Walle
The FinQA dataset from paper: FinQA: A Dataset of Numerical Reasoning over Financial Data

Data and code for EMNLP 2021 paper "FinQA: A Dataset of Numerical Reasoning over Financial Data"

Zhiyu Chen 114 Dec 29, 2022
PyTorch Implementation of "Non-Autoregressive Neural Machine Translation"

Non-Autoregressive Transformer Code release for Non-Autoregressive Neural Machine Translation by Jiatao Gu, James Bradbury, Caiming Xiong, Victor O.K.

Salesforce 261 Nov 12, 2022
CATs: Semantic Correspondence with Transformers

CATs: Semantic Correspondence with Transformers For more information, check out the paper on [arXiv]. Training with different backbones and evaluation

74 Dec 10, 2021
Get list of common stop words in various languages in Python

Python Stop Words Table of contents Overview Available languages Installation Basic usage Python compatibility Overview Get list of common stop words

Alireza Savand 142 Dec 21, 2022
Interpretable Models for NLP using PyTorch

This repo is deprecated. Please find the updated package here. https://github.com/EdGENetworks/anuvada Anuvada: Interpretable Models for NLP using PyT

Sandeep Tammu 19 Dec 17, 2022
Transformer-based Text Auto-encoder (T-TA) using TensorFlow 2.

T-TA (Transformer-based Text Auto-encoder) This repository contains codes for Transformer-based Text Auto-encoder (T-TA, paper: Fast and Accurate Deep

Jeong Ukjae 13 Dec 13, 2022
Code for papers "Generation-Augmented Retrieval for Open-Domain Question Answering" and "Reader-Guided Passage Reranking for Open-Domain Question Answering", ACL 2021

This repo provides the code of the following papers: (GAR) "Generation-Augmented Retrieval for Open-domain Question Answering", ACL 2021 (RIDER) "Read

morning 49 Dec 26, 2022
This is an incredibly powerful calculator that is capable of many useful day-to-day functions.

Description 💻 This is an incredibly powerful calculator that is capable of many useful day-to-day functions. Such functions include solving basic ari

Jordan Leich 37 Nov 19, 2022
Paddlespeech Streaming ASR GUI

Paddlespeech-Streaming-ASR-GUI Introduction A paddlespeech Streaming ASR GUI. Us

Niek Zhen 3 Jan 05, 2022
Understand Text Summarization and create your own summarizer in python

Automatic summarization is the process of shortening a text document with software, in order to create a summary with the major points of the original document. Technologies that can make a coherent

Sreekanth M 1 Oct 18, 2022
Python library for parsing resumes using natural language processing and machine learning

CVParser Python library for parsing resumes using natural language processing and machine learning. Setup Installation on Linux and Mac OS Follow the

nafiu 0 Jul 29, 2021
translate using your voice

speech-to-text-translator Usage translate using your voice description this project makes translating a word easy, all you have to do is speak and...

1 Oct 18, 2021
Creating a python chatbot that Starbucks users can text to place an order + help cut wait time of a normal coffee.

Creating a python chatbot that Starbucks users can text to place an order + help cut wait time of a normal coffee.

2 Jan 20, 2022
Web Scraping, Document Deduplication & GPT-2 Fine-tuning with a newly created scam dataset.

Web Scraping, Document Deduplication & GPT-2 Fine-tuning with a newly created scam dataset.

18 Nov 28, 2022
State of the art faster Natural Language Processing in Tensorflow 2.0 .

tf-transformers: faster and easier state-of-the-art NLP in TensorFlow 2.0 ****************************************************************************

74 Dec 05, 2022
Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities

Hiring We are hiring at all levels (including FTE researchers and interns)! If you are interested in working with us on NLP and large-scale pre-traine

Microsoft 7.8k Jan 09, 2023
PG-19 Language Modelling Benchmark

PG-19 Language Modelling Benchmark This repository contains the PG-19 language modeling benchmark. It includes a set of books extracted from the Proje

DeepMind 161 Oct 30, 2022
This repository contains the code for running the character-level Sandwich Transformers from our ACL 2020 paper on Improving Transformer Models by Reordering their Sublayers.

Improving Transformer Models by Reordering their Sublayers This repository contains the code for running the character-level Sandwich Transformers fro

Ofir Press 53 Sep 26, 2022
Conversational text Analysis using various NLP techniques

Conversational text Analysis using various NLP techniques

Rita Anjana 159 Jan 06, 2023
Generate text line images for training deep learning OCR model (e.g. CRNN)

Generate text line images for training deep learning OCR model (e.g. CRNN)

532 Jan 06, 2023