GooAQ 🥑 : Google Answers to Google Questions!

Related tags

Text Data & NLPgooaq
Overview

GooAQ 🥑 : Google Answers to Google Questions!

This repository contains the code/data accompanying our recent work on long-form question answering.

NOTE This dataset should not be used for any commercial purposes. See the license for the detailed terms.

Data

To get the data, see the data/ directory. Note that the data is stored via git-lfs. If you're cloning the project (git clone [email protected]:allenai/gooaq.git), make sure to also run git lfs pull as well.

Each row of the data file should look like this:

{
  "id": 3339543,
  "question": "what is the difference between collagen and whey protein?",
  "short_answer": null,
  "answer": "The main differences between the amino acid profiles of whey and collagen are that whey contains all 9 essential amino acids, while collagen only has 8. ... Collagen is a fibrous protein found in the skin, cartilage, and bones of animals whereas whey comes from milk.",
  "answer_type": "feat_snip"
}

where the questions question are collected via Google auto-complete.
The answers responses (short_answer and answer) were collected from Google's answer boxes. The answer types (answer_type) are inferred based on the html content of Google's response. Here is the dominant types in the current dataset:

  • feat_snip: explanatory responses; the majoriy the question/responses are of this type.
  • collection: list responses (e.g., steps to accomplish something).
  • knowledge: typically short responses for knowledge seeking questions.
  • unit_conv: questions about converting units.
  • time_conv: questions about converting times.
  • curr_conv: questions about converting currencies.

Here are several more examples from the data:

{
  "id": 5009708,
  "question": "carbon dioxide comprises approximately what percentage of tropospheric gases?",
  "short_answer": "04%",
  "answer": "Carbon dioxide comprise approximately . 04% of tropospheric gases.",
  "answer_type": "feat_snip"
}
{
  "id": 8317711,
  "question": "what is the distance between uranus and earth?",
  "short_answer": "1.7858 billion mi",
  "answer": null,
  "answer_type": "knowledge"
}
{
  "id": 3547745,
  "question": "what is the symbol for the element aluminum?",
  "short_answer": "Al",
  "answer": null,
  "answer_type": "knowledge"
}
{
  "id": 3552841,
  "question": "what is the volume of a 12 oz can?",
  "short_answer": "340.957",
  "answer": null,
  "answer_type": "unit_conv"
}
{
  "id": 1032187,
  "question": "exajoule is how many joules?",
  "short_answer": "1e+18 Joule",
  "answer": null,
  "answer_type": "unit_conv"
}
{
  "id": 610247,
  "question": "are words that start with e?",
  "short_answer": null,
  "answer": "['eager.', 'eagle.', 'eagre.', 'eared.', 'earls.', 'early.', 'earns.', 'earth.']",
  "answer_type": "collection"
}
{
  "id": 1309258,
  "question": "how long does it take to boil a hard egg?",
  "short_answer": null,
  "answer": "['Place your eggs in a single layer on the bottom of your pot and cover with cold water. ... ', 'Over high heat, bring your eggs to a rolling boil.', 'Remove from heat and let stand in water for 10-12 minutes for large eggs. ... ', 'Drain water and immediately run cold water over eggs until cooled.']",
  "answer_type": "collection"
}
{
  "id": 2518757,
  "question": "is ways to lose weight?",
  "short_answer": null,
  "answer": "['Trying intermittent fasting. ... ', 'Tracking your diet and exercise. ... ', 'Eating mindfully. ... ', 'Eating protein for breakfast. ... ', 'Cutting back on sugar and refined carbohydrates. ... ', 'Eating plenty of fiber. ... ', 'Balancing gut bacteria. ... ', \"Getting a good night's sleep.\"]",
  "answer_type": "collection"
}

Baselines

See the scripts for reproducing our T5 baselines, see the experiments/ directory.

Reproducing Human Evaluation

TBD

More reading

See the following paper:

@article{gooaq2021,
  title={Is Your Language Model as Knowledgeable as Google?},
  author={Khashabi, Daniel and Ng, Amos and Khot, Tushar and Sabharwal, Ashish and Hajishirzi, Hannaneh and Callison-Burch, Chris},
  journal={arXiv preprint},
  year={2021}
}
Example code for "Real-World Natural Language Processing"

Real-World Natural Language Processing This repository contains example code for the book "Real-World Natural Language Processing." AllenNLP (2.5.0 or

Masato Hagiwara 303 Dec 17, 2022
Rethinking the Truly Unsupervised Image-to-Image Translation - Official PyTorch Implementation (ICCV 2021)

Rethinking the Truly Unsupervised Image-to-Image Translation (ICCV 2021) Each image is generated with the source image in the left and the average sty

Clova AI Research 436 Dec 27, 2022
Crowd sourced training data for Rasa NLU models

NLU Training Data Crowd-sourced training data for the development and testing of Rasa NLU models. If you're interested in grabbing some data feel free

Rasa 169 Dec 26, 2022
🦆 Contextually-keyed word vectors

sense2vec: Contextually-keyed word vectors sense2vec (Trask et. al, 2015) is a nice twist on word2vec that lets you learn more interesting and detaile

Explosion 1.5k Dec 25, 2022
🎐 a python library for doing approximate and phonetic matching of strings.

jellyfish Jellyfish is a python library for doing approximate and phonetic matching of strings. Written by James Turk James Turk 1.8k Dec 21, 2022

Transformer Based Korean Sentence Spacing Corrector

TKOrrector Transformer Based Korean Sentence Spacing Corrector License Summary This solution is made available under Apache 2 license. See the LICENSE

Paul Hyung Yuel Kim 3 Apr 18, 2022
Outreachy TFX custom component project

Schema Curation Custom Component Outreachy TFX custom component project This repo contains the code for Schema Curation Custom Component made as a par

Robert Crowe 5 Jul 16, 2021
An open source library for deep learning end-to-end dialog systems and chatbots.

DeepPavlov is an open-source conversational AI library built on TensorFlow, Keras and PyTorch. DeepPavlov is designed for development of production re

Neural Networks and Deep Learning lab, MIPT 6k Dec 30, 2022
Translators - is a library which aims to bring free, multiple, enjoyable translation to individuals and students in Python

Translators - is a library which aims to bring free, multiple, enjoyable translation to individuals and students in Python

UlionTse 907 Dec 27, 2022
MiCECo - Misskey Custom Emoji Counter

MiCECo Misskey Custom Emoji Counter Introduction This little script counts custo

7 Dec 25, 2022
Super easy library for BERT based NLP models

Fast-Bert New - Learning Rate Finder for Text Classification Training (borrowed with thanks from https://github.com/davidtvs/pytorch-lr-finder) Suppor

Utterworks 1.8k Dec 27, 2022
Différents programmes créant une interface graphique a l'aide de Tkinter pour simplifier la vie des étudiants.

GP211-Grand-Projet Ce repertoire contient tout les programmes nécessaires au bon fonctionnement de notre projet-logiciel. Cette interface graphique es

1 Dec 21, 2021
Code associated with the "Data Augmentation using Pre-trained Transformer Models" paper

Data Augmentation using Pre-trained Transformer Models Code associated with the Data Augmentation using Pre-trained Transformer Models paper Code cont

44 Dec 31, 2022
Ukrainian TTS (text-to-speech) using Coqui TTS

title emoji colorFrom colorTo sdk app_file pinned Ukrainian TTS 🐸 green green gradio app.py false Ukrainian TTS 📢 🤖 Ukrainian TTS (text-to-speech)

Yurii Paniv 85 Dec 26, 2022
The official implementation of VAENAR-TTS, a VAE based non-autoregressive TTS model.

VAENAR-TTS This repo contains code accompanying the paper "VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis". Sa

THUHCSI 138 Oct 28, 2022
Idea is to build a model which will take keywords as inputs and generate sentences as outputs.

keytotext Idea is to build a model which will take keywords as inputs and generate sentences as outputs. Potential use case can include: Marketing Sea

Gagan Bhatia 364 Jan 03, 2023
Lightweight utility tools for the detection of multiple spellings, meanings, and language-specific terminology in British and American English

Breame ( British English and American English) Breame is a lightweight Python package with a number of utility tools to aid in the detection of words

Charles 8 Oct 10, 2022
GVT is a generic translation tool for parts of text on the PC screen with Text to Speak functionality.

GVT is a generic translation tool for parts of text on the PC screen with Text to Speech functionality. I wanted to create it because the existing tools that I experimented with did not satisfy me in

Nuked 1 Aug 21, 2022
A complete NLP guideline for enthusiasts

NLP-NINJA A complete guide for Natural Language Processing in Python Table of Contents S.No. Topic Level Meaning 1 Tokenization 🤍 Beginner 2 Stemming

MAINAK CHAUDHURI 22 Dec 27, 2022
This is a really simple text-to-speech app made with python and tkinter.

Tkinter Text-to-Speech App by Souvik Roy This is a really simple tkinter app which converts the text you have entered into a speech. It is created wit

Souvik Roy 1 Dec 21, 2021