PIZZA - a task-oriented semantic parsing dataset

Overview

PIZZA - a task-oriented semantic parsing dataset

The PIZZA dataset continues the exploration of task-oriented parsing by introducing a new dataset for parsing pizza and drink orders, whose semantics cannot be captured by flat slots and intents.

The dataset comes in two main versions, one in a recently introduced utterance-level hierarchical notation that we call TOP, and one whose targets are executable representations (EXR).

Below are two examples of orders that can be found in the data:

{
    "dev.SRC": "five medium pizzas with tomatoes and ham",
    "dev.EXR": "(ORDER (PIZZAORDER (NUMBER 5 ) (SIZE MEDIUM ) (TOPPING HAM ) (TOPPING TOMATOES ) ) )",
    "dev.TOP": "(ORDER (PIZZAORDER (NUMBER five ) (SIZE medium ) pizzas with (TOPPING tomatoes ) and (TOPPING ham ) ) )"
}
{
    "dev.SRC": "i want to order two medium pizzas with sausage and black olives and two medium pizzas with pepperoni and extra cheese and three large pizzas with pepperoni and sausage",
    "dev.EXR": "(ORDER (PIZZAORDER (NUMBER 2 ) (SIZE MEDIUM ) (COMPLEX_TOPPING (QUANTITY EXTRA ) (TOPPING CHEESE ) ) (TOPPING PEPPERONI ) ) (PIZZAORDER (NUMBER 2 ) (SIZE MEDIUM ) (TOPPING OLIVES ) (TOPPING SAUSAGE ) ) (PIZZAORDER (NUMBER 3 ) (SIZE LARGE ) (TOPPING PEPPERONI ) (TOPPING SAUSAGE ) ) )",
    "dev.TOP": "(ORDER i want to order (PIZZAORDER (NUMBER two ) (SIZE medium ) pizzas with (TOPPING sausage ) and (TOPPING black olives ) ) and (PIZZAORDER (NUMBER two ) (SIZE medium ) pizzas with (TOPPING pepperoni ) and (COMPLEX_TOPPING (QUANTITY extra ) (TOPPING cheese ) ) ) and (PIZZAORDER (NUMBER three ) (SIZE large ) pizzas with (TOPPING pepperoni ) and (TOPPING sausage ) ) )"
}

While more details on the dataset conventions and construction can be found in the paper, we give a high level idea of the main rules our target semantics follow:

  • Each order can include any number of pizza and/or drink suborders. These suborders are labeled with the constructors PIZZAORDER and DRINKORDER, respectively.
  • Each top-level order is always labeled with the root constructor ORDER.
  • Both pizza and drink orders can have NUMBER and SIZE attributes.
  • A pizza order can have any number of TOPPING attributes, each of which can be negated. Negative particles can have larger scope with the use of the or particle, e.g., no peppers or onions will negate both peppers and onions.
  • Toppings can be modified by quantifiers such as a lot or extra, a little, etc.
  • A pizza order can have a STYLE attribute (e.g., thin crust style or chicago style).
  • Styles can be negated.
  • Each drink order must have a DRINKTYPE (e.g. coke), and can also have a CONTAINERTYPE (e.g. bottle) and/or a volume modifier (e.g., three 20 fl ounce coke cans).

We view ORDER, PIZZAORDER, and DRINKORDER as intents, and the rest of the semantic constructors as composite slots, with the exception of the leaf constructors, which are viewed as entities (resolved slot values).

Dataset statistics

In the below table we give high level statistics of the data.

Train Dev Test
Number of utterances 2,456,446 348 1,357
Unique entities 367 109 180
Avg entities per utterance 5.32 5.37 5.42
Avg intents per utterance 1.76 1.25 1.28

More details can be found in appendix of our publication.

Getting Started

The repo structure is as follows:

PIZZA
|
|_____ data
|      |_____ PIZZA_train.json.zip             # a zipped version of the training data
|      |_____ PIZZA_train.10_percent.json.zip  # a random subset representing 10% of training data
|      |_____ PIZZA_dev.json                   # the dev portion of the data
|      |_____ PIZZA_test.json                  # the test portion of the data
|
|_____ utils
|      |_____ __init__.py
|      |_____ entity_resolution.py # entity resolution script
|      |_____ express_utils.py     # utilities
|      |_____ semantic_matchers.py # metric functions
|      |_____ sexp_reader.py       # tree reader helper functions 
|      |_____ trees.py             # tree classes and readers
|      |
|      |_____ catalogs             # directory with catalog files of entities
|      
|_____ doc 
|      |_____ PIZZA_dataset_reader_metrics_examples.ipynb
|
|_____ READMED.md

The dev and test data files are json lines files where each line represents one utterance and contains 4 keys:

  • *.SRC: the natural language order input
  • *.EXR: the ground truth target semantic representation in EXR format
  • *.TOP: the ground truth target semantic representation in TOP format
  • *.PCFG_ERR: a boolean flag indicating whether our PCFG system parsed the utterance correctly. See publication for more details

The training data file comes in a similar format, with two differences:

  • there is no train.PCFG_ERR flag since the training data is all synthetically generated hence parsable with perfect accuracy. In other words, this flag would be True for all utterances in that file.
  • there is an extra train.TOP-DECOUPLED key that is the ground truth target semantic representation in TOP-Decoupled format. See publication for more details.

Security

See CONTRIBUTING for more information.

License

This library is licensed under the CC-BY-NC-4.0 License.

HAIS_2GNN: 3D Visual Grounding with Graph and Attention

HAIS_2GNN: 3D Visual Grounding with Graph and Attention This repository is for the HAIS_2GNN research project. Tao Gu, Yue Chen Introduction The motiv

Yue Chen 1 Nov 26, 2022
A benchmark for evaluation and comparison of various NLP tasks in Persian language.

Persian NLP Benchmark The repository aims to track existing natural language processing models and evaluate their performance on well-known datasets.

Mofid AI 68 Dec 19, 2022
Utilize Korean BERT model in sentence-transformers library

ko-sentence-transformers 이 프로젝트는 KoBERT 모델을 sentence-transformers 에서 보다 쉽게 사용하기 위해 만들어졌습니다. Ko-Sentence-BERT-SKTBERT 프로젝트에서는 KoBERT 모델을 sentence-trans

Junghyun 40 Dec 20, 2022
Residual2Vec: Debiasing graph embedding using random graphs

Residual2Vec: Debiasing graph embedding using random graphs This repository contains the code for S. Kojaku, J. Yoon, I. Constantino, and Y.-Y. Ahn, R

SADAMORI KOJAKU 5 Oct 12, 2022
Rhyme with AI

Local development Create a conda virtual environment and activate it: conda env create --file environment.yml conda activate rhyme-with-ai Install the

GoDataDriven 28 Nov 21, 2022
🤕 spelling exceptions builder for lazy people

🤕 spelling exceptions builder for lazy people

Vlad Bokov 3 May 12, 2022
PUA Programming Language written in Python.

pua-lang PUA Programming Language written in Python. Installation git clone https://github.com/zhaoyang97/pua-lang.git cd pua-lang pip install . Try

zy 4 Feb 19, 2022
Uncomplete archive of files from the European Nopsled Team

European Nopsled CTF Archive This is an archive of collected material from various Capture the Flag competitions that the European Nopsled team played

European Nopsled 4 Nov 24, 2021
DataCLUE: 国内首个以数据为中心的AI测评(含模型分析报告)

DataCLUE 以数据为中心的AI测评(DataCLUE) DataCLUE: A Chinese Data-centric Language Evaluation Benchmark 内容导引 章节 描述 简介 介绍以数据为中心的AI测评(DataCLUE)的背景 任务描述 任务描述 实验结果

CLUE benchmark 135 Dec 22, 2022
Coreference resolution for English, German and Polish, optimised for limited training data and easily extensible for further languages

Coreferee Author: Richard Paul Hudson, msg systems ag 1. Introduction 1.1 The basic idea 1.2 Getting started 1.2.1 English 1.2.2 German 1.2.3 Polish 1

msg systems ag 169 Dec 21, 2022
Training and evaluation codes for the BertGen paper (ACL-IJCNLP 2021)

BERTGEN This repository is the implementation of the paper "BERTGEN: Multi-task Generation through BERT" (https://arxiv.org/abs/2106.03484). The codeb

<a href=[email protected]"> 9 Oct 26, 2022
Rethinking the Truly Unsupervised Image-to-Image Translation - Official PyTorch Implementation (ICCV 2021)

Rethinking the Truly Unsupervised Image-to-Image Translation (ICCV 2021) Each image is generated with the source image in the left and the average sty

Clova AI Research 436 Dec 27, 2022
Document processing using transformers

Doc Transformers Document processing using transformers. This is still in developmental phase, currently supports only extraction of form data i.e (ke

Vishnu Nandakumar 13 Dec 21, 2022
Easily train your own text-generating neural network of any size and complexity on any text dataset with a few lines of code.

textgenrnn Easily train your own text-generating neural network of any size and complexity on any text dataset with a few lines of code, or quickly tr

Max Woolf 4.8k Dec 30, 2022
A simple recipe for training and inferencing Transformer architecture for Multi-Task Learning on custom datasets. You can find two approaches for achieving this in this repo.

multitask-learning-transformers A simple recipe for training and inferencing Transformer architecture for Multi-Task Learning on custom datasets. You

Shahrukh Khan 48 Jan 02, 2023
Product-Review-Summarizer - Created a product review summarizer which clustered thousands of product reviews and summarized them into a maximum of 500 characters, saving precious time of customers and helping them make a wise buying decision.

Product-Review-Summarizer - Created a product review summarizer which clustered thousands of product reviews and summarized them into a maximum of 500 characters, saving precious time of customers an

Parv Bhatt 1 Jan 01, 2022
A library that integrates huggingface transformers with the world of fastai, giving fastai devs everything they need to train, evaluate, and deploy transformer specific models.

blurr A library that integrates huggingface transformers with version 2 of the fastai framework Install You can now pip install blurr via pip install

ohmeow 253 Dec 31, 2022
OpenAI CLIP text encoders for multiple languages!

Multilingual-CLIP OpenAI CLIP text encoders for any language Colab Notebook · Pre-trained Models · Report Bug Overview OpenAI recently released the pa

Fredrik Carlsson 481 Dec 30, 2022
pkuseg多领域中文分词工具; The pkuseg toolkit for multi-domain Chinese word segmentation

pkuseg:一个多领域中文分词工具包 (English Version) pkuseg 是基于论文[Luo et. al, 2019]的工具包。其简单易用,支持细分领域分词,有效提升了分词准确度。 目录 主要亮点 编译和安装 各类分词工具包的性能对比 使用方式 论文引用 作者 常见问题及解答 主要

LancoPKU 6k Dec 29, 2022
Bidirectional LSTM-CRF and ELMo for Named-Entity Recognition, Part-of-Speech Tagging and so on.

anaGo anaGo is a Python library for sequence labeling(NER, PoS Tagging,...), implemented in Keras. anaGo can solve sequence labeling tasks such as nam

Hiroki Nakayama 1.5k Dec 05, 2022