A crowdsourced dataset of dialogues grounded in social contexts that involve the use of commonsense.

Overview

Commonsense-Dialogues Dataset

We present Commonsense-Dialogues, a crowdsourced dataset of ~11K dialogues grounded in social contexts that involve the use of commonsense. The social contexts were sourced from the train split of the SocialIQA dataset, a multiple-choice question-answering benchmark for social commonsense reasoning.

To collect the Commonsense-Dialogues dataset, each Turker was presented with a social context and asked to write a dialogue of 4-6 turns between two people based on the event(s) described in the context. The Turker was asked to alternate between the roles of an individual referenced in the context and a third-party friend. The following dialogues are examples:

    "1": {  # dialogue_id
        "context": "Sydney met Carson's mother for the first time last week. He liked her.",   # multiple individuals in the context: Sydney and Carson
        "speaker": "Sydney",   # role 1 = Sydney, role 2 = a third-person friend of Sydney
        "turns": [
            "I met Carson's mother last week for the first time.",
            "How was she?",
            "She turned out to be really nice. I like her.",
            "That's good to hear.",
            "It is, especially since Carson and I are getting serious.",
            "Well, at least you'll like your in-law if you guys get married."
        ]
    }

    "2": {
        "context": "Kendall had a party at Jordan's house but was found out to not have asked and just broke in.",
        "speaker": "Kendall",
        "turns": [
            "Did you hear about my party this weekend at Jordan\u2019s house?",
            "I heard it was amazing, but that you broke in.",
            "That was a misunderstanding, I had permission to be there.",
            "Who gave you permission?",
            "I talked to Jordan about it months ago before he left town to go to school, but he forgot to tell his roommates about it.",
            "Ok cool, I hope everything gets resolved."
        ]
    }
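Because each dialogue stores only one named speaker and a flat list of turns, the speaker of each turn has to be inferred. The sketch below pairs every turn with a speaker label by alternating between the two roles. It assumes the first turn always belongs to the named `speaker`, which holds in both examples above but is not stated as a guarantee. `label_turns` and the `"Friend"` label are hypothetical names chosen for illustration, and the sample uses only the first four turns of dialogue 1.

```python
from typing import Dict, List, Tuple

# Hypothetical label for the unnamed second role; the dataset does not name it.
FRIEND_LABEL = "Friend"


def label_turns(dialogue: Dict) -> List[Tuple[str, str]]:
    """Pair each turn with a speaker label, alternating between the two roles.

    Assumption: the first turn belongs to dialogue["speaker"], as in the
    two example dialogues.
    """
    speakers = (dialogue["speaker"], FRIEND_LABEL)
    return [(speakers[i % 2], turn) for i, turn in enumerate(dialogue["turns"])]


# First four turns of example dialogue 1.
example = {
    "context": "Sydney met Carson's mother for the first time last week. He liked her.",
    "speaker": "Sydney",
    "turns": [
        "I met Carson's mother last week for the first time.",
        "How was she?",
        "She turned out to be really nice. I like her.",
        "That's good to hear.",
    ],
}

for speaker, turn in label_turns(example):
    print(f"{speaker}: {turn}")
```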

The data is in the /data directory of this repo: train.json has ~9K dialogues, and valid.json and test.json have ~1K dialogues each. Because all the contexts were sourced from the train split of SocialIQA, any multi-task training or evaluation that combines Commonsense-Dialogues with SocialIQA must be done with caution to ensure fair and accurate conclusions.
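One possible precaution when combining the two datasets is to drop SocialIQA examples whose context was also used to write dialogues in the split being evaluated. The sketch below is a minimal illustration, not a recommended protocol. It assumes each split file is a single JSON object keyed by dialogue id (as in the examples above), that SocialIQA examples are dicts with a `context` field, and that contexts match by exact string comparison. `load_split`, `contexts_in`, and `drop_overlapping` are hypothetical helper names.

```python
import json
from pathlib import Path
from typing import Dict, List, Set


def load_split(path: str) -> Dict[str, dict]:
    """Read one split file (e.g. data/test.json) into a dict keyed by dialogue id."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


def contexts_in(dialogues: Dict[str, dict]) -> Set[str]:
    """Collect the distinct social contexts used by a set of dialogues."""
    return {d["context"] for d in dialogues.values()}


def drop_overlapping(socialiqa_examples: List[dict], used: Set[str]) -> List[dict]:
    """Keep only SocialIQA examples whose context is not in `used`.

    Assumption: each SocialIQA example is a dict with a "context" key and
    contexts can be compared as exact strings.
    """
    return [ex for ex in socialiqa_examples if ex["context"] not in used]


# Toy in-memory data standing in for a real split and real SocialIQA examples.
demo_dialogues = {
    "1": {"context": "ctx A", "speaker": "X", "turns": ["hi", "hello"]},
    "2": {"context": "ctx B", "speaker": "Y", "turns": ["hey", "yo"]},
    "3": {"context": "ctx A", "speaker": "X", "turns": ["so", "yes"]},
}
demo_socialiqa = [{"context": "ctx A"}, {"context": "ctx C"}]

held_out = contexts_in(demo_dialogues)
print(drop_overlapping(demo_socialiqa, held_out))
```

With the real files, `contexts_in(load_split("data/test.json"))` would give the set of contexts to exclude; the path is relative to the repo root.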

Some statistics about the data are provided below:

| Stat | Train | Valid | Test |
| --- | --- | --- | --- |
| # of dialogues | 9058 | 1157 | 1158 |
| average # of turns in a dialogue | 5.72 | 5.72 | 5.71 |
| average # of words in a turn | 12.4 | 12.4 | 12.2 |
| # of distinct SocialIQA contexts used | 3672 | 483 | 473 |
| average # of dialogues for a SocialIQA context | 2.46 | 2.395 | 2.45 |
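The statistics above can be recomputed from a loaded split along the following lines. This is a sketch under stated assumptions: the split is a non-empty dict keyed by dialogue id with `context` and `turns` fields, and words are counted by whitespace splitting. The README does not say how words were counted, so the word average may not reproduce the reported figures exactly. `split_stats` is a hypothetical helper name.

```python
from typing import Dict


def split_stats(dialogues: Dict[str, dict]) -> Dict[str, float]:
    """Compute per-split statistics for a non-empty dict of dialogues."""
    n_dialogues = len(dialogues)
    # Flatten all turns across dialogues.
    turns = [turn for d in dialogues.values() for turn in d["turns"]]
    n_contexts = len({d["context"] for d in dialogues.values()})
    return {
        "dialogues": n_dialogues,
        "avg_turns_per_dialogue": len(turns) / n_dialogues,
        # Assumption: a "word" is a whitespace-separated token.
        "avg_words_per_turn": sum(len(t.split()) for t in turns) / len(turns),
        "distinct_contexts": n_contexts,
        "avg_dialogues_per_context": n_dialogues / n_contexts,
    }


# Toy data: two dialogues written for the same context.
toy_split = {
    "1": {"context": "ctx", "speaker": "A", "turns": ["a b", "c"]},
    "2": {"context": "ctx", "speaker": "A", "turns": ["d e f", "g h"]},
}

for name, value in split_stats(toy_split).items():
    print(f"{name}: {value}")
```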

Security

See CONTRIBUTING for more information.

License

This repository is licensed under the CC-BY-NC 4.0 License.

Citation

If you use this dataset, please cite the following paper:

@inproceedings{zhou-etal-2021-commonsense,
    title = "Commonsense-Focused Dialogues for Response Generation: An Empirical Study",
    author = "Zhou, Pei  and
      Gopalakrishnan, Karthik  and
      Hedayatnia, Behnam  and
      Kim, Seokhwan  and
      Pujara, Jay  and
      Ren, Xiang  and
      Liu, Yang  and
      Hakkani-Tur, Dilek",
    booktitle = "Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue",
    year = "2021",
    address = "Singapore and Online",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/2109.06427"
}

Note that the paper uses newly collected dialogues as well as those that were filtered from existing datasets. This repo contains our newly collected dialogues alone.

Owner

Alexa