[LREC] MMChat: Multi-Modal Chat Dataset on Social Media

Overview

MMChat

This repo contains the code and data for the LREC2022 paper MMChat: Multi-Modal Chat Dataset on Social Media.

Dataset

MMChat is a large-scale dialogue dataset that contains image-grounded dialogues in Chinese. Each dialogue in MMChat is associated with one or more images (maximum 9 images per dialogue). We design various strategies to ensure the quality of the dialogues in MMChat. Please read our paper for more details. The images in the dataset are hosted on Weibo's static image server. You can refer to the scripts provided in data_processing/weibo_image_crawler to download these images.

Two sample dialogues form MMChat are given below (translated from Chinese): A sample dialogue from MMChat

MMChat is released in different versions:

Rule Filtered Raw MMChat

This version of MMChat contains raw dialogues filtered by our rules. The following table shows some basic statistics:

Item Description Count
Sessions 4.257 M
Sessions with more than 4 utterances 2.304 M
Utterances 18.590 M
Images 4.874 M
Avg. utterance per session 4.367
Avg. image per session 1.670
Avg. character per utterance 14.104

We devide above dialogues into 9 splits to facilitate the download:

  1. Split0 Google Drive, Baidu Netdisk
  2. Split1 Google Drive, Baidu Netdisk
  3. Split2 Google Drive, Baidu Netdisk
  4. Split3 Google Drive, Baidu Netdisk
  5. Split4 Google Drive, Baidu Netdisk
  6. Split5 Google Drive, Baidu Netdisk
  7. Split6 Google Drive, Baidu Netdisk
  8. Split7 Google Drive, Baidu Netdisk
  9. Split8 Google Drive, Baidu Netdisk

LCCC Filtered MMChat

This version of MMChat contains the dialogues that are filtered based on the LCCC (Large-scale Cleaned Chinese Conversation) dataset. Specifically, some dialogues in MMChat are also contained in LCCC. We regard these dialogues as cleaner dialogues since sophisticated schemes are designed in LCCC to filter out noises. This version of MMChat is obtained using the script data_processing/LCCC_filter.py The following table shows some basic statistics:

Item Description Count
Sessions 492.6 K
Sessions with more than 4 utterances 208.8 K
Utterances 1.986 M
Images 1.066 M
Avg. utterance per session 4.031
Avg. image per session 2.514
Avg. character per utterance 11.336

We devide above dialogues into 9 splits to facilitate the download:

  1. Split0 Google Drive, Baidu Netdisk
  2. Split1 Google Drive, Baidu Netdisk
  3. Split2 Google Drive, Baidu Netdisk
  4. Split3 Google Drive, Baidu Netdisk
  5. Split4 Google Drive, Baidu Netdisk
  6. Split5 Google Drive, Baidu Netdisk
  7. Split6 Google Drive, Baidu Netdisk
  8. Split7 Google Drive, Baidu Netdisk
  9. Split8 Google Drive, Baidu Netdisk

MMChat

The MMChat dataset reported in our paper are given here. The Weibo content corresponding to these dialogues are all "分享图片", (i.e., "Share Images" in English). The following table shows some basic statistics:

Item Description Count
Sessions 120.84 K
Sessions with more than 4 utterances 17.32 K
Utterances 314.13 K
Images 198.82 K
Avg. utterance per session 2.599
Avg. image per session 2.791
Avg. character per utterance 8.521

The above dialogues can be downloaded from either Google Drive or Baidu Netdisk.

MMChat-hf

We perform human annotation on the sampled dialogues to determine whether the given images are related to the corresponding dialogues. The following table only shows the statistics for dialogues that are annotated as image-related.

Item Description Count
Sessions 19.90 K
Sessions with more than 4 utterances 8.91 K
Utterances 81.06 K
Images 52.66K
Avg. utterance per session 4.07
Avg. image per session 2.70
Avg. character per utterance 11.93

We annotated about 100K dialogues. All the annotated dialogues can be downloaded from either Google Drive or Baidu Netdisk.

Code

We are also releasing all the codes used for our experiments. You can use the script run_training.sh in each folder to launch the distributed training.

For models that require image features, you can extract the image features using the scripts in data_processing/extract_image_features

The model shown in our paper can be found in dialog_image: Model

Reference

Please cite our paper if you find our work useful ;)

@inproceedings{zheng2022MMChat,
  author    = {Zheng, Yinhe and Chen, Guanyi and Liu, Xin and Sun, Jian},
  title     = {MMChat: Multi-Modal Chat Dataset on Social Media},
  booktitle = {Proceedings of The 13th Language Resources and Evaluation Conference},
  year      = {2022},
  publisher = {European Language Resources Association},
}
@inproceedings{wang2020chinese,
  title     = {A Large-Scale Chinese Short-Text Conversation Dataset},
  author    = {Wang, Yida and Ke, Pei and Zheng, Yinhe and Huang, Kaili and Jiang, Yong and Zhu, Xiaoyan and Huang, Minlie},
  booktitle = {NLPCC},
  year      = {2020},
  url       = {https://arxiv.org/abs/2008.03946}
}
Owner
Silver
Dialogue System, Natural Language Processing
Silver
Swapping face using Face Mesh with TensorFlow Lite

Swapping face using Face Mesh with TensorFlow Lite

iwatake 17 Apr 26, 2022
[内测中]前向式Python环境快捷封装工具,快速将Python打包为EXE并添加CUDA、NoAVX等支持。

QPT - Quick packaging tool 快捷封装工具 GitHub主页 | Gitee主页 QPT是一款可以“模拟”开发环境的多功能封装工具,最短只需一行命令即可将普通的Python脚本打包成EXE可执行程序,并选择性添加CUDA和NoAVX的支持,尽可能兼容更多的用户环境。 感觉还可

QPT Family 545 Dec 28, 2022
Code for 2021 NeurIPS --- Towards Multi-Grained Explainability for Graph Neural Networks

ReFine: Multi-Grained Explainability for GNNs This is the official code for Towards Multi-Grained Explainability for Graph Neural Networks (NeurIPS 20

Shirley (Ying-Xin) Wu 47 Dec 16, 2022
A Multi-attribute Controllable Generative Model for Histopathology Image Synthesis

A Multi-attribute Controllable Generative Model for Histopathology Image Synthesis This is the pytorch implementation for our MICCAI 2021 paper. A Mul

Jiarong Ye 7 Apr 04, 2022
Code release for "Masked-attention Mask Transformer for Universal Image Segmentation"

Mask2Former: Masked-attention Mask Transformer for Universal Image Segmentation Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, Ro

Meta Research 1.2k Jan 02, 2023
Visualizing lattice vibration information from phonon dispersion to atoms (For GPUMD)

Phonon-Vibration-Viewer (For GPUMD) Visualizing lattice vibration information from phonon dispersion for primitive atoms. In this tutorial, we will in

Liangting 6 Dec 10, 2022
curl-impersonate: A special compilation of curl that makes it impersonate Chrome & Firefox

curl-impersonate A special compilation of curl that makes it impersonate real browsers. It can impersonate the four major browsers: Chrome, Edge, Safa

lwthiker 1.9k Jan 03, 2023
Implementation for paper: Self-Regulation for Semantic Segmentation

Self-Regulation for Semantic Segmentation This is the PyTorch implementation for paper Self-Regulation for Semantic Segmentation, ICCV 2021. Citing SR

Dong ZHANG 30 Nov 21, 2022
PyTorch implementation of the Crafting Better Contrastive Views for Siamese Representation Learning

Crafting Better Contrastive Views for Siamese Representation Learning This is the official PyTorch implementation of the ContrastiveCrop paper: @artic

249 Dec 28, 2022
Plenoxels: Radiance Fields without Neural Networks

Plenoxels: Radiance Fields without Neural Networks Alex Yu*, Sara Fridovich-Keil*, Matthew Tancik, Qinhong Chen, Benjamin Recht, Angjoo Kanazawa UC Be

Sara Fridovich-Keil 81 Dec 25, 2022
A PyTorch implementation of "TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?"

TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? Source: Improving Vision Transformer Efficiency and Accuracy by Learning to Tokenize

Caiyong Wang 14 Sep 20, 2022
Learned Initializations for Optimizing Coordinate-Based Neural Representations

Learned Initializations for Optimizing Coordinate-Based Neural Representations Project Page | Paper Matthew Tancik*1, Ben Mildenhall*1, Terrance Wang1

Matthew Tancik 127 Jan 03, 2023
Official PyTorch implementation of U-GAT-IT: Unsupervised Generative Attentional Networks with Adaptive Layer-Instance Normalization for Image-to-Image Translation

U-GAT-IT — Official PyTorch Implementation : Unsupervised Generative Attentional Networks with Adaptive Layer-Instance Normalization for Image-to-Imag

Hyeonwoo Kang 2.4k Jan 04, 2023
Official code for "On the Frequency Bias of Generative Models", NeurIPS 2021

Frequency Bias of Generative Models Generator Testbed Discriminator Testbed This repository contains official code for the paper On the Frequency Bias

35 Nov 01, 2022
Planar Prior Assisted PatchMatch Multi-View Stereo

ACMP [News] The code for ACMH is released!!! [News] The code for ACMM is released!!! About This repository contains the code for the paper Planar Prio

Qingshan Xu 127 Dec 31, 2022
Volsdf - Volume Rendering of Neural Implicit Surfaces

Volume Rendering of Neural Implicit Surfaces Project Page | Paper | Data This re

Lior Yariv 221 Jan 07, 2023
Implementation of ToeplitzLDA for spatiotemporal stationary time series data.

Code for the ToeplitzLDA classifier proposed in here. The classifier conforms sklearn and can be used as a drop-in replacement for other LDA classifiers. For in-depth usage refer to the learning from

Jan Sosulski 5 Nov 07, 2022
ACAV100M: Automatic Curation of Large-Scale Datasets for Audio-Visual Video Representation Learning. In ICCV, 2021.

ACAV100M: Automatic Curation of Large-Scale Datasets for Audio-Visual Video Representation Learning This repository contains the code for our ICCV 202

sangho.lee 28 Nov 08, 2022
DECAF: Deep Extreme Classification with Label Features

DECAF DECAF: Deep Extreme Classification with Label Features @InProceedings{Mittal21, author = "Mittal, A. and Dahiya, K. and Agrawal, S. and Sain

46 Nov 06, 2022
Automatic meme generation model using Tensorflow Keras.

Memefly You can find the project at MemeflyAI. Contributors Nick Buukhalter Harsh Desai Han Lee Project Overview Trello Board Product Canvas Automatic

BloomTech Labs 2 Jan 13, 2022