Conceptual 12M is a dataset containing (image-URL, caption) pairs collected for vision-and-language pre-training.

Last update: Dec 07, 2022

Overview

Conceptual 12M

We introduce Conceptual 12M (CC12M), a dataset of ~12 million image-text pairs intended for vision-and-language pre-training. It is larger than Conceptual Captions (CC3M), a dataset widely used for pre-training and end-to-end training of image captioning models, and covers a much more diverse set of visual concepts. See our paper for further details.

Download

Click here to download (2.5GB)

Format (.tsv)

[image_url_1]\t[caption_1]
[image_url_2]\t[caption_2]
[image_url_3]\t[caption_3]
…
[image_url_N]\t[caption_N]
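
The layout above can be read one line at a time. The following is a minimal sketch, assuming UTF-8 text and exactly one tab between the URL and the caption; the function name `read_pairs` and the sample URLs are illustrative and not part of the dataset release.

```python
import io
from typing import Iterable, Iterator, Tuple


def read_pairs(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (image_url, caption) pairs from lines in the .tsv layout."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        # Split on the first tab only, so a caption that happens to
        # contain a tab is kept whole.
        url, sep, caption = line.partition("\t")
        if not sep:
            continue  # no tab found: not a valid (url, caption) row
        yield url, caption


# Hypothetical two-row sample standing in for the real file.
sample = io.StringIO(
    "http://example.com/a.jpg\ta dog on a beach\n"
    "http://example.com/b.jpg\ta \"red\" car\n"
)
pairs = list(read_pairs(sample))
```

For the downloaded file, the same function can be applied to an open file handle, for example `open(path, encoding="utf-8")`, where `path` is wherever the .tsv was saved.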

Cite

If you use this dataset in your research, please cite:

Soravit Changpinyo, Piyush Sharma, Nan Ding, Radu Soricut. Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts. CVPR 2021.

@inproceedings{changpinyo2021cc12m,
  title = {{Conceptual 12M}: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts},
  author = {Changpinyo, Soravit and Sharma, Piyush and Ding, Nan and Soricut, Radu},
  booktitle = {CVPR},
  year = {2021},
}

FAQs

Q1: Can you provide image pixels?

A1: We do not own any of the images in the dataset and hence cannot legally provide them to you. The owner of an image can choose to delete it at any time, in which case the image will no longer be available. As a result, some images in the dataset will unfortunately be lost over time, and we are unable to help with this issue.

Q2: Is it normal that a subset of images cannot be retrieved from the provided URLs?

A2: Yes. See Q1.
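
Because some URLs will fail, a download loop is better off recording failures than stopping at the first one. The following is a minimal sketch of that idea using only the Python standard library; `fetch_bytes`, `collect_available`, and the 10-second timeout are illustrative assumptions, not tooling shipped with the dataset.

```python
import http.client
import urllib.request
from typing import Callable, Iterable, List, Tuple


def fetch_bytes(url: str, timeout: float = 10.0) -> bytes:
    """Download the raw bytes behind a URL (timeout value is an assumption)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def collect_available(
    pairs: Iterable[Tuple[str, str]],
    fetch: Callable[[str], bytes] = fetch_bytes,
) -> Tuple[List[Tuple[bytes, str]], List[str]]:
    """Return (kept, missing): downloaded (bytes, caption) pairs and failed URLs."""
    kept: List[Tuple[bytes, str]] = []
    missing: List[str] = []
    for url, caption in pairs:
        try:
            data = fetch(url)
        except (OSError, ValueError, http.client.HTTPException):
            # OSError covers urllib.error.URLError/HTTPError and timeouts;
            # ValueError covers malformed URLs.
            missing.append(url)
            continue
        kept.append((data, caption))
    return kept, missing
```

The `fetch` parameter exists so the loop can be exercised without network access, and so a different HTTP client can be swapped in.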

Q3: Is CC12M an “expanded” CC3M?

A3: No. CC12M is purposely designed for vision-and-language pre-training and is meant to be disjoint from CC3M. CC3M is cleaner and more appropriate for fine-tuning, but it can be used in conjunction with CC12M for pre-training, as illustrated in our paper. Coincidentally, their intersection is non-zero (approximately 63K URLs).
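
Anyone who needs the two datasets to be strictly disjoint can find the shared URLs themselves. The sketch below assumes exact string matching on URLs, which may not be how the figure above was computed, and the default column indices are assumptions to check against the actual files (in particular, the position of the URL column in the CC3M file). The names `url_set` and `shared_urls` are illustrative.

```python
from typing import Iterable, Set


def url_set(lines: Iterable[str], url_column: int) -> Set[str]:
    """Collect the URL field from tab-separated lines."""
    urls: Set[str] = set()
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if url_column < len(fields) and fields[url_column]:
            urls.add(fields[url_column])
    return urls


def shared_urls(
    cc12m_lines: Iterable[str],
    cc3m_lines: Iterable[str],
    cc12m_url_column: int = 0,  # URL first, per the format described above
    cc3m_url_column: int = 1,   # assumption: verify against the CC3M file
) -> Set[str]:
    """Return URLs that appear verbatim in both files."""
    return url_set(cc12m_lines, cc12m_url_column) & url_set(
        cc3m_lines, cc3m_url_column
    )


# Hypothetical rows standing in for the two real files.
cc12m_rows = ["http://e.com/1.jpg\tcat\n", "http://e.com/2.jpg\tdog\n"]
cc3m_rows = ["a bird\thttp://e.com/3.jpg\n", "a dog\thttp://e.com/2.jpg\n"]
overlap = shared_urls(cc12m_rows, cc3m_rows)
```

Rows whose URL is in `overlap` can then be dropped from one of the two sets before training.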

Contact Us

If you have a question that is not answered in the FAQs above, please create an issue in this repository.

If you would like to share feedback or report concerns, please email us at [email protected].

Owner

Google Research Datasets
