Diverse Image Captioning with Context-Object Split Latent Spaces (NeurIPS 2020)

Overview

This repository is the PyTorch implementation of the paper:

Diverse Image Captioning with Context-Object Split Latent Spaces (NeurIPS 2020)

Shweta Mahajan and Stefan Roth

We additionally include evaluation code from Luo et al. in the folder GoogleConceptualCaptioning, which has been patched for compatibility.

Requirements

The code in this repository was written for Python 3.6.10 and CUDA 9.0.

Requirements:

  • torch 1.1.0
  • torchvision 0.3.0
  • nltk 3.5
  • inflect 4.1.0
  • tqdm 4.46.0
  • sklearn 0.0
  • h5py 2.10.0

To install requirements:

conda config --add channels pytorch
conda config --add channels anaconda
conda config --add channels conda-forge
conda config --add channels conda-forge/label/cf202003
conda create -n <environment_name> --file requirements.txt
conda activate <environment_name>

Preprocessed data

The dataset used in this project for assessing accuracy and diversity is COCO 2014 (m-RNN split). The full dataset is available here.

We use Faster R-CNN features for the images, similar to Anderson et al. We additionally require the "classes" and "scores" fields detected for the image regions; the classes correspond to Visual Genome.

Download instructions

Preprocessed training data is available here as hdf5 files. The provided hdf5 files contain the following fields:

  • image_id: ID of the COCO image
  • num_boxes: The number of proposal regions detected by Faster R-CNN
  • features: ResNet-101 features of the extracted regions
  • classes: Visual Genome classes of the extracted regions
  • scores: Scores of the Visual Genome classes of the extracted regions

Note that the ["image_id", "num_boxes", "features"] fields are identical to those of Anderson et al.
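To sanity-check a download, the fields can be inspected directly with h5py. The sketch below is illustrative only: the exact dataset keys and internal layout of the provided files are an assumption based on the field list above.

    import h5py

    # Inspect the preprocessed features file; the keys below follow the
    # field list above, but the actual internal layout may differ.
    with h5py.File("coco/coco_val_2014_adaptive_withclasses.h5", "r") as f:
        print("Top-level keys:", list(f.keys()))
        for key in ("image_id", "num_boxes", "features", "classes", "scores"):
            if key in f:
                print(key, f[key].shape, f[key].dtype)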

Create a folder named coco and download the preprocessed training and test datasets from the coco folder in the drive link above as follows (alternatively, download the entire coco folder directly):

  1. Download the following files for training on COCO 2014 (m-RNN split):
coco/coco_train_2014_adaptive_withclasses.h5
coco/coco_val_2014_adaptive_withclasses.h5
coco/coco_val_mRNN.txt
coco/coco_test_mRNN.txt
  2. Download the following files for training on held-out COCO (novel object captioning):
coco/coco_train_2014_noc_adaptive_withclasses.h5
coco/coco_train_extra_2014_noc_adaptive_withclasses.h5
  3. Download the following file for testing on held-out COCO (novel object captioning):
coco/coco_test_2014_noc_adaptive_withclasses.h5
  4. Download the (caption) annotation files and place them in a subdirectory coco/annotations (mirroring the Google Drive folder structure):
coco/annotations/captions_train2014.json
coco/annotations/captions_val2014.json
  5. Download the following files from the drive link into a separate folder data (outside coco). These files contain the contextual neighbours for pseudo-supervision:
data/nn_final.pkl
data/nn_noc.pkl

To run the train/test scripts described below, "pathToData" and "nn_dict_path" in params.json and params_noc.json need to be set to the coco and data folders created above.
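As a convenience, the two entries can be set programmatically. A minimal sketch, assuming params.json is flat JSON and that "nn_dict_path" points at the pickle file itself (both assumptions; check the shipped config):

    import json

    # Point the config at the downloaded data; "pathToData" and
    # "nn_dict_path" are the keys named above. Whether "nn_dict_path"
    # expects the data folder or the .pkl file is an assumption.
    with open("params.json") as f:
        params = json.load(f)
    params["pathToData"] = "coco/"
    params["nn_dict_path"] = "data/nn_final.pkl"
    with open("params.json", "w") as f:
        json.dump(params, f, indent=2)

For params_noc.json, the neighbour file would be data/nn_noc.pkl instead.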

Verify Folder Structure after Download

The folder structure after the data download should be as follows:

coco
 - annotations
   - captions_train2014.json
   - captions_val2014.json
 - coco_val_mRNN.txt
 - coco_test_mRNN.txt
 - coco_train_2014_adaptive_withclasses.h5
 - coco_val_2014_adaptive_withclasses.h5
 - coco_train_2014_noc_adaptive_withclasses.h5
 - coco_train_extra_2014_noc_adaptive_withclasses.h5
 - coco_test_2014_noc_adaptive_withclasses.h5
data
 - coco_classname.txt
 - visual_genome_classes.txt
 - vocab_coco_full.pkl
 - nn_final.pkl
 - nn_noc.pkl
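
A short script can confirm that everything from the listing above is in place, e.g.:

    from pathlib import Path

    # Verify the folder structure listed above before training.
    expected = [
        "coco/annotations/captions_train2014.json",
        "coco/annotations/captions_val2014.json",
        "coco/coco_val_mRNN.txt",
        "coco/coco_test_mRNN.txt",
        "coco/coco_train_2014_adaptive_withclasses.h5",
        "coco/coco_val_2014_adaptive_withclasses.h5",
        "coco/coco_train_2014_noc_adaptive_withclasses.h5",
        "coco/coco_train_extra_2014_noc_adaptive_withclasses.h5",
        "coco/coco_test_2014_noc_adaptive_withclasses.h5",
        "data/coco_classname.txt",
        "data/visual_genome_classes.txt",
        "data/vocab_coco_full.pkl",
        "data/nn_final.pkl",
        "data/nn_noc.pkl",
    ]
    missing = [p for p in expected if not Path(p).exists()]
    print("All files present." if not missing else "Missing: %s" % missing)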

Training

Please follow these steps for training:

  1. Set hyperparameters for training in params.json and params_noc.json.
  2. Train a model on COCO 2014 for captioning:
   	python ./scripts/train.py
  3. Train a model for diverse novel object captioning:
   	python ./scripts/train_noc.py

Please note that the data folder provides the required vocabulary.
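The vocabulary ships as data/vocab_coco_full.pkl and can be loaded for inspection; the type of the pickled object is not documented here, so the sketch below is illustrative only (loading may require the repository's vocabulary class to be importable):

    import pickle

    # Illustrative only: the pickled object's type (dict, custom
    # Vocabulary class, ...) is an assumption.
    with open("data/vocab_coco_full.pkl", "rb") as f:
        vocab = pickle.load(f)
    print(type(vocab))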

Memory requirements

The models were trained on a single NVIDIA V100 GPU with 32 GB of memory; 16 GB is sufficient for a single training run.

Pre-trained models and evaluation

We provide pre-trained models both for captioning on COCO 2014 (m-RNN split) and for novel object captioning. Please follow these steps:

  1. Download the pre-trained models from here to the ckpts folder.

  2. For evaluation of oracle scores and diversity, we follow Luo et al. In the folder GoogleConceptualCaptioning, download cider and run the download scripts in the cococaption folder, then run the evaluation:

   	./GoogleConceptualCaptioning/cococaption/get_google_word2vec_model.sh
   	./GoogleConceptualCaptioning/cococaption/get_stanford_models.sh
   	python ./scripts/eval.py
  3. For diversity evaluation, create the required numpy file for consensus re-ranking using:
   	python ./scripts/eval_diversity.py

For consensus re-ranking, follow the steps here. To obtain the final diversity scores, follow the instructions of DiversityMetrics: convert the numpy file to the required JSON format and run the script evalscripts.py.
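DiversityMetrics consumes captions as JSON. A minimal conversion sketch, assuming the numpy file from eval_diversity.py stores a dict of per-image caption lists; the actual array layout and the schema expected by evalscripts.py are assumptions, so adapt the keys to the DiversityMetrics README:

    import json
    import numpy as np

    # Hypothetical conversion: adjust the file name and JSON keys to
    # match what eval_diversity.py writes and DiversityMetrics expects.
    captions = np.load("diversity_captions.npy", allow_pickle=True).item()
    records = [{"image_id": img_id, "captions": caps}
               for img_id, caps in captions.items()]
    with open("diversity_captions.json", "w") as f:
        json.dump(records, f)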

  4. To evaluate the F1 score for novel object captioning:
   	python ./scripts/eval_noc.py
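
eval_noc.py computes the scores directly; the sketch below only illustrates the standard held-out COCO F1 protocol, where mentioning the novel object word in a generated caption counts as a detection (the inputs are hypothetical):

    # F1 for one held-out object: a caption is a positive "detection"
    # if it mentions the object word; ground truth is whether the image
    # actually contains the object.
    def noc_f1(captions, contains_object, word):
        tp = fp = fn = 0
        for cap, has_obj in zip(captions, contains_object):
            mentioned = word in cap.lower().split()
            if mentioned and has_obj:
                tp += 1
            elif mentioned:
                fp += 1
            elif has_obj:
                fn += 1
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall == 0.0:
            return 0.0
        return 2 * precision * recall / (precision + recall)

    print(noc_f1(["a zebra grazing", "a horse in a field"], [True, False], "zebra"))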

Results

Oracle evaluation on the COCO dataset

| Method   | B4    | B3    | B2    | B1    | CIDEr | METEOR | ROUGE | SPICE |
|----------|-------|-------|-------|-------|-------|--------|-------|-------|
| COS-CVAE | 0.633 | 0.739 | 0.842 | 0.942 | 1.893 | 0.450  | 0.770 | 0.339 |

Diversity evaluation on the COCO dataset

| Method   | Unique | Novel | mBLEU | Div-1 | Div-2 |
|----------|--------|-------|-------|-------|-------|
| COS-CVAE | 96.3   | 4404  | 0.53  | 0.39  | 0.57  |
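
As in prior work on diverse captioning, Unique is the percentage of unique captions among the sampled candidates, Novel counts sentences not seen in training, mBLEU is the mutual BLEU-4 overlap between the captions of an image (lower means more diverse), and Div-1/Div-2 are the ratios of distinct 1-/2-grams to the total number of words per image. A minimal sketch of the Div-n definition (the repository computes these through the patched GoogleConceptualCaptioning code):

    # Div-n: distinct n-grams divided by the total word count over the
    # sampled captions of one image.
    def div_n(captions, n):
        tokens = [c.lower().split() for c in captions]
        total_words = sum(len(t) for t in tokens)
        ngrams = {tuple(t[i:i + n]) for t in tokens
                  for i in range(len(t) - n + 1)}
        return len(ngrams) / total_words

    caps = ["a man rides a horse", "a person riding a brown horse"]
    print(div_n(caps, 1), div_n(caps, 2))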

F1-score evaluation on the held-out COCO dataset

| Method   | bottle | bus  | couch | microwave | pizza | racket | suitcase | zebra | average |
|----------|--------|------|-------|-----------|-------|--------|----------|-------|---------|
| COS-CVAE | 35.4   | 83.6 | 53.8  | 63.2      | 86.7  | 69.5   | 46.1     | 81.7  | 65.0    |

BibTeX

@inproceedings{coscvae20neurips,
  title     = {Diverse Image Captioning with Context-Object Split Latent Spaces},
  author    = {Mahajan, Shweta and Roth, Stefan},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year = {2020}
}
Owner

Visual Inference Lab @TU Darmstadt