Code and data for ImageCoDe, a contextual vison-and-language benchmark

Overview

ImageCoDe

arxiv

This repository contains code and data for ImageCoDe: Image Retrieval from Contextual Descriptions.

Example

Data

All collected descriptions for the training and validation set are under data/train_data.json and data/valid_data.json.

Image sets can be downloaded on Zenodo or GoogleDrive and should be unzipped in data/.

You can download from the commandline via:

wget https://zenodo.org/record/6518944/files/image-sets.zip

For ViLBERT experiments, you need to download a pretrained ViLBERT checkpoint from volta here, simply by clicking on ViLBERT in the table. Save the downloaded file as baselines/vilbert/vilbert-pretrained.bin. Since ViLBERT uses image features from Faster R-CNN, you also have to downloaded these for all ImageCoDe images here: Google Drive link. Save the file as data/rcnn-features36-36.lmdb. The same procedure applies for UNITER.

The format for data/train_data.json looks like this:

{
  "MSR-VTT-videoTrainValVideo_video2044-shot1_0": {
    "6": "a mom holding her babies in the middle of the picture, no other image intervenes with the image.",
    "7": "The image is fading between a woman holding a baby and a woman sitting with a red background. The hands of the woman sitting aren't visible."
  },
  "video-storytelling-videochristmas_56Nm66j-i5Q-shot14_2": {
  "..."
  }
}

And the images under data/ have the following structure. Each folder contains 10 images. If the images are video frames, the number X in imgX.jpg indicates the frame number:

  .
  ├── MSR-VTT-videoTrainValVideo_video2044-shot1_0
      │   ├── img0.jpg
      │   ├── img7.jpg
      │   ├── ...
  ├── video-storytelling-videochristmas_56Nm66j-i5Q-shot14_2
      │   ├── ...

Leaderboard

Based on this you can train your model and test on the unlabeled test set:

{
  "MSR-VTT-videoTestVideo_video7763-shot2_1": [
    "The team name on shirt is visible without a number, but all letters can be seen for team name.",
    "the player can be seen with him on the left close to the logo on the pitch on the right and can be clearly seen"
  ],
  "...":
  ["..."]
}

In order to appear on the leaderboard, please format your results in the following format:

{
  "MSR-VTT-videoTestVideo_video7763-shot2_1": [
    1,
    2
  ],
  "...":
  ["..."]
}

Where the example here with "1" and "2" represent image indices ranging from 0 to 9. You can submit to the leaderboard by sending your test set file (or a download link) to [email protected] and we will update the leaderboard quickly (max. 1-2 days). The leaderboard is maintained on the project website and might change its submission procedure at some point.

Installations

Run install.sh for running CLIP experiments. For VilBERT follow the instructions for volta.

Code

Code for CLIP is under baselines/clip and and code for ViLBERT/UNITER is under baselines/crossencoders.

For details commands to run each model variant shown in the paper, have a look at the README in baselines.

For example to train the best performing model CLIP+TemporalEmbeddings, run:

python3 contextual.py --lr 2e-6 --lr_head 1e-4 -b 36 -m ViT-B/16 --fusion mult -a gelu --logit_scale 1000 --finetuned_checkpoint_path checkpoints/CONTRA_clip_best__36_4e-06_30_1395526.pt --add_input --frozen_clip --positional

Data Analysis

Our manual annotation of various phenomena (negation, nuances, ...) in our validation set can be found under data/manual_annotation_valid.yaml

License

This work is licensed under the MIT license. See LICENSE for details. Third-party software and data sets are subject to their respective licenses.
If you want to cite our paper, please use:

@inproceedings{krojer_contextual_2022,
  address = {Online},
  title = {Image Retrieval from Contextual Descriptions},
  booktitle = {Proceedings of the 60th {Annual} {Meeting} of the {Association} for {Computational} {Linguistics},
  publisher = {Association for Computational Linguistics},
  author = {Krojer, Benno and Adlakha, Vaibhav and Vineet, Vibhav and Goyal, Yash and Ponti, Edoardo and Reddy, Siva},
  month = may,
  year = {2022},
}

Acknowledgement

Our data (specifically the image sets) are built upon 3 video dataset and Open Images:

We also the volta repository for ViLBERT and UNITER baseline variants

For questions or feedback, don't hesitate to contact the author: [email protected]

Owner
McGill NLP
Research group within McGill University and Mila focusing on various topics in natural language processing.
McGill NLP
Official DGL implementation of "Rethinking High-order Graph Convolutional Networks"

SE Aggregation This is the implementation for Rethinking High-order Graph Convolutional Networks. Here we show the codes for citation networks as an e

Tianqi Zhang (张天启) 32 Jul 19, 2022
J.A.R.V.I.S is an AI virtual assistant made in python.

J.A.R.V.I.S is an AI virtual assistant made in python. Running JARVIS Without Python To run JARVIS without python: 1. Head over to our installation pa

somePythonProgrammer 16 Dec 29, 2022
python debugger and anti-vm that checks if you're in a virtual machine or if someones trying to debug your file

Anti-Debug was made by Love ❌ code ✅ 🎉 ・What it checks for ・ Kills tools that can be used to debug your file ・ Exits if ran in vm (supports different

Rdimo 31 Aug 09, 2022
The GitHub repository for the paper: “Time Series is a Special Sequence: Forecasting with Sample Convolution and Interaction“.

SCINet This is the original PyTorch implementation of the following work: Time Series is a Special Sequence: Forecasting with Sample Convolution and I

386 Jan 01, 2023
XtremeDistil framework for distilling/compressing massive multilingual neural network models to tiny and efficient models for AI at scale

XtremeDistilTransformers for Distilling Massive Multilingual Neural Networks ACL 2020 Microsoft Research [Paper] [Video] Releasing [XtremeDistilTransf

Microsoft 125 Jan 04, 2023
How to use TensorLayer

How to use TensorLayer While research in Deep Learning continues to improve the world, we use a bunch of tricks to implement algorithms with TensorLay

zhangrui 349 Dec 07, 2022
The project page of paper: Architecture disentanglement for deep neural networks [ICCV 2021, oral]

This is the project page for the paper: Architecture Disentanglement for Deep Neural Networks, Jie Hu, Liujuan Cao, Tong Tong, Ye Qixiang, ShengChuan

Jie Hu 15 Aug 30, 2022
Numerical differential equation solvers in JAX. Autodifferentiable and GPU-capable.

Diffrax Numerical differential equation solvers in JAX. Autodifferentiable and GPU-capable. Diffrax is a JAX-based library providing numerical differe

Patrick Kidger 717 Jan 09, 2023
[CVPR 2021] NormalFusion: Real-Time Acquisition of Surface Normals for High-Resolution RGB-D Scanning

NormalFusion: Real-Time Acquisition of Surface Normals for High-Resolution RGB-D Scanning Project Page | Paper | Supplemental material #1 | Supplement

KAIST VCLAB 49 Nov 24, 2022
This is an official PyTorch implementation of Task-Adaptive Neural Network Search with Meta-Contrastive Learning (NeurIPS 2021, Spotlight).

NeurIPS 2021 (Spotlight): Task-Adaptive Neural Network Search with Meta-Contrastive Learning This is an official PyTorch implementation of Task-Adapti

Wonyong Jeong 15 Nov 21, 2022
SAN for Product Attributes Prediction

SAN Heterogeneous Star Graph Attention Network for Product Attributes Prediction This repository contains the official PyTorch implementation for ADVI

Xuejiao Zhao 9 Dec 12, 2022
Complex-Valued Neural Networks (CVNN)Complex-Valued Neural Networks (CVNN)

Complex-Valued Neural Networks (CVNN) Done by @NEGU93 - J. Agustin Barrachina Using this library, the only difference with a Tensorflow code is that y

youceF 1 Nov 12, 2021
Trainable Bilateral Filter Layer (PyTorch)

Trainable Bilateral Filter Layer (PyTorch) This repository contains our GPU-accelerated trainable bilateral filter layer (three spatial and one range

FabianWagner 26 Dec 25, 2022
An official implementation of "Exploiting a Joint Embedding Space for Generalized Zero-Shot Semantic Segmentation" (ICCV 2021) in PyTorch.

Exploiting a Joint Embedding Space for Generalized Zero-Shot Semantic Segmentation This is an official implementation of the paper "Exploiting a Joint

CV Lab @ Yonsei University 35 Oct 26, 2022
Official implementation of our neural-network-based fast diffuse room impulse response generator (FAST-RIR)

This is the official implementation of our neural-network-based fast diffuse room impulse response generator (FAST-RIR) for generating room impulse responses (RIRs) for a given acoustic environment.

12 Jan 13, 2022
Hierarchical Motion Encoder-Decoder Network for Trajectory Forecasting (HMNet)

Hierarchical Motion Encoder-Decoder Network for Trajectory Forecasting (HMNet) Our paper: https://arxiv.org/abs/2111.13324 We will release the complet

15 Oct 17, 2022
HHP-Net: A light Heteroscedastic neural network for Head Pose estimation with uncertainty

HHP-Net: A light Heteroscedastic neural network for Head Pose estimation with uncertainty Giorgio Cantarini, Francesca Odone, Nicoletta Noceti, Federi

18 Aug 02, 2022
a morph transfer UGATIT for image translation.

Morph-UGATIT a morph transfer UGATIT for image translation. Introduction 中文技术文档 This is Pytorch implementation of UGATIT, paper "U-GAT-IT: Unsupervise

55 Nov 14, 2022
Discovering Interpretable GAN Controls [NeurIPS 2020]

GANSpace: Discovering Interpretable GAN Controls Figure 1: Sequences of image edits performed using control discovered with our method, applied to thr

Erik Härkönen 1.7k Jan 03, 2023
A Python training and inference implementation of Yolov5 helmet detection in Jetson Xavier nx and Jetson nano

yolov5-helmet-detection-python A Python implementation of Yolov5 to detect head or helmet in the wild in Jetson Xavier nx and Jetson nano. In Jetson X

12 Dec 05, 2022