code for TCL: Vision-Language Pre-Training with Triple Contrastive Learning, CVPR 2022

Last update: Jan 02, 2023

Overview

Vision-Language Pre-Training with Triple Contrastive Learning, CVPR 2022

News

(03/16/2022) Uploaded retrieval checkpoints fine-tuned on COCO and Flickr

This is the official PyTorch implementation of TCL

Requirements:

conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=11.0 -c pytorch
pip install transformers==4.8.1
pip install timm==0.4.9
conda install ruamel_yaml
pip install opencv-python
pip install --upgrade Pillow
pip install einops
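
After installation, the pinned versions can be checked with a few lines of Python. This is a minimal sketch, not part of the repository: `PINNED`, `installed_version` and `find_mismatches` are hypothetical names, the pins are copied from the commands above, and it assumes Python 3.8+ (for `importlib.metadata`) and that the conda packages expose standard package metadata.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the install commands above.
PINNED = {
    "torch": "1.7.1",
    "torchvision": "0.8.2",
    "torchaudio": "0.7.2",
    "transformers": "4.8.1",
    "timm": "0.4.9",
}


def installed_version(name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package whose version differs from its pin to (expected, found)."""
    mismatches = {}
    for name, expected in pinned.items():
        found = lookup(name)
        # Ignore local build tags such as "+cu110" when comparing.
        if found is None or found.split("+")[0] != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {found}")
```

An empty result means every pinned package was found at the expected version.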

Pre-training Datasets:

MSCOCO (2014)
Visual Genome (VG)
- Download images of part1 and part2 and combine them together
Conceptual Captions (CC3M)
- Download Train_GCC-training.tsv and Validation_GCC-1.1.0-Validation.tsv from Kaggle
- Then use img2dataset to download the images listed in the downloaded TSV files
- More details
SBU Captions
- Download the URL list from subcaptions
- Then use img2dataset to download the images from the URL list
CC12M
- Download cc12m.tsv
- Then use img2dataset to download the images listed in the downloaded TSV file
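
Before handing a TSV file to img2dataset, it can help to confirm which column holds the image URLs and whether the file starts with a header row. The helper below is a hypothetical sketch that uses only the standard library; it makes no assumption about the column order and simply samples the first few rows.

```python
import csv
from itertools import islice


def _is_url(cell):
    """Return True if the cell looks like an http(s) URL."""
    return cell.startswith(("http://", "https://"))


def inspect_tsv(tsv_path, sample_rows=5):
    """Return (url_column_index, has_header) based on the first rows of the file."""
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        rows = list(islice(reader, sample_rows + 1))
    if not rows:
        raise ValueError(f"{tsv_path} is empty")
    # A first row without any URL is treated as a header row.
    has_header = not any(_is_url(cell) for cell in rows[0])
    data = rows[1:] if has_header else rows
    if not data:
        raise ValueError(f"{tsv_path} contains no data rows")
    for index in range(len(data[0])):
        if all(index < len(row) and _is_url(row[index]) for row in data):
            return index, has_header
    raise ValueError("no column of URLs found in the sampled rows")
```

For example, `inspect_tsv("cc12m.tsv")` returns the zero-based index of the URL column together with a flag indicating whether a header row was detected.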

Downstream-task Datasets:

Flickr30k
VQA v2
NLVR2
- We recommend using direct-image-download

Json Files from Pre-training and Downstream Tasks:

Refer to the Download section in ALBEF.
You need to change the image paths in the JSON files so that they point to your downloaded images.
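
Changing the image paths can be scripted. The sketch below assumes each JSON file is a list of objects whose image location is stored as a string under an `"image"` key; that layout is an assumption, so check it against your files first (files that store several images per entry would need a different key). The function names and the prefixes in the usage note are hypothetical.

```python
import json


def rewrite_image_paths(entries, old_prefix, new_prefix, key="image"):
    """Return copies of entries with old_prefix at the start of entry[key] replaced."""
    updated = []
    for entry in entries:
        entry = dict(entry)  # shallow copy so the input list is left untouched
        path = entry.get(key)
        if isinstance(path, str) and path.startswith(old_prefix):
            entry[key] = new_prefix + path[len(old_prefix):]
        updated.append(entry)
    return updated


def rewrite_json_file(src, dst, old_prefix, new_prefix, key="image"):
    """Load a JSON list from src, rewrite its image paths, and write it to dst."""
    with open(src, encoding="utf-8") as handle:
        entries = json.load(handle)
    with open(dst, "w", encoding="utf-8") as handle:
        json.dump(rewrite_image_paths(entries, old_prefix, new_prefix, key), handle)
```

A call such as `rewrite_json_file("coco.json", "coco_local.json", "/old/root/", "/data/images/")` writes a new file and leaves the original untouched; entries whose path does not start with the old prefix are copied unchanged.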

Pre-trained checkpoint:

Pre-training:

python -m torch.distributed.launch --nproc_per_node=8 \
--use_env Pretrain.py \
--config ./configs/Pretrain.yaml \
--output_dir output/pretrain

Downstream Tasks:

Image-Text Retrieval

# zero-shot coco 
python -m torch.distributed.launch --nproc_per_node=8 \
--use_env Retrieval.py \
--config ./configs/Retrieval_coco.yaml \
--output_dir output/pretrain_e30_Retrieval_coco_zeroshot \
--checkpoint output/pretrain/checkpoint_29.pth \
--evaluate

# fine-tune flickr
python -m torch.distributed.launch --nproc_per_node=8 \
--use_env Retrieval.py \
--config ./configs/Retrieval_flickr.yaml \
--output_dir output/pretrain_e30_Retrieval_flickr \
--checkpoint output/pretrain/checkpoint_29.pth

# fine-tune coco
python -m torch.distributed.launch --nproc_per_node=8 \
--use_env Retrieval.py \
--config ./configs/Retrieval_coco.yaml \
--output_dir output/pretrain_e30_Retrieval_coco \
--checkpoint output/pretrain/checkpoint_29.pth

# zero-shot flickr 
python -m torch.distributed.launch --nproc_per_node=8 \
--use_env Retrieval.py \
--config ./configs/Retrieval_flickr.yaml \
--output_dir output/pretrain_e30_Retrieval_flickr_zeroshot \
--checkpoint output/pretrain_e30_Retrieval_coco/checkpoint_best.pth \
--evaluate
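
The four retrieval commands above differ only in the config file, output directory, checkpoint, and the presence of `--evaluate`. If you prefer to drive them from Python, a small wrapper can assemble the argument list. This is a sketch only: `build_launch_command` is a hypothetical helper, and it uses no flags beyond those shown in the commands above.

```python
import sys


def build_launch_command(script, config, output_dir, checkpoint=None,
                         evaluate=False, nproc_per_node=8):
    """Assemble the argv list for one of the torch.distributed.launch commands."""
    command = [
        sys.executable, "-m", "torch.distributed.launch",
        f"--nproc_per_node={nproc_per_node}",
        "--use_env", script,
        "--config", config,
        "--output_dir", output_dir,
    ]
    if checkpoint is not None:
        command += ["--checkpoint", checkpoint]
    if evaluate:
        # Evaluation only: no fine-tuning is performed.
        command.append("--evaluate")
    return command
```

The returned list can be passed to `subprocess.run(command, check=True)`. Building a list rather than a shell string avoids quoting problems when paths contain spaces.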

VQA

python -m torch.distributed.launch --nproc_per_node=8 \
--use_env VQA.py \
--config ./configs/VQA.yaml \
--output_dir output/pretrain_e30_vqa \
--checkpoint output/pretrain/checkpoint_29.pth

For how to evaluate and interpret the results, see salesforce/ALBEF#19.

Visual Entailment

python -m torch.distributed.launch --nproc_per_node=8 \
--use_env VE.py \
--config ./configs/VE.yaml \
--output_dir output/pretrain_e30_VE \
--checkpoint output/pretrain/checkpoint_29.pth

For how to evaluate and interpret the results, see salesforce/ALBEF#19.

NLVR2

# pre-train nlvr
python -m torch.distributed.launch --nproc_per_node=8 \
--use_env Pretrain_nlvr.py \
--config ./configs/NLVR_pretrain.yaml \
--output_dir output/pretrain_e30_NLVR_pretrain \
--checkpoint output/pretrain/checkpoint_29.pth

# fine-tune nlvr
python -m torch.distributed.launch --nproc_per_node=8 \
--use_env NLVR.py \
--config ./configs/NLVR.yaml \
--output_dir output/pretrain_e30_NLVR \
--checkpoint output/pretrain_e30_NLVR_pretrain/checkpoint_00.pth

For how to evaluate and interpret the results, see salesforce/ALBEF#19.
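
The commands in this README refer to numbered checkpoint files such as `checkpoint_29.pth` and `checkpoint_00.pth`. Judging from those names alone (an inference, not confirmed behaviour of the training scripts), the most recent checkpoint in an output directory can be located as sketched below. `latest_checkpoint` is a hypothetical helper; files such as `checkpoint_best.pth` are deliberately ignored.

```python
import re
from pathlib import Path

# File names such as checkpoint_29.pth, as used in the commands above.
CHECKPOINT_PATTERN = re.compile(r"checkpoint_(\d+)\.pth")


def latest_checkpoint(output_dir):
    """Return the Path of the highest-numbered checkpoint_NN.pth, or None."""
    best_path, best_epoch = None, -1
    for path in Path(output_dir).glob("checkpoint_*.pth"):
        match = CHECKPOINT_PATTERN.fullmatch(path.name)
        if match and int(match.group(1)) > best_epoch:
            best_path, best_epoch = path, int(match.group(1))
    return best_path
```

For example, `latest_checkpoint("output/pretrain")` returns the path to pass as `--checkpoint`, or `None` if the directory holds no numbered checkpoint.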

Citation:

@inproceedings{yang2022vision,
  title={Vision-Language Pre-Training with Triple Contrastive Learning},
  author={Yang, Jinyu and Duan, Jiali and Tran, Son and Xu, Yi and Chanda, Sampath and Chen, Liqun and Zeng, Belinda and Chilimbi, Trishul and Huang, Junzhou},
  booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition},
  year={2022}
}

Our code is largely borrowed from ALBEF.
