Code for ICLR 2020 paper "VL-BERT: Pre-training of Generic Visual-Linguistic Representations".

Last update: Dec 18, 2022

Overview

VL-BERT

By Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, Jifeng Dai.

This repository is an official implementation of the paper VL-BERT: Pre-training of Generic Visual-Linguistic Representations.

Update on 2020/01/16: Added visualization code.

Update on 2019/12/20: VL-BERT was accepted by ICLR 2020.

Introduction

VL-BERT is a simple yet powerful pre-trainable generic representation for visual-linguistic tasks. It is pre-trained on a massive-scale caption dataset and a text-only corpus, and can be fine-tuned for various downstream visual-linguistic tasks, such as Visual Commonsense Reasoning, Visual Question Answering and Referring Expression Comprehension.

Thanks to PyTorch and its 3rd-party libraries, this codebase also contains the following features:

Distributed Training
FP16 Mixed-Precision Training
Various Optimizers and Learning Rate Schedulers
Gradient Accumulation
Monitoring the Training Using TensorboardX

Citing VL-BERT

@inproceedings{
  Su2020VL-BERT:,
  title={VL-BERT: Pre-training of Generic Visual-Linguistic Representations},
  author={Weijie Su and Xizhou Zhu and Yue Cao and Bin Li and Lewei Lu and Furu Wei and Jifeng Dai},
  booktitle={International Conference on Learning Representations},
  year={2020},
  url={https://openreview.net/forum?id=SygXPaEYvH}
}

Prepare

Environment

Ubuntu 16.04, CUDA 9.0, GCC 4.9.4

Python 3.6.x

# We recommend using Anaconda/Miniconda to create a conda environment
conda create -n vl-bert python=3.6 pip
conda activate vl-bert

PyTorch 1.0.0 or 1.1.0

conda install pytorch=1.1.0 cudatoolkit=9.0 -c pytorch

Apex (optional, for speed-up and fp16 training)

git clone https://github.com/jackroos/apex
cd ./apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

Other requirements:

pip install Cython
pip install -r requirements.txt

Compile
```
./scripts/init.sh
```
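
The version requirements listed in this section can be checked before training starts. The following is a minimal sketch, not part of the repository: the helper names `release_tuple`, `python_ok` and `torch_ok` are hypothetical, and the parser only handles plain release strings such as `1.1.0` (optionally with a `+local` suffix).

```python
import sys

# Versions taken from the Environment section of this README.
SUPPORTED_PYTHON = (3, 6)
SUPPORTED_TORCH = {(1, 0, 0), (1, 1, 0)}


def release_tuple(version):
    """Parse 'X.Y.Z' (optionally followed by '+local') into a tuple of ints."""
    core = version.split("+", 1)[0]
    return tuple(int(part) for part in core.split(".")[:3])


def python_ok(version_info=sys.version_info):
    """True if the interpreter is Python 3.6.x."""
    return tuple(version_info[:2]) == SUPPORTED_PYTHON


def torch_ok(torch_version):
    """True if the given PyTorch version string is 1.0.0 or 1.1.0."""
    try:
        return release_tuple(torch_version) in SUPPORTED_TORCH
    except ValueError:
        # Pre-release or otherwise unparsable strings are treated as unsupported.
        return False


print("python ok:", python_ok())
print("torch 1.1.0 ok:", torch_ok("1.1.0"))
```

In a real environment, `torch_ok(torch.__version__)` would be the natural call after `import torch`.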

Data

See PREPARE_DATA.md.

Pre-trained Models

See PREPARE_PRETRAINED_MODELS.md.

Training

Distributed Training on Single-Machine

./scripts/dist_run_single.sh <num_gpus> <task>/train_end2end.py <path_to_cfg> <dir_to_store_checkpoint>

<num_gpus>: number of GPUs to use.
<task>: pretrain/vcr/vqa/refcoco.
<path_to_cfg>: config yaml file under ./cfgs/<task>.
<dir_to_store_checkpoint>: root directory to store checkpoints.

The following is a more concrete example:

./scripts/dist_run_single.sh 4 vcr/train_end2end.py ./cfgs/vcr/base_q2a_4x16G_fp32.yaml ./
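
When several runs are launched from a driver script, the command above can be assembled programmatically. This is a minimal sketch: `single_machine_cmd` is a hypothetical helper, not something the repository provides, and it only reproduces the argument order documented above.

```python
import shlex

# Values documented for the <task> placeholder.
TASKS = ("pretrain", "vcr", "vqa", "refcoco")


def single_machine_cmd(num_gpus, task, cfg_path, ckpt_dir):
    """Return the argv list for ./scripts/dist_run_single.sh."""
    if task not in TASKS:
        raise ValueError(f"unknown task: {task!r}")
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return [
        "./scripts/dist_run_single.sh",
        str(num_gpus),
        f"{task}/train_end2end.py",
        cfg_path,
        ckpt_dir,
    ]


cmd = single_machine_cmd(4, "vcr", "./cfgs/vcr/base_q2a_4x16G_fp32.yaml", "./")
# Quote each argument so the printed line can be pasted into a shell.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Passing `cmd` to `subprocess.run(cmd, check=True)` from the repository root would launch the run.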

Distributed Training on Multi-Machine

For example, on 2 machines (A and B), each with 4 GPUs,

run the following command on machine A:

./scripts/dist_run_multi.sh 2 0 <ip_addr_of_A> 4 <task>/train_end2end.py <path_to_cfg> <dir_to_store_checkpoint>

run the following command on machine B:

./scripts/dist_run_multi.sh 2 1 <ip_addr_of_A> 4 <task>/train_end2end.py <path_to_cfg> <dir_to_store_checkpoint>
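
The two commands differ only in their second argument. Judging from the example, the positional arguments appear to be the number of machines, the index of the current machine, the IP address of machine A, and the number of GPUs per machine; that reading is an assumption, not something this README states. Under that assumption, the per-machine commands can be generated as follows (`multi_machine_cmds` is a hypothetical helper, and the IP address and checkpoint directory are placeholders):

```python
def multi_machine_cmds(num_machines, master_ip, gpus_per_machine,
                       task, cfg_path, ckpt_dir):
    """Return one argv list per machine, indexed by the machine's rank."""
    cmds = []
    for rank in range(num_machines):
        cmds.append([
            "./scripts/dist_run_multi.sh",
            str(num_machines),      # assumed: total number of machines
            str(rank),              # assumed: index of this machine (0 for A)
            master_ip,              # IP address of machine A
            str(gpus_per_machine),  # assumed: GPUs on each machine
            f"{task}/train_end2end.py",
            cfg_path,
            ckpt_dir,
        ])
    return cmds


cmds = multi_machine_cmds(
    2, "192.0.2.10", 4,  # placeholder IP address
    "vcr", "./cfgs/vcr/base_q2a_4x16G_fp32.yaml", "./ckpts",
)
for rank, argv in enumerate(cmds):
    print(f"machine {rank}:", " ".join(argv))
```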

Non-Distributed Training

./scripts/nondist_run.sh <task>/train_end2end.py <path_to_cfg> <dir_to_store_checkpoint>

Note:

The yaml files under ./cfgs set batch sizes for GPUs with at least 16G of memory. You may need to adapt the batch size and gradient accumulation steps to your hardware; e.g., if you decrease the batch size, increase the gradient accumulation steps accordingly to keep the 'actual' batch size for SGD unchanged.
For efficiency, we recommend distributed training even on a single machine. For RefCOCO+, however, distributed training may deadlock for an unknown reason (possibly related to a PyTorch dataloader deadlock); in that case, simply use non-distributed training.
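
The batch-size adjustment in the note above is simple arithmetic. The sketch below assumes the usual data-parallel relation, 'actual' batch size = number of GPUs × per-GPU batch size × gradient accumulation steps; whether the config files express batch size per GPU is an assumption, and all numbers are illustrative rather than taken from ./cfgs.

```python
def accumulation_steps(ref_gpus, ref_batch_per_gpu, ref_accum,
                       new_gpus, new_batch_per_gpu):
    """Accumulation steps that keep the 'actual' batch size unchanged."""
    actual = ref_gpus * ref_batch_per_gpu * ref_accum
    per_step = new_gpus * new_batch_per_gpu
    if actual % per_step != 0:
        raise ValueError(
            f"actual batch size {actual} is not a multiple of {per_step}; "
            "choose a different per-GPU batch size"
        )
    return actual // per_step


# Illustrative numbers: a reference setup of 4 GPUs x 16 samples x 1 step,
# moved to 4 GPUs that only fit 4 samples each.
steps = accumulation_steps(4, 16, 1, 4, 4)
print("gradient accumulation steps:", steps)
```

The helper raises instead of rounding, since a rounded step count would silently change the 'actual' batch size.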

Evaluation

VCR

Local evaluation on val set:

python vcr/val.py \
  --a-cfg <cfg_of_q2a> --r-cfg <cfg_of_qa2r> \
  --a-ckpt <checkpoint_of_q2a> --r-ckpt <checkpoint_of_qa2r> \
  --gpus <indexes_of_gpus_to_use> \
  --result-path <dir_to_save_result> --result-name <result_file_name>

Note: <indexes_of_gpus_to_use> is the list of GPU indexes to use, e.g., 0 1 2 3.

Generate prediction results on test set for leaderboard submission:

python vcr/test.py \
  --a-cfg <cfg_of_q2a> --r-cfg <cfg_of_qa2r> \
  --a-ckpt <checkpoint_of_q2a> --r-ckpt <checkpoint_of_qa2r> \
  --gpus <indexes_of_gpus_to_use> \
  --result-path <dir_to_save_result> --result-name <result_file_name>

VQA

Generate prediction results on test set for EvalAI submission:

python vqa/test.py \
  --cfg <cfg_file> \
  --ckpt <checkpoint> \
  --gpus <indexes_of_gpus_to_use> \
  --result-path <dir_to_save_result> --result-name <result_file_name>

RefCOCO+

Local evaluation on val/testA/testB set:

python refcoco/test.py \
  --split <val|testA|testB> \
  --cfg <cfg_file> \
  --ckpt <checkpoint> \
  --gpus <indexes_of_gpus_to_use> \
  --result-path <dir_to_save_result> --result-name <result_file_name>
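
To evaluate all three splits in one go, the command above can be generated once per split. This is a minimal sketch using only the flags shown above; `refcoco_eval_cmds` is a hypothetical helper, and the config path, checkpoint path, and result names are placeholders.

```python
SPLITS = ("val", "testA", "testB")


def refcoco_eval_cmds(cfg, ckpt, gpus, result_path):
    """Return one argv list per RefCOCO+ split."""
    cmds = []
    for split in SPLITS:
        cmds.append([
            "python", "refcoco/test.py",
            "--split", split,
            "--cfg", cfg,
            "--ckpt", ckpt,
            "--gpus", *[str(g) for g in gpus],
            "--result-path", result_path,
            "--result-name", f"refcoco_{split}",  # hypothetical naming scheme
        ])
    return cmds


eval_cmds = refcoco_eval_cmds(
    cfg="path/to/cfg.yaml",          # placeholder
    ckpt="path/to/checkpoint.model", # placeholder
    gpus=[0, 1, 2, 3],
    result_path="./results",         # placeholder
)
for argv in eval_cmds:
    print(" ".join(argv))
```

Each list can be passed to `subprocess.run(argv, check=True)` from the repository root to run the evaluations sequentially.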

Visualization

See VISUALIZATION.md.

Acknowledgements

Many thanks to the codebases that helped us a lot in building this one.
