PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

Overview

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

This is the PyTorch code of the BLIP paper. The code has been tested on PyTorch 1.10. To install the dependencies, run

pip install -r requirements.txt

Catalog:

  • Inference demo
  • Pre-trained and finetuned checkpoints
  • Finetuning code for Image-Text Retrieval, Image Captioning, VQA, and NLVR2
  • Pre-training code
  • Download of bootstrapped pre-training datasets

Inference demo:

Run our interactive demo using Colab notebook (no GPU needed). The demo includes code for: (1) image captioning, (2) open-ended visual question answering, (3) multimodal / unimodal feature extraction.

Integrated into Huggingface Spaces 🤗 using Gradio. Try out the Web Demo Hugging Face Spaces

Pre-trained checkpoints:

Num. pre-train images BLIP w/ ViT-B BLIP w/ ViT-B and CapFilt-L BLIP w/ ViT-L
14M Download - -
129M Download Download Download

Finetuned checkpoints:

Task BLIP w/ ViT-B BLIP w/ ViT-B and CapFilt-L BLIP w/ ViT-L
Image-Text Retrieval (COCO) Download - Download
Image-Text Retrieval (Flickr30k) Download - Download
Image Captioning (COCO) - Download Download
VQA Download Download -
NLVR2 Download - -

Image-Text Retrieval:

  1. Download COCO and Flickr30k datasets from the original websites, and set 'image_root' in configs/retrieval_{dataset}.yaml accordingly.
  2. To evaluate the finetuned BLIP model on COCO, run:
python -m torch.distributed.run --nproc_per_node=8 train_retrieval.py \
--config ./configs/retrieval_coco.yaml \
--output_dir output/retrieval_coco \
--evaluate
  1. To finetune the pre-trained checkpoint using 8 A100 GPUs, first set 'pretrained' in configs/retrieval_coco.yaml as "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base.pth". Then run:
python -m torch.distributed.run --nproc_per_node=8 train_retrieval.py \
--config ./configs/retrieval_coco.yaml \
--output_dir output/retrieval_coco 

Image-Text Captioning:

  1. Download COCO and NoCaps datasets from the original websites, and set 'image_root' in configs/caption_coco.yaml and configs/nocaps.yaml accordingly.
  2. To evaluate the finetuned BLIP model on COCO, run:
python -m torch.distributed.run --nproc_per_node=8 train_caption.py --evaluate
  1. To evaluate the finetuned BLIP model on NoCaps, generate results with: (evaluation needs to be performed on official server)
python -m torch.distributed.run --nproc_per_node=8 eval_nocaps.py 
  1. To finetune the pre-trained checkpoint using 8 A100 GPUs, first set 'pretrained' in configs/caption_coco.yaml as "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model*_base.pth". Then run:
python -m torch.distributed.run --nproc_per_node=8 train_caption.py 

VQA:

  1. Download VQA v2 dataset and Visual Genome dataset from the original websites, and set 'vqa_root' and 'vg_root' in configs/vqa.yaml.
  2. To evaluate the finetuned BLIP model, generate results with: (evaluation needs to be performed on official server)
python -m torch.distributed.run --nproc_per_node=8 train_vqa.py --evaluate
  1. To finetune the pre-trained checkpoint using 16 A100 GPUs, first set 'pretrained' in configs/vqa.yaml as "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model*_base.pth". Then run:
python -m torch.distributed.run --nproc_per_node=16 train_vqa.py 

NLVR2:

  1. Download NLVR2 dataset from the original websites, and set 'image_root' in configs/nlvr.yaml.
  2. To evaluate the finetuned BLIP model, run
python -m torch.distributed.run --nproc_per_node=8 train_nlvr.py --evaluate
  1. To finetune the pre-trained checkpoint using 16 A100 GPUs, first set 'pretrained' in configs/nlvr.yaml as "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base.pth". Then run:
python -m torch.distributed.run --nproc_per_node=16 train_nlvr.py 

Pre-train:

  1. Prepare training json files where each json file contains a list. Each item in the list is a dictonary with two key-value pairs: {'image': path_of_image, 'caption': text_of_image}.
  2. In configs/pretrain.yaml, set 'train_file' as the paths for the json files .
  3. Pre-train the model using 8 A100 GPUs:
python -m torch.distributed.run --nproc_per_node=8 pretrain.py --config ./configs/Pretrain.yaml --output_dir output/Pretrain 

Pre-training datasets download:

We provide bootstrapped pre-training datasets as json files. Each json file contains a list. Each item in the list is a dictonary with two key-value pairs: {'url': url_of_image, 'caption': text_of_image}.

Image source Filtered web caption Filtered synthetic caption Filtered synthetic caption by ViT-L
CC3M+CC12M+SBU Download Download Download
LAION115M Download Download Download

Citation

If you find this code to be useful for your research, please consider citing.

@misc{li2022blip,
      title={BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation}, 
      author={Junnan Li and Dongxu Li and Caiming Xiong and Steven Hoi},
      year={2022},
      eprint={2201.12086},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Acknowledgement

The implementation of BLIP relies on resources from ALBEF, Huggingface Transformers, and timm. We thank the original authors for their open-sourcing.

Owner
Salesforce
A variety of vendor agnostic projects which power Salesforce
Salesforce
Collection of Docker images for ML/DL and video processing projects

Collection of Docker images for ML/DL and video processing projects. Overview of images Three types of images differ by tag postfix: base: Python with

OSAI 87 Nov 22, 2022
ShuttleNet: Position-aware Fusion of Rally Progress and Player Styles for Stroke Forecasting in Badminton (AAAI'22)

ShuttleNet: Position-aware Rally Progress and Player Styles Fusion for Stroke Forecasting in Badminton (AAAI 2022) Official code of the paper ShuttleN

Wei-Yao Wang 11 Nov 30, 2022
ExCon: Explanation-driven Supervised Contrastive Learning

ExCon: Explanation-driven Supervised Contrastive Learning Contributors of this repo: Zhibo Zhang ( Zhibo (Darren) Zhang 18 Nov 01, 2022

Picasso: a methods for embedding points in 2D in a way that respects distances while fitting a user-specified shape.

Picasso Code to generate Picasso embeddings of any input matrix. Picasso maps the points of an input matrix to user-defined, n-dimensional shape coord

Pachter Lab 45 Dec 23, 2022
City-Scale Multi-Camera Vehicle Tracking Guided by Crossroad Zones Code

City-Scale Multi-Camera Vehicle Tracking Guided by Crossroad Zones Requirements Python 3.8 or later with all requirements.txt dependencies installed,

88 Dec 12, 2022
Official Pytorch Implementation of Adversarial Instance Augmentation for Building Change Detection in Remote Sensing Images.

IAug_CDNet Official Implementation of Adversarial Instance Augmentation for Building Change Detection in Remote Sensing Images. Overview We propose a

53 Dec 02, 2022
Group Fisher Pruning for Practical Network Compression(ICML2021)

Group Fisher Pruning for Practical Network Compression (ICML2021) By Liyang Liu*, Shilong Zhang*, Zhanghui Kuang, Jing-Hao Xue, Aojun Zhou, Xinjiang W

Shilong Zhang 129 Dec 13, 2022
A few stylization coreML models that I've trained with CreateML

CoreML-StyleTransfer A few stylization coreML models that I've trained with CreateML You can open and use the .mlmodel files in the "models" folder in

Doron Adler 8 Aug 18, 2022
dualFace: Two-Stage Drawing Guidance for Freehand Portrait Sketching (CVMJ)

dualFace dualFace: Two-Stage Drawing Guidance for Freehand Portrait Sketching (CVMJ) We provide python implementations for our CVM 2021 paper "dualFac

Haoran XIE 46 Nov 10, 2022
Using Machine Learning to Test Causal Hypotheses in Conjoint Analysis

Readme File for "Using Machine Learning to Test Causal Hypotheses in Conjoint Analysis" by Ham, Imai, and Janson. (2022) All scripts were written and

0 Jan 27, 2022
A diff tool for language models

LMdiff Qualitative comparison of large language models. Demo & Paper: http://lmdiff.net LMdiff is a MIT-IBM Watson AI Lab collaboration between: Hendr

Hendrik Strobelt 27 Dec 29, 2022
Accelerated Multi-Modal MR Imaging with Transformers

Accelerated Multi-Modal MR Imaging with Transformers Dependencies numpy==1.18.5 scikit_image==0.16.2 torchvision==0.8.1 torch==1.7.0 runstats==1.8.0 p

54 Dec 16, 2022
Density-aware Single Image De-raining using a Multi-stream Dense Network (CVPR 2018)

DID-MDN Density-aware Single Image De-raining using a Multi-stream Dense Network He Zhang, Vishal M. Patel [Paper Link] (CVPR'18) We present a novel d

He Zhang 224 Dec 12, 2022
QT Py Media Knob using rotary encoder & neopixel ring

QTPy-Knob QT Py USB Media Knob using rotary encoder & neopixel ring The QTPy-Knob features: Media knob for volume up/down/mute with "qtpy-knob.py" Cir

Tod E. Kurt 56 Dec 30, 2022
Code for the paper "Curriculum Dropout", ICCV 2017

Curriculum Dropout Dropout is a very effective way of regularizing neural networks. Stochastically "dropping out" units with a certain probability dis

Pietro Morerio 21 Jan 02, 2022
Official implementation of VaxNeRF (Voxel-Accelearated NeRF).

VaxNeRF Paper | Google Colab This is the official implementation of VaxNeRF (Voxel-Accelearated NeRF). VaxNeRF provides very fast training and slightl

naruya 132 Nov 21, 2022
Prml - Repository of notes, code and notebooks in Python for the book Pattern Recognition and Machine Learning by Christopher Bishop

Pattern Recognition and Machine Learning (PRML) This project contains Jupyter notebooks of many the algorithms presented in Christopher Bishop's Patte

Gerardo Durán-Martín 1k Jan 07, 2023
pytorch implementation of trDesign

trdesign-pytorch This repository is a PyTorch implementation of the trDesign paper based on the official TensorFlow implementation. The initial port o

Learn Ventures Inc. 41 Dec 29, 2022
Robust Self-augmentation for NER with Meta-reweighting

Robust Self-augmentation for NER with Meta-reweighting

Lam chi 17 Nov 22, 2022
This repository contains the re-implementation of our paper deSpeckNet: Generalizing Deep Learning Based SAR Image Despeckling

deSpeckNet-TF-GEE This repository contains the re-implementation of our paper deSpeckNet: Generalizing Deep Learning Based SAR Image Despeckling publi

Adugna Mullissa 16 Sep 07, 2022