[CVPR'21 Oral] Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning

Last update: Dec 13, 2022

By Zhicheng Huang*, Zhaoyang Zeng*, Yupan Huang*, Bei Liu, Dongmei Fu and Jianlong Fu

Introduction

This is the official implementation of the paper. We propose SOHO ("See Out of tHe bOx"), which takes a whole image as input and learns vision-language representations in an end-to-end manner. SOHO does not require bounding box annotations, which enables inference 10 times faster than region-based approaches.

Architecture

Release Progress

VQA Codebase
Pre-training Codebase
Other Downstream Tasks

Installation

conda create -n soho python=3.7
conda activate soho
git clone https://github.com/researchmm/soho.git
cd soho
bash tools/install.sh
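
The commands in the later sections refer to $SOHO_ROOT, which the steps above do not set. A minimal sketch, assuming the variable is meant to point at the cloned repository root and that the shell is still inside the soho directory:

```shell
# Assumption: the current directory is the cloned soho repository.
# SOHO_ROOT is not defined by the installation commands, so set it here.
export SOHO_ROOT="$(pwd)"
echo "SOHO_ROOT=$SOHO_ROOT"
```

Adding the export line to a shell profile would keep the variable available in new terminals.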

Getting Started

Download the training, validation and test data

mkdir -p $SOHO_ROOT/data/coco
cd $SOHO_ROOT/data/coco
# needs to be updated
wget https://vqasc.blob.core.windows.net/t-zhihuawork/code_10/MultiScalePretrain/data/coco/train2014.zip
wget https://vqasc.blob.core.windows.net/t-zhihuawork/code_10/MultiScalePretrain/data/coco/val2014.zip
wget https://vqasc.blob.core.windows.net/t-zhihuawork/code_10/MultiScalePretrain/data/coco/test2015.zip
wget https://vqasc.blob.core.windows.net/t-zhihuawork/code_10/MultiScalePretrain/data/coco/train_data_qa_caption_new_box.json
wget https://vqasc.blob.core.windows.net/t-zhihuawork/code_10/MultiScalePretrain/data/coco/val_data_qa_caption_new_box.json
wget https://vqasc.blob.core.windows.net/t-zhihuawork/code_10/MultiScalePretrain/data/coco/test_data_qa.json
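
Since the download links are marked as needing an update, it may help to confirm that all six files arrived before training. The sketch below only checks for the file names used in the wget commands above; the helper name check_coco_files is hypothetical and not part of the repository.

```shell
# List the files from the download step that are missing (or empty) in a directory.
# Prints one missing file name per line; returns non-zero if any are missing.
# check_coco_files is a hypothetical helper, not part of the SOHO codebase.
check_coco_files() {
    dir="$1"
    missing=0
    for f in train2014.zip val2014.zip test2015.zip \
             train_data_qa_caption_new_box.json \
             val_data_qa_caption_new_box.json \
             test_data_qa.json; do
        if [ ! -s "$dir/$f" ]; then
            echo "$f"
            missing=1
        fi
    done
    return "$missing"
}

# Usage: check the default data directory when SOHO_ROOT is set.
if [ -n "${SOHO_ROOT:-}" ]; then
    check_coco_files "$SOHO_ROOT/data/coco" || echo "some files are missing"
fi
```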

Download the pre-trained models

cd $SOHO_ROOT
mkdir -p $SOHO_ROOT/pretrained
cd $SOHO_ROOT/pretrained
# the following needs to be updated
wget

Training a VQA model

cd $SOHO_ROOT
# use 8 GPUs to train the model
bash tools/dist_train.sh configs/VQA/soho_res18_vqa.py 8
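
On a machine with fewer than 8 GPUs, the command would presumably need a different count. The sketch below assumes the last argument of tools/dist_train.sh is the number of GPUs, as the comment above suggests; it only prints the command rather than running it. The helpers count_gpus and train_cmd are hypothetical, and whether the config (for example, batch size or learning rate) needs adjusting for a different GPU count is not stated in this README.

```shell
# Count visible NVIDIA GPUs; prints 0 if nvidia-smi is unavailable.
# count_gpus and train_cmd are hypothetical helpers, not part of the repository.
count_gpus() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        # nvidia-smi -L prints one "GPU <index>: ..." line per device.
        nvidia-smi -L 2>/dev/null | grep -c '^GPU ' || true
    else
        echo 0
    fi
}

# Build (but do not run) the training command for a given GPU count.
# Assumption: the last argument of tools/dist_train.sh is the GPU count.
train_cmd() {
    echo "bash tools/dist_train.sh configs/VQA/soho_res18_vqa.py $1"
}

n="$(count_gpus)"
if [ "$n" -gt 0 ]; then
    train_cmd "$n"
else
    echo "no GPU detected"
fi
```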

Evaluating a VQA model

bash tools/dist_test_vqa.sh configs/VQA/soho_res18_vqa.py 18 8

Citation

If you find this repo useful in your research, please consider citing the following papers:

@inproceedings{huang2021seeing,
  title={Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning},
  author={Huang, Zhicheng and Zeng, Zhaoyang and Huang, Yupan and Liu, Bei and Fu, Dongmei and Fu, Jianlong},
  booktitle={The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2021}
}

@article{huang2020pixel,
  title={Pixel-bert: Aligning image pixels with text by deep multi-modal transformers},
  author={Huang, Zhicheng and Zeng, Zhaoyang and Liu, Bei and Fu, Dongmei and Fu, Jianlong},
  journal={arXiv preprint arXiv:2004.00849},
  year={2020}
}

Acknowledgements

We would like to thank mmcv and mmdetection. Our commons lib is based on mmcv.

Owner

Multimedia Research
