project page for VinVL

VinVL: Revisiting Visual Representations in Vision-Language Models

Updates

02/28/2021: Project page built.

Introduction

This repository is the project page for VinVL and contains the instructions needed to reproduce the results presented in the paper. We present a detailed study of improving visual representations for vision-language (VL) tasks and develop an improved object detection model to provide object-centric representations of images. Compared to the most widely used bottom-up and top-down model (code), the new model is bigger, better designed for VL tasks, and pre-trained on much larger training corpora that combine multiple public annotated object detection datasets. It can therefore generate representations of a richer collection of visual objects and concepts. While previous VL research focuses mainly on improving the vision-language fusion model and leaves the object detection model untouched, we show that visual features matter significantly in VL models. In our experiments, we feed the visual features generated by the new object detection model into a Transformer-based VL fusion model, OSCAR (code), and use an improved approach to pre-train the VL model and fine-tune it on a wide range of downstream VL tasks. Our results show that the new visual features significantly improve performance across all VL tasks, creating new state-of-the-art results on seven public benchmarks.

Performance

| Task | t2i | t2i | i2t | i2t | IC | IC | IC | IC | NoCaps | NoCaps | VQA | NLVR2 | GQA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Metric | R@1 | R@5 | R@1 | R@5 | B@4 | M | C | S | C | S | test-std | test-P | test-std |
| SoTA_S | 39.2 | 68.0 | 56.6 | 84.5 | 38.9 | 29.2 | 129.8 | 22.4 | 61.5 | 9.2 | 70.92 | 58.80 | 63.17 |
| SoTA_B | 54.0 | 80.8 | 70.0 | 91.1 | 40.5 | 29.7 | 137.6 | 22.8 | 86.58 | 12.38 | 73.67 | 79.30 | 61.62 |
| SoTA_L | 57.5 | 82.8 | 73.5 | 92.2 | 41.7 | 30.6 | 140.0 | 24.5 | - | - | 74.93 | 81.47 | - |
| ----- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| VinVL_B | 58.1 | 83.2 | 74.6 | 92.6 | 40.9 | 30.9 | 140.6 | 25.1 | 92.46 | 13.07 | 76.12 | 83.08 | 64.65 |
| VinVL_L | 58.8 | 83.5 | 75.4 | 92.9 | 41.0 | 31.1 | 140.9 | 25.2 | - | - | 76.62 | 83.98 | - |
| gain | 1.3 | 0.7 | 1.9 | 0.6 | -0.7 | 0.5 | 0.9 | 0.7 | 5.9 | 0.7 | 1.69 | 2.51 | 1.48 |

t2i: text-to-image retrieval; i2t: image-to-text retrieval; IC: image captioning on COCO.

Leaderboard results

VinVL has achieved the top position on several VL leaderboards, including Visual Question Answering (VQA), Microsoft COCO Image Captioning, Novel Object Captioning (nocaps), and Visual Commonsense Reasoning (VCR).

Comparison with image features from the bottom-up and top-down model (code)

We observe uniform improvements on seven VL tasks when the visual features from the bottom-up and top-down model are replaced with ours. The NoCaps baseline is from VIVO, and our results are obtained by directly replacing the visual features. The baselines for the remaining tasks are from OSCAR, and our results are obtained by replacing the visual features and performing OSCAR+ pre-training. All models are BERT-Base size. As analyzed in Section 5.2 of the VinVL paper, the new visual features contribute 95% of the improvement.

| Task | t2i | t2i | i2t | i2t | IC | IC | IC | IC | NoCaps | NoCaps | VQA | NLVR2 | GQA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Metric | R@1 | R@5 | R@1 | R@5 | B@4 | M | C | S | C | S | test-std | test-P | test-std |
| bottom-up and top-down model | 54.0 | 80.8 | 70.0 | 91.1 | 40.5 | 29.7 | 137.6 | 22.8 | 86.58 | 12.38 | 73.16 | 78.07 | 61.62 |
| VinVL (ours) | 58.1 | 83.2 | 74.6 | 92.6 | 40.9 | 30.9 | 140.6 | 25.1 | 92.46 | 13.07 | 75.95 | 83.08 | 64.65 |
| gain | 4.1 | 2.4 | 4.6 | 1.5 | 0.4 | 1.2 | 3.0 | 2.3 | 5.9 | 0.7 | 2.79 | 4.71 | 3.03 |
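
The gain row is the per-column difference between the VinVL score and the baseline score. The following is a minimal sketch of that arithmetic for a few selected columns, with the numbers copied from the table above. The dictionary keys and the `gains` helper are illustrative names, not part of the VinVL code, and the table reports some gains rounded to one decimal place.

```python
# Scores copied from the comparison table for a few selected columns.
baseline = {
    "t2i R@1": 54.0,
    "IC CIDEr": 137.6,
    "NoCaps CIDEr": 86.58,
    "VQA test-std": 73.16,
    "GQA test-std": 61.62,
}
vinvl = {
    "t2i R@1": 58.1,
    "IC CIDEr": 140.6,
    "NoCaps CIDEr": 92.46,
    "VQA test-std": 75.95,
    "GQA test-std": 64.65,
}


def gains(new, old, ndigits=2):
    """Return new[k] - old[k] for every key in old, rounded to ndigits."""
    return {key: round(new[key] - old[key], ndigits) for key in old}


gain = gains(vinvl, baseline)
for task, delta in gain.items():
    print(f"{task}: +{delta}")
```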

Please see the following two figures for visual comparison.

Source code

Pretrained Faster-RCNN model and feature extraction

The pretrained X152-C4 object-attribute detection model can be downloaded here. With code from our Scene Graph Benchmark Repo (to be released soon), one can extract features with the following command:

python tools/test_sg_net.py --config-file sgg_configs/vgattr/vinvl_x152c4.yaml TEST.IMS_PER_BATCH 2 MODEL.WEIGHT models/vinvl/vinvl_vg_x152c4.pth MODEL.ROI_HEADS.NMS_FILTER 1 MODEL.ROI_HEADS.SCORE_THRESH 0.2 DATA_DIR "../maskrcnn-benchmark-1/datasets1" TEST.IGNORE_BOX_REGRESSION True MODEL.ATTRIBUTE_ON True TEST.OUTPUT_FEATURE True

The output feature will be encoded as base64.
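
As a rough illustration of what reading such output could look like, the sketch below round-trips a toy feature matrix through base64 using only the Python standard library. The source states only that features are base64-encoded; the element type (little-endian float32), the row-major layout, and the helper names `encode_features` and `decode_features` are assumptions made for this example and should be checked against the actual extraction code.

```python
import base64
import struct


def encode_features(rows):
    """Pack equal-length rows as little-endian float32 and base64-encode them.

    Assumption: float32, row-major layout (not confirmed by the source).
    """
    flat = [value for row in rows for value in row]
    raw = struct.pack(f"<{len(flat)}f", *flat)
    return base64.b64encode(raw).decode("ascii")


def decode_features(b64_string, num_boxes):
    """Decode a base64 string into num_boxes rows of float32 values."""
    raw = base64.b64decode(b64_string)
    count = len(raw) // 4  # 4 bytes per float32 value
    if num_boxes <= 0 or count % num_boxes != 0:
        raise ValueError("buffer size is not divisible by num_boxes")
    flat = struct.unpack(f"<{count}f", raw)
    dim = count // num_boxes
    return [list(flat[i * dim:(i + 1) * dim]) for i in range(num_boxes)]


# Round trip on a toy matrix: 2 boxes, 3 feature values each.
toy = [[0.5, 1.0, -2.0], [3.0, 0.25, 8.0]]
encoded = encode_features(toy)
decoded = decode_features(encoded, num_boxes=2)
```

In practice the number of boxes would come from the detection output that accompanies each image, and the per-box feature dimension follows from the decoded buffer length.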

Find more pretrained models in DOWNLOAD.

Pre-extracted Image Features

For ease of use, we make pre-extracted features and predictions available for all pretraining datasets and downstream tasks. Please find the instructions to download them in DOWNLOAD.

Pretrained Oscar+ models and VL downstream tasks

The code to produce all vision-language results (both pretraining and downstream task finetuning) can be found in our OSCAR repo. One can find the model zoo for vision-language tasks here.

Citations

Please consider citing these papers if you use the code:

@article{li2020oscar,
  title={Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks},
  author={Li, Xiujun and Yin, Xi and Li, Chunyuan and Hu, Xiaowei and Zhang, Pengchuan and Zhang, Lei and Wang, Lijuan and Hu, Houdong and Dong, Li and Wei, Furu and Choi, Yejin and Gao, Jianfeng},
  journal={ECCV 2020},
  year={2020}
}

@article{zhang2021vinvl,
  title={VinVL: Making Visual Representations Matter in Vision-Language Models},
  author={Zhang, Pengchuan and Li, Xiujun and Hu, Xiaowei and Yang, Jianwei and Zhang, Lei and Wang, Lijuan and Choi, Yejin and Gao, Jianfeng},
  journal={CVPR 2021},
  year={2021}
}