VD-BERT: A Unified Vision and Dialog Transformer with BERT

Related tags

Deep LearningVD-BERT
Overview

VD-BERT: A Unified Vision and Dialog Transformer with BERT

PyTorch Code for the following paper at EMNLP2020:
Title: VD-BERT: A Unified Vision and Dialog Transformer with BERT [pdf]
Authors: Yue Wang, Shafiq Joty, Michael R. Lyu, Irwin King, Caiming Xiong, Steven C.H. Hoi
Institute: Salesforce Research and CUHK
Abstract
Visual dialog is a challenging vision-language task, where a dialog agent needs to answer a series of questions through reasoning on the image content and dialog history. Prior work has mostly focused on various attention mechanisms to model such intricate interactions. By contrast, in this work, we propose VD-BERT, a simple yet effective framework of unified vision-dialog Transformer that leverages the pretrained BERT language models for Visual Dialog tasks. The model is unified in that (1) it captures all the interactions between the image and the multi-turn dialog using a single-stream Transformer encoder, and (2) it supports both answer ranking and answer generation seamlessly through the same architecture. More crucially, we adapt BERT for the effective fusion of vision and dialog contents via visually grounded training. Without the need of pretraining on external vision-language data, our model yields new state of the art, achieving the top position in both single-model and ensemble settings (74.54 and 75.35 NDCG scores) on the visual dialog leaderboard.

Framework illustration
VD-BERT framework

Installation

Package: Pytorch 1.1; We alo provide our Dockerfile and YAML file for setting up experiments in Google Cloud Platform (GCP).
Data: you can obtain the VisDial data from here
Visual features: we provide bottom-up attention visual features of VisDial v1.0 on data/img_feats1.0/. If you would like to extract visual features for other images, please refer to this docker image. We provide the running script on data/visual_extract_code.py, which should be used inside the provided bottom-up-attention image.

Code explanation

vdbert: store the main training and testing python files, data loader code, metrics and the ensemble code;

pytorch_pretrained_bert: mainly borrow from the Huggingface's pytorch-transformers v0.4.0;

  • modeling.py: we modify or add two classes: BertForPreTrainingLossMask and BertForVisDialGen;
  • rank_loss.py: three ranking methods: ListNet, ListMLE, approxNDCG;

sh: shell scripts to run the experiments

pred: store two json files for best single-model (74.54 NDCG) and ensemble model (75.35 NDCG)

model: You can download a pretrained model from https://storage.cloud.google.com/sfr-vd-bert-research/v1.0_from_BERT_e30.bin

Running experiments

Below the running example scripts for pretraining, finetuning (including dense annotation), and testing.

  • Pretraining bash sh/pretrain_v1.0_mlm_nsp_g4.sh
  • Finetuning for discriminative bash sh/finetune_v1.0_disc_g4.sh
  • Finetuning for discriminative specifically on dense annotation bash sh/finetune_v1.0_disc_dense_g4.sh
  • Finetuning for generative bash sh/finetune_v1.0_gen_g4.sh
  • Testing for discriminative on validation bash sh/test_v1.0_disc_val.sh
  • Testing for generative on validation bash sh/test_v1.0_gen_val.sh
  • Testing for discriminative on test bash sh/test_v1.0_disc_test.sh

Notation: mlm: masked language modeling, nsp: next sentence prediction, disc: discriminative, gen: generative, g4: 4 gpus, dense: dense annotation

Citation

If you find the code useful in your research, please consider citing our paper:

@inproceedings{
    wang2020vdbert,
    title={VD-BERT: A Unified Vision and Dialog Transformer with BERT},
    author={Yue Wang, Shafiq Joty, Michael R. Lyu, Irwin King, Caiming Xiong, Steven C.H. Hoi},
    booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020},
    year={2020},
}

License

This project is licensed under the terms of the MIT license.

Owner
Salesforce
A variety of vendor agnostic projects which power Salesforce
Salesforce
EasyMocap is an open-source toolbox for markerless human motion capture from RGB videos.

EasyMocap is an open-source toolbox for markerless human motion capture from RGB videos. In this project, we provide the basic code for fitt

ZJU3DV 2.2k Jan 05, 2023
Continual Learning of Electronic Health Records (EHR).

Continual Learning of Longitudinal Health Records Repo for reproducing the experiments in Continual Learning of Longitudinal Health Records (2021). Re

Jacob 7 Oct 21, 2022
Provably Rare Gem Miner.

Provably Rare Gem Miner just another random project by yoyoismee.eth useful link main site market contract useful thing you should know read contract

34 Nov 22, 2022
This repository contains the code for the paper ``Identifiable VAEs via Sparse Decoding''.

Sparse VAE This repository contains the code for the paper ``Identifiable VAEs via Sparse Decoding''. Data Sources The datasets used in this paper wer

Gemma Moran 17 Dec 12, 2022
Azion the best solution of Edge Computing in the world.

Azion Edge Function docker action Create or update an Edge Functions on Azion Edge Nodes. The domain name is the key for decision to a create or updat

8 Jul 16, 2022
Boston House Prediction Valuation Tool

Boston-House-Prediction-Valuation-Tool From Below Anlaysis The Valuation Tool is Designed Correlation Matrix Regrssion Analysis Between Target Vs Pred

0 Sep 09, 2022
NU-Wave: A Diffusion Probabilistic Model for Neural Audio Upsampling @ INTERSPEECH 2021 Accepted

NU-Wave β€” Official PyTorch Implementation NU-Wave: A Diffusion Probabilistic Model for Neural Audio Upsampling Junhyeok Lee, Seungu Han @ MINDsLab Inc

MINDs Lab 242 Dec 23, 2022
πŸš€ PyTorch Implementation of "Progressive Distillation for Fast Sampling of Diffusion Models(v-diffusion)"

PyTorch Implementation of "Progressive Distillation for Fast Sampling of Diffusion Models(v-diffusion)" Unofficial PyTorch Implementation of Progressi

Vitaliy Hramchenko 58 Dec 19, 2022
Translation-equivariant Image Quantizer for Bi-directional Image-Text Generation

Translation-equivariant Image Quantizer for Bi-directional Image-Text Generation Woncheol Shin1, Gyubok Lee1, Jiyoung Lee1, Joonseok Lee2,3, Edward Ch

Woncheol Shin 7 Sep 26, 2022
A set of Deep Reinforcement Learning Agents implemented in Tensorflow.

Deep Reinforcement Learning Agents This repository contains a collection of reinforcement learning algorithms written in Tensorflow. The ipython noteb

Arthur Juliani 2.2k Jan 01, 2023
Dirty Pixels: Towards End-to-End Image Processing and Perception

Dirty Pixels: Towards End-to-End Image Processing and Perception This repository contains the code for the paper Dirty Pixels: Towards End-to-End Imag

50 Nov 18, 2022
A platform to display the carbon neutralization information for researchers, decision-makers, and other participants in the community.

Welcome to Carbon Insight Carbon Insight is a platform aiming to display the carbon neutralization roadmap for researchers, decision-makers, and other

Microsoft 14 Oct 24, 2022
Few-Shot Object Detection via Association and DIscrimination

Few-Shot Object Detection via Association and DIscrimination Code release of our NeurIPS 2021 paper: Few-Shot Object Detection via Association and DIs

Cao Yuhang 49 Dec 18, 2022
Deploy recommendation engines with Edge Computing

RecoEdge: Bringing Recommendations to the Edge A one stop solution to build your recommendation models, train them and, deploy them in a privacy prese

NimbleEdge 131 Jan 02, 2023
Cross-Document Coreference Resolution

Cross-Document Coreference Resolution This repository contains code and models for end-to-end cross-document coreference resolution, as decribed in ou

Arie Cattan 29 Nov 28, 2022
A CV toolkit for my papers.

PyTorch-Encoding created by Hang Zhang Documentation Please visit the Docs for detail instructions of installation and usage. Please visit the link to

Hang Zhang 2k Jan 04, 2023
The repo of the preprinting paper "Labels Are Not Perfect: Inferring Spatial Uncertainty in Object Detection"

Inferring Spatial Uncertainty in Object Detection A teaser version of the code for the paper Labels Are Not Perfect: Inferring Spatial Uncertainty in

ZINING WANG 21 Mar 03, 2022
PyTorch Lightning + Hydra. A feature-rich template for rapid, scalable and reproducible ML experimentation with best practices. ⚑πŸ”₯⚑

Lightning-Hydra-Template A clean and scalable template to kickstart your deep learning project πŸš€ ⚑ πŸ”₯ Click on Use this template to initialize new re

Łukasz Zalewski 2.1k Jan 09, 2023
OpenDILab Multi-Agent Environment

Go-Bigger: Multi-Agent Decision Intelligence Environment GoBigger Doc (δΈ­ζ–‡η‰ˆ) Ongoing 2021.11.13 We are holding a competition β€”β€” Go-Bigger: Multi-Agent

OpenDILab 441 Jan 05, 2023
Multi-Content GAN for Few-Shot Font Style Transfer at CVPR 2018

MC-GAN in PyTorch This is the implementation of the Multi-Content GAN for Few-Shot Font Style Transfer. The code was written by Samaneh Azadi. If you

Samaneh Azadi 422 Dec 04, 2022