Bilinear Attention Networks

This repository is the implementation of Bilinear Attention Networks for the visual question answering and Flickr30k Entities tasks.

For the visual question answering task, our single model achieved 70.35 and an ensemble of 15 models achieved 71.84 (Test-standard, VQA 2.0). For the Flickr30k Entities task, our single model achieved 69.88 / 84.39 / 86.40 for Recall@1, 5, and 10, respectively (slightly better than the original paper). For details, please refer to our technical report.

This repository is based on and inspired by @hengyuan-hu's work. We sincerely thank them for sharing their code.

[Figure: Overview of bilinear attention networks]

Updates

  • Bilinear attention networks using torch.einsum, backward-compatible; see the sketch below. (12 Mar 2019)
  • Now compatible with PyTorch v1.0.1. (12 Mar 2019)
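
As a rough illustration of what the einsum-based update computes, below is a minimal sketch of one-glimpse low-rank bilinear attention in PyTorch. All names, shapes, and the ReLU/softmax choices are illustrative assumptions, not the repository's exact implementation:

import torch

def bilinear_attention(x, y, U, V, p):
    # x: (B, N, D) question token features; y: (B, M, D) visual region features
    # U, V: (D, K) low-rank projections; p: (K,) pooling vector (one glimpse)
    x_ = torch.relu(x @ U)                               # (B, N, K)
    y_ = torch.relu(y @ V)                               # (B, M, K)
    # bilinear attention logits: sum_k p[k] * x_[b,n,k] * y_[b,m,k]
    logits = torch.einsum('bnk,k,bmk->bnm', x_, p, y_)
    att = torch.softmax(logits.flatten(1), dim=1).view_as(logits)
    # attention-weighted joint embedding, again via einsum
    return torch.einsum('bnm,bnk,bmk->bk', att, x_, y_)

x, y = torch.randn(2, 14, 512), torch.randn(2, 36, 512)  # toy shapes
U, V, p = torch.randn(512, 256), torch.randn(512, 256), torch.randn(256)
joint = bilinear_attention(x, y, U, V, p)                # (2, 256)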

Prerequisites

You may need a machine with 4 GPUs, 64GB memory, and PyTorch v1.0.1 for Python 3.

  1. Install PyTorch with CUDA and Python 3.6.
  2. Install h5py.

WARNING: do not use PyTorch v1.0.0 due to a bug that degrades performance.
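
If you want to guard against this in your own scripts, a one-line check such as the following works (a convenience suggestion, not part of this repository):

import torch
# PyTorch v1.0.0 has the bug mentioned above; insist on a different release.
assert torch.__version__ != '1.0.0', 'use PyTorch v1.0.1 instead of v1.0.0'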

VQA

Preprocessing

Our implementation uses the pretrained features from bottom-up-attention, with the adaptive 10-100 features per image, along with GloVe word vectors. For simplicity, the scripts below help you avoid the hassle of downloading and preprocessing these yourself.

All data should be downloaded to a data/ directory in the root directory of this repository.

The easiest way to download the data is to run the provided script tools/download.sh from the repository root. If the script does not work, it should be easy to examine the script and modify the steps outlined in it according to your needs. Then run tools/process.sh from the repository root to process the data to the correct format.
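
Assuming a POSIX shell at the repository root (adjust the invocation to your shell if needed), the two steps are:

$ sh tools/download.sh
$ sh tools/process.sh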

For now, you need to manually download the files for the options below (used in our best single model).

We use part of the Visual Genome dataset for data augmentation. The image metadata and question answers from Version 1.2 need to be placed in data/.

We use MS COCO captions to extract semantically connected words for the extended word embeddings, along with the questions of VQA 2.0 and Visual Genome. You can download them here. Since the contribution of these captions is minor, you can skip the processing of MS COCO captions by removing the cap elements in the target option in this line.
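
Purely for illustration, that edit could look like the following; the actual variable name and list contents in the preprocessing code are assumptions:

# Hypothetical: if the preprocessing target list looked like this,
# dropping the 'cap' entry would skip the MS COCO caption processing.
target = ['vqa', 'vg', 'cap']
target = [t for t in target if t != 'cap']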

The counting module (Zhang et al., 2018) is integrated into this repository as counting.py for your convenience. The source repository can be found at @Cyanogenoid's vqa-counting.
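
A hypothetical usage sketch of the integrated module follows; the constructor and forward signatures below follow @Cyanogenoid's vqa-counting as we recall it and may differ from counting.py, so treat them as assumptions:

import torch
from counting import Counter

# 'objects' caps the maximum count; both argument names are assumptions.
counter = Counter(objects=10, already_sigmoided=False)
boxes = torch.rand(8, 4, 36)        # (batch, 4 box coordinates, proposals)
attention = torch.rand(8, 36)       # (batch, proposals) attention weights
count_features = counter(boxes, attention)   # expected: (batch, objects + 1)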

Training

$ python3 main.py --use_both True --use_vg True

to start training (the options enable training on the train+val splits and with Visual Genome, respectively). The training and validation scores will be printed every epoch, and the best model will be saved under the directory "saved_models". The default hyperparameters should give you the best single-model result, around 70.04 on the test-dev split.

Validation

If you trained a model with the training split using

$ python3 main.py

then you can run evaluate.py with appropriate options to evaluate its score for the validation split.
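
For example, an assumed invocation (mirroring the Flickr30k usage further below; check evaluate.py for the exact option names), where --input points to the directory your model was saved in:

$ python3 evaluate.py --input saved_models/ban --epoch 12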

Pretrained model

We provide the pretrained model reported as the best single model in the paper (70.04 for test-dev, 70.35 for test-standard).

Please download the model via the link and move it to saved_models/ban/model_epoch12.pth (you may encounter a redirection page to confirm). The training log can be found here.

$ python3 test.py --label mytest

The result json file will be found in the directory results/.

Without Visual Genome augmentation

Without the Visual Genome augmentation, we get 69.50 (average of 8 models, with a standard deviation of 0.096) for the test-dev split. We use the 8-glimpse model, a learning rate starting at 0.001 (please see this change for better results), 13 epochs, and a batch size of 256.
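
A hypothetical command for reproducing this setting, reusing the flags shown above (--use_vg False is an assumption; the learning rate, epoch count, and batch size may need to be changed in the code if no corresponding flags exist):

$ python3 main.py --use_both True --use_vg False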

Flickr30k Entities

Preprocessing

You have to manually download the Annotation and Sentence files to data/flickr30k/Flickr30kEntities.tar.gz. Then run the provided scripts tools/download_flickr.sh and tools/process_flickr.sh from the root of this repository, similar to the VQA case. Note that the image features for Flickr30k were generated using the bottom-up-attention pretrained model.
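
Assuming a POSIX shell at the repository root, the two script steps would be:

$ sh tools/download_flickr.sh
$ sh tools/process_flickr.sh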

Training

$ python3 main.py --task flickr --out saved_models/flickr

to start training. The --gamma option is not applied here. The default hyperparameters should give you approximately 69.6 Recall@1 on the test split.

Validation

Please download the model via the link and move it to saved_models/flickr/model_epoch5.pth (you may encounter a redirection page to confirm).

$ python3 evaluate.py --task flickr --input saved_models/flickr --epoch 5

to evaluate the scores for the test split.

Troubleshooting

Please check the troubleshooting wiki and the previous issue history.

Citation

If you use this code as part of any published research, we'd really appreciate it if you could cite the following paper:

@inproceedings{Kim2018,
author = {Kim, Jin-Hwa and Jun, Jaehyun and Zhang, Byoung-Tak},
booktitle = {Advances in Neural Information Processing Systems 31},
title = {{Bilinear Attention Networks}},
pages = {1571--1581},
year = {2018}
}

License

MIT License
