History Aware Multimodal Transformer for Vision-and-Language Navigation

Last update: Nov 23, 2022

This repository is the official implementation of History Aware Multimodal Transformer for Vision-and-Language Navigation. Project webpage: https://cshizhe.github.io/projects/vln_hamt.html

Vision-and-language navigation (VLN) aims to build autonomous visual agents that follow instructions and navigate in real scenes. In this work, we introduce a History Aware Multimodal Transformer (HAMT) to incorporate a long-horizon history into multimodal decision making. HAMT efficiently encodes all the past panoramic observations via a hierarchical vision transformer. It then jointly combines text, history and current observation to predict the next action. We first train HAMT end-to-end using several proxy tasks, including single-step action prediction and spatial relation prediction, and then use reinforcement learning to further improve the navigation policy. HAMT achieves a new state of the art on a broad range of VLN tasks, including VLN with fine-grained instructions (R2R, RxR), high-level instructions (R2R-Last, REVERIE), dialogs (CVDN), as well as long-horizon VLN (R4R, R2R-Back).

Installation

Install the Matterport3D simulator by following the instructions in its repository. We use the latest version (all inputs and outputs are batched).

export PYTHONPATH=Matterport3DSimulator/build:$PYTHONPATH
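To confirm that the export took effect, you can ask Python whether it can locate the simulator bindings. This is a minimal sketch: it assumes the bindings are exposed as a module named `MatterSim`, and `simulator_available` is a hypothetical helper that is not part of this repository.

```python
import importlib.util


def simulator_available(module_name: str = "MatterSim") -> bool:
    """Return True if a top-level module with this name can be found on sys.path."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if simulator_available():
        print("MatterSim found")
    else:
        # Most likely cause: PYTHONPATH does not include Matterport3DSimulator/build
        print("MatterSim not found; check PYTHONPATH")
```

The check only locates the module; it does not load the simulator or verify that it was built correctly.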

Install requirements:

conda create --name vlnhamt python=3.8.5
conda activate vlnhamt
pip install torch==1.7.1+cu101 torchvision==0.8.2+cu101 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements.txt

Download the data from Dropbox (processed annotations, features and pretrained models) and put it in the `datasets` directory.
(Optional) If you want to train HAMT end-to-end, you should also download the original Matterport3D data.
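After unpacking, a short script can report which paths are still missing before you start a long run. The sketch below is an illustration only: the two paths are the ones referenced by the commands in this README, whether the download contains exactly these paths is an assumption, and `missing_paths` is a hypothetical helper.

```python
from pathlib import Path
from typing import Iterable, List

# Paths taken from the commands in this README. The full layout of the
# downloaded data is not documented here, so extend this tuple as needed.
EXPECTED_PATHS = (
    "R2R/connectivity",
    "R2R/features/pth_vit_base_patch16_224_imagenet.hdf5",
)


def missing_paths(root: str = "datasets",
                  expected: Iterable[str] = EXPECTED_PATHS) -> List[str]:
    """Return the entries of `expected` that do not exist under `root`."""
    root_dir = Path(root)
    return [rel for rel in expected if not (root_dir / rel).exists()]


if __name__ == "__main__":
    missing = missing_paths()
    if missing:
        print("Missing under datasets/:", ", ".join(missing))
    else:
        print("All expected paths found.")
```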

Extracting features (optional)

Scripts to extract visual features are in the `preprocess` directory:

CUDA_VISIBLE_DEVICES=0 python preprocess/precompute_img_features_vit.py \
    --model_name vit_base_patch16_224 --out_image_logits \
    --connectivity_dir datasets/R2R/connectivity \
    --scan_dir datasets/Matterport3D/v1_unzip_scans \
    --num_workers 4 \
    --output_file datasets/R2R/features/pth_vit_base_patch16_224_imagenet.hdf5

Training with proxy tasks

Stage 1: Pretrain with fixed ViT features

NODE_RANK=0
NUM_GPUS=4
CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch \
    --nproc_per_node=${NUM_GPUS} --node_rank $NODE_RANK \
    pretrain_src/main_r2r.py --world_size ${NUM_GPUS} \
    --model_config pretrain_src/config/r2r_model_config.json \
    --config pretrain_src/config/pretrain_r2r.json \
    --output_dir datasets/R2R/exprs/pretrain/cmt-vitbase-6tasks

Stage 2: Train ViT in an end-to-end manner

Reuse the Stage 1 command, replacing `pretrain_r2r.json` with `pretrain_r2r_e2e.json` in the `--config` argument.
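Since the two stages differ only in the config file, the launch command can be built from one template. The sketch below assembles the Stage 1 command for an arbitrary config name. It assumes that `pretrain_r2r_e2e.json` lives in `pretrain_src/config/` next to the Stage 1 config; the output directory is a placeholder chosen for illustration, and `pretrain_command` is a hypothetical helper, not part of this repository.

```python
from typing import List


def pretrain_command(config_name: str, output_dir: str,
                     num_gpus: int = 4, node_rank: int = 0) -> List[str]:
    """Build the Stage 1 launch command with a different config and output dir."""
    return [
        "python", "-m", "torch.distributed.launch",
        f"--nproc_per_node={num_gpus}", "--node_rank", str(node_rank),
        "pretrain_src/main_r2r.py", "--world_size", str(num_gpus),
        "--model_config", "pretrain_src/config/r2r_model_config.json",
        # Assumption: the config file sits in the same directory as in Stage 1.
        "--config", f"pretrain_src/config/{config_name}",
        "--output_dir", output_dir,
    ]


# Stage 2: same launch, different config. The output directory name is a
# placeholder for illustration, not a path used by the repository.
stage2 = pretrain_command("pretrain_r2r_e2e.json",
                          "datasets/R2R/exprs/pretrain/stage2-e2e")
print(" ".join(stage2))
```

The printed command still needs `CUDA_VISIBLE_DEVICES` set in the environment, as in the Stage 1 example.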

Fine-tuning for sequential action prediction

cd finetune_src
bash scripts/run_r2r.bash
bash scripts/run_r2r_back.bash
bash scripts/run_r2r_last.bash
bash scripts/run_r4r.bash
bash scripts/run_reverie.bash
bash scripts/run_cvdn.bash

Citation

If you find this work useful, please consider citing:

@InProceedings{chen2021hamt,
author       = {Chen, Shizhe and Guhur, Pierre-Louis and Schmid, Cordelia and Laptev, Ivan},
title        = {History Aware Multimodal Transformer for Vision-and-Language Navigation},
booktitle    = {NeurIPS},
year         = {2021},
}

Acknowledgement

Some of the code is built upon pytorch-image-models, UNITER and Recurrent-VLN-BERT. Thanks to the authors for their great work!

Owner

Shizhe Chen
