Overview

Episodic Transformers (E.T.)

Episodic Transformer for Vision-and-Language Navigation
Alexander Pashevich, Cordelia Schmid, Chen Sun

Episodic Transformer (E.T.) is a novel attention-based architecture for vision-and-language navigation. E.T. is based on a multimodal transformer that encodes language inputs and the full episode history of visual observations and actions. This code reproduces the results obtained with E.T. on the ALFRED benchmark. To learn more about the benchmark and the original code, please refer to the ALFRED repository.
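To make the encoding concrete, here is a minimal PyTorch sketch of the idea. This is not the repository's actual model or API: the class name, all layer sizes (d_model=256, 512-d visual features, a two-layer encoder), and the action readout are illustrative assumptions.

import torch
import torch.nn as nn

# Illustrative sketch only: one transformer self-attends over the concatenation
# of language tokens, visual-frame features, and past-action embeddings, then
# predicts the next action at each timestep of the episode.
class EpisodicTransformerSketch(nn.Module):
    def __init__(self, vocab_size=1000, num_actions=12, d_model=256, feat_dim=512):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_model)
        self.action_emb = nn.Embedding(num_actions, d_model)
        self.frame_proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.action_head = nn.Linear(d_model, num_actions)

    def forward(self, instr_tokens, frame_feats, past_actions):
        lang = self.word_emb(instr_tokens)   # (B, L, d_model)
        vis = self.frame_proj(frame_feats)   # (B, T, d_model)
        act = self.action_emb(past_actions)  # (B, T, d_model)
        fused = self.encoder(torch.cat([lang, vis, act], dim=1))
        # Read out action logits from the visual positions of the episode.
        start = instr_tokens.size(1)
        return self.action_head(fused[:, start:start + frame_feats.size(1)])

model = EpisodicTransformerSketch()
logits = model(torch.randint(0, 1000, (2, 10)),   # 10 instruction tokens
               torch.randn(2, 5, 512),            # 5 observed frames
               torch.randint(0, 12, (2, 5)))      # 5 past actions
print(logits.shape)  # torch.Size([2, 5, 12])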

Quickstart

Clone repo:

$ git clone https://github.com/alexpashevich/E.T..git ET
$ export ET_ROOT=$(pwd)/ET
$ export ET_LOGS=$ET_ROOT/logs
$ export ET_DATA=$ET_ROOT/data
$ export PYTHONPATH=$PYTHONPATH:$ET_ROOT

Install requirements:

$ virtualenv -p $(which python3.7) et_env
$ source et_env/bin/activate

$ cd $ET_ROOT
$ pip install --upgrade pip
$ pip install -r requirements.txt

Downloading data and checkpoints

Download ALFRED dataset:

$ cd $ET_DATA
$ sh download_data.sh json_feat

Copy pretrained checkpoints:

$ wget http://pascal.inrialpes.fr/data2/apashevi/et_checkpoints.zip
$ unzip et_checkpoints.zip
$ mv pretrained $ET_LOGS/

Render PNG images and create an LMDB dataset with natural language annotations:

$ python -m alfred.gen.render_trajs
$ python -m alfred.data.create_lmdb with args.visual_checkpoint=$ET_LOGS/pretrained/fasterrcnn_model.pth args.data_output=lmdb_human args.vocab_path=$ET_ROOT/files/human.vocab

Note #1: For rendering, you may need to configure args.x_display to match the number of an X server running on your machine.
Note #2: We do not use the JPG images from the full dataset, as they would differ from the images rendered during evaluation due to JPG compression.
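If no X server is available (e.g. on a headless machine), one common workaround is a virtual framebuffer. This is a hedged sketch, not from the repository's documentation: the Xvfb invocation is standard, but passing args.x_display as a command-line override (mirroring the create_lmdb syntax above) is an assumption, so adapt it to however your config reads the value.

$ Xvfb :1 -screen 0 1024x768x24 &   # start a virtual X server on display :1
$ python -m alfred.gen.render_trajs with args.x_display=1   # assumed override syntax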

Pretrained models evaluation

Evaluate an E.T. agent trained on human data only:

$ python -m alfred.eval.eval_agent with eval.exp=pretrained eval.checkpoint=et_human_pretrained.pth eval.object_predictor=$ET_LOGS/pretrained/maskrcnn_model.pth exp.num_workers=5 eval.eval_range=None exp.data.valid=lmdb_human

Note: make sure that your LMDB database is named exactly lmdb_human; the word embeddings won't be loaded otherwise.

Evaluate an E.T. agent trained on human and synthetic data:

$ python -m alfred.eval.eval_agent with eval.exp=pretrained eval.checkpoint=et_human_synth_pretrained.pth eval.object_predictor=$ET_LOGS/pretrained/maskrcnn_model.pth exp.num_workers=5 eval.eval_range=None exp.data.valid=lmdb_human

Note: For evaluation, you may need to configure eval.x_display to correspond to an X server number running on your machine.

E.T. with human data only

Train an E.T. agent:

$ python -m alfred.model.train with exp.model=transformer exp.name=et_s1 exp.data.train=lmdb_human train.seed=1

Evaluate the trained E.T. agent:

$ python -m alfred.eval.eval_agent with eval.exp=et_s1 eval.object_predictor=$ET_LOGS/pretrained/maskrcnn_model.pth exp.num_workers=5

Note: you may need to train up to 5 agents using different random seeds to reproduce the results of the paper.

E.T. with language pretraining

Language encoder pretraining with the translation objective:

$ python -m alfred.model.train with exp.model=speaker exp.name=translator exp.data.train=lmdb_human

Train an E.T. agent initialized with the pretrained language encoder:

$ python -m alfred.model.train with exp.model=transformer exp.name=et_synth_s1 exp.data.train=lmdb_human train.seed=1 exp.pretrained_path=translator

Evaluate the trained E.T. agent:

$ python -m alfred.eval.eval_agent with eval.exp=et_synth_s1 eval.object_predictor=$ET_LOGS/pretrained/maskrcnn_model.pth exp.num_workers=5

Note: you may need to train up to 5 agents using different random seeds to reproduce the results of the paper.

E.T. with joint training

You can also generate more synthetic trajectories using generate_trajs.py, create an LMDB dataset from them, and jointly train a model on human and synthetic data. Please refer to the original ALFRED code to learn more about the data generation. The steps to reproduce the results are the following (assembled into commands below):

  1. Generate 45K trajectories with alfred.gen.generate_trajs.
  2. Create a synthetic LMDB dataset called lmdb_synth_45K using args.visual_checkpoint=$ET_LOGS/pretrained/fasterrcnn_model.pth and args.vocab_path=$ET_ROOT/files/synth.vocab.
  3. Train an E.T. agent using exp.data.train=lmdb_human,lmdb_synth_45K.
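Assembled as commands, the pipeline might look like the following. This is a sketch: the create_lmdb and train invocations mirror the ones shown above, while exp.name=et_joint_s1 is an arbitrary illustrative name and generate_trajs may require additional arguments not shown here.

$ python -m alfred.gen.generate_trajs
$ python -m alfred.data.create_lmdb with args.visual_checkpoint=$ET_LOGS/pretrained/fasterrcnn_model.pth args.data_output=lmdb_synth_45K args.vocab_path=$ET_ROOT/files/synth.vocab
$ python -m alfred.model.train with exp.model=transformer exp.name=et_joint_s1 exp.data.train=lmdb_human,lmdb_synth_45K train.seed=1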

Citation

If you find this repository useful, please cite our work:

@misc{pashevich2021episodic,
  title={{Episodic Transformer for Vision-and-Language Navigation}},
  author={Alexander Pashevich and Cordelia Schmid and Chen Sun},
  year={2021},
  eprint={2105.06453},
  archivePrefix={arXiv},
}
Owner
Alex Pashevich
PhD student at Thoth (Inria Alpes, France)