Code release for "MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound"

Last update: Dec 11, 2022

merlot_reserve

MERLOT Reserve (in submission) is a model for learning joint representations of vision, language, and sound from YouTube. The learned model can be used in a zero-shot or finetuned setting, where it does well on tasks like VCR and TVQA.

Visit our project page at rowanzellers.com/merlotreserve or read the full paper to learn more.

What's here

We are releasing the following:

JAX code and model checkpoints for the MERLOT Reserve model
Code for pretraining the model
Code for finetuning the model on VCR and TVQA
Code for doing zero-shot inference with the model

Environment and setup

There are three different ways to run MERLOT Reserve:

Pretraining on videos: You'll need a TPU Pod VM for this. This step shouldn't be necessary for most people, as we have released model checkpoints.
Finetuning on VCR or TVQA: I've done this on a TPU v3-8 VM. This should be possible on GPU(s), but I haven't tested it on such hardware.
Zero-shot inference: I've run this on a GPU (even an older Titan X from 2016 works).

Installation on a GPU Machine

Install CUDA 11.4 (I used this link) and cuDNN 8.2. You might have to add something like this to your shell environment so the CUDA libraries can be found:

export LD_LIBRARY_PATH=/usr/local/cuda/lib64
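
To confirm that the export took effect in the current shell, a small check like the following can help. This is a sketch and not part of the repository: the helper name `cuda_lib_dir_listed` is hypothetical, and its default path simply mirrors the export line above.

```python
import os


def cuda_lib_dir_listed(lib_dir="/usr/local/cuda/lib64", env=None):
    """Return True if lib_dir is one of the entries in LD_LIBRARY_PATH.

    `env` defaults to the real process environment; passing a dict
    makes the check easy to try out without touching the shell.
    """
    env = os.environ if env is None else env
    # LD_LIBRARY_PATH is a colon-separated list of directories.
    entries = env.get("LD_LIBRARY_PATH", "").split(":")
    wanted = os.path.normpath(lib_dir)
    # Skip empty entries and ignore trailing slashes when comparing.
    return any(e != "" and os.path.normpath(e) == wanted for e in entries)


if __name__ == "__main__":
    print("CUDA lib dir on LD_LIBRARY_PATH:", cuda_lib_dir_listed())
```

If this prints `False` in a new terminal, the export line probably needs to go into your shell startup file.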

Create the environment:

conda create --name mreserve python=3.8 && conda activate mreserve
conda install -y python=3.8 tqdm numpy pyyaml scipy ipython cython typing h5py pandas matplotlib

# Install jax (quoted so that shells such as zsh do not expand the brackets)
pip install "jax[cuda11_cudnn82]" -f https://storage.googleapis.com/jax-releases/jax_releases.html
# If doing this on TPUs instead of locally...
# pip install "jax[tpu]>=0.2.18" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html

# This is needed sometimes https://stackoverflow.com/questions/66060487/valueerror-numpy-ndarray-size-changed-may-indicate-binary-incompatibility-exp
pip uninstall -y numpy
pip install numpy==1.19.5

pip install -r requirements.txt

You can then try out the interactive script at demo/demo_video.py. It will handle downloading the model checkpoint for you.
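
Before running the demo, it can be worth checking which devices JAX actually sees, since a CUDA or cuDNN mismatch usually shows up as JAX falling back to CPU. The snippet below is a minimal sketch and not part of the repository; `describe_jax_backend` is a hypothetical helper name. It only relies on `jax.devices()` and the `platform` attribute of each device.

```python
def describe_jax_backend():
    """Return a one-line summary of the devices JAX can see."""
    try:
        import jax
    except ImportError:
        return "jax is not installed in this environment"
    devices = jax.devices()
    # Each device reports a platform string such as "cpu", "gpu" or "tpu".
    platforms = sorted({d.platform for d in devices})
    return f"jax {jax.__version__}: {len(devices)} device(s) on {', '.join(platforms)}"


if __name__ == "__main__":
    print(describe_jax_backend())
```

If the summary lists only `cpu` on a GPU machine, revisit the CUDA, cuDNN, and `LD_LIBRARY_PATH` steps above.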

Installation on a Cloud TPU VM

See the instructions in pretrain/ to set up your environment on a TPU v3-8 VM.

Checkpoints

These should be downloaded automatically if you use PretrainedMerlotReserve in mreserve/modeling.py. All are Flax checkpoint files:

# pretrained checkpoints
gs://merlotreserve/ckpts/base
gs://merlotreserve/ckpts/base_resadapt
gs://merlotreserve/ckpts/large
gs://merlotreserve/ckpts/large_resadapt

# finetuned checkpoints
gs://merlotreserve/vcr_ckpts/vcr_finetune_base
gs://merlotreserve/vcr_ckpts/vcr_finetune_large

gs://merlotreserve/tvqa_ckpts/tvqa_finetune_base
gs://merlotreserve/tvqa_ckpts/tvqa_finetune_large

# TVQA Data
gs://merlotreserve/finetune_data/tvqa/

# VCR data
gs://merlotreserve/finetune_data/vcr/
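
The checkpoint paths above follow a regular naming pattern, so they can be built programmatically if you want to fetch them yourself. The sketch below is not part of the repository: `pretrained_uri` and `finetuned_uri` are hypothetical helper names, and the pattern is inferred only from the paths listed here.

```python
BUCKET = "gs://merlotreserve"
SIZES = ("base", "large")
TASKS = ("vcr", "tvqa")


def pretrained_uri(size, resadapt=False):
    """Build the path of a pretrained checkpoint, e.g. .../ckpts/base_resadapt."""
    if size not in SIZES:
        raise ValueError(f"size must be one of {SIZES}, got {size!r}")
    suffix = "_resadapt" if resadapt else ""
    return f"{BUCKET}/ckpts/{size}{suffix}"


def finetuned_uri(task, size):
    """Build the path of a finetuned checkpoint, e.g. .../vcr_ckpts/vcr_finetune_base."""
    if task not in TASKS:
        raise ValueError(f"task must be one of {TASKS}, got {task!r}")
    if size not in SIZES:
        raise ValueError(f"size must be one of {SIZES}, got {size!r}")
    return f"{BUCKET}/{task}_ckpts/{task}_finetune_{size}"


if __name__ == "__main__":
    # Print every checkpoint path listed in this section.
    for size in SIZES:
        print(pretrained_uri(size))
        print(pretrained_uri(size, resadapt=True))
    for task in TASKS:
        for size in SIZES:
            print(finetuned_uri(task, size))
```

Rejecting unknown sizes and tasks up front keeps a typo from turning into a path that does not exist in the bucket.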

Owner

Rowan Zellers
