End-to-End Referring Video Object Segmentation with Multimodal Transformers

This repo contains the official implementation of the paper:

End-to-End Referring Video Object Segmentation with Multimodal Transformers


How to Run the Code

First, clone this repo to your local machine using:

git clone https://github.com/mttr2021/MTTR.git

Dataset Requirements

A2D-Sentences

Follow the instructions here to download the dataset. Then, extract and organize the files inside your cloned repo directory as follows (note that only the necessary files are shown):

MTTR/
└── a2d_sentences/ 
    ├── Release/
    │   ├── videoset.csv  (videos metadata file)
    │   └── CLIPS320/
    │       └── *.mp4     (video files)
    └── text_annotations/
        ├── a2d_annotation.txt  (actual text annotations)
        ├── a2d_missed_videos.txt
        └── a2d_annotation_with_instances/ 
            └── */ (video folders)
                └── *.h5 (annotations files)
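
Before running anything, it can help to confirm that the extracted files match the layout above. The following is a minimal sketch and not part of the MTTR code base: the function name `missing_a2d_paths` is hypothetical, and the path list is simply copied from the tree above.

```python
from pathlib import Path

# Relative paths copied from the directory tree above.
A2D_REQUIRED_PATHS = [
    "a2d_sentences/Release/videoset.csv",
    "a2d_sentences/Release/CLIPS320",
    "a2d_sentences/text_annotations/a2d_annotation.txt",
    "a2d_sentences/text_annotations/a2d_missed_videos.txt",
    "a2d_sentences/text_annotations/a2d_annotation_with_instances",
]


def missing_a2d_paths(repo_root):
    """Return the required A2D-Sentences paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in A2D_REQUIRED_PATHS if not (root / rel).exists()]


if __name__ == "__main__":
    # "MTTR" is assumed to be the cloned repo directory, relative to the current directory.
    missing = missing_a2d_paths("MTTR")
    if missing:
        print("Missing paths:")
        for rel in missing:
            print("  " + rel)
    else:
        print("A2D-Sentences layout looks complete.")
```

The same idea should carry over to the other datasets by swapping in the paths from their respective trees.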

JHMDB-Sentences

Follow the instructions here to download the dataset. Then, extract and organize the files inside your cloned repo directory as follows (note that only the necessary files are shown):

MTTR/
└── jhmdb_sentences/ 
    ├── Rename_Images/  (frame images)
    │   └── */ (action dirs)
    ├── puppet_mask/  (mask annotations)
    │   └── */ (action dirs)
    └── jhmdb_annotation.txt  (text annotations)

Refer-YouTube-VOS

Download the dataset from the competition's website here.

Note that you may be required to sign up to the competition in order to get access to the dataset. This registration process is free and short.

Then, extract and organize the files inside your cloned repo directory as follows (note that only the necessary files are shown):

MTTR/
└── refer_youtube_vos/ 
    ├── train/
    │   ├── JPEGImages/
    │   │   └── */ (video folders)
    │   │       └── *.jpg (frame image files) 
    │   └── Annotations/
    │       └── */ (video folders)
    │           └── *.png (mask annotation files) 
    ├── valid/
    │   └── JPEGImages/
    │       └── */ (video folders)
    │           └── *.jpg (frame image files) 
    └── meta_expressions/
        ├── train/
        │   └── meta_expressions.json  (text annotations)
        └── valid/
            └── meta_expressions.json  (text annotations)

Environment Installation

The code was tested in a Conda environment on Ubuntu 18.04. Install Conda and then create an environment as follows:

conda create -n mttr python=3.9.7 pip -y

conda activate mttr

PyTorch 1.10:

conda install pytorch==1.10.0 torchvision==0.11.1 -c pytorch -c conda-forge

Note that the command above does not pin a cudatoolkit version. Depending on your system's CUDA version, you might have to add an explicit cudatoolkit version to it.

Hugging Face transformers 4.11.3:

pip install transformers==4.11.3

COCO API (for mAP calculations):

pip install -U 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'

Additional required packages:

pip install h5py wandb opencv-python protobuf av einops ruamel.yaml timm joblib

conda install -c conda-forge pandas matplotlib cython scipy cupy
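
Once the packages are installed, the pinned versions can be checked from Python. This is a rough sketch using the standard library's `importlib.metadata`; the helper name `check_versions` is hypothetical, and it assumes the packages are registered under the distribution names `torch`, `torchvision`, and `transformers`.

```python
from importlib import metadata

# Versions pinned in the installation commands above.
PINNED_VERSIONS = {
    "torch": "1.10.0",
    "torchvision": "0.11.1",
    "transformers": "4.11.3",
}


def check_versions(pinned):
    """Map each distribution name to 'ok', 'missing', or 'mismatch (found X)'."""
    report = {}
    for name, expected in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        # Some builds append a local tag such as '+cu113'; ignore it when comparing.
        base = installed.split("+")[0]
        report[name] = "ok" if base == expected else f"mismatch (found {installed})"
    return report


if __name__ == "__main__":
    for name, status in check_versions(PINNED_VERSIONS).items():
        print(f"{name}: {status}")
```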

Running Configuration

The following table lists the parameters which can be configured directly from the command line.

The rest of the running/model parameters for each dataset can be configured in configs/DATASET_NAME.yaml.

Note that to run the code, the path to the relevant .yaml config file must be supplied using the -c parameter.

Parameter	Description
-c	path to dataset configuration file
-rm	running mode (train/eval)
-ws	window size
-bs	training batch size per GPU
-ebs	eval batch size per GPU (if not provided, training batch size is used)
-ng	number of GPUs to run on
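
To illustrate how these flags fit together, here is a hypothetical `argparse` sketch that mirrors the table. It is not the repo's actual `main.py`; the destination names (`config_path`, `running_mode`, and so on) are invented for the example, and only the flags and the -ebs fallback come from the table.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags in the table above."""
    parser = argparse.ArgumentParser(description="Sketch of the MTTR command-line flags")
    parser.add_argument("-c", dest="config_path", required=True,
                        help="path to dataset configuration file")
    parser.add_argument("-rm", dest="running_mode", choices=["train", "eval"],
                        help="running mode")
    parser.add_argument("-ws", dest="window_size", type=int, help="window size")
    parser.add_argument("-bs", dest="batch_size", type=int,
                        help="training batch size per GPU")
    parser.add_argument("-ebs", dest="eval_batch_size", type=int,
                        help="eval batch size per GPU")
    parser.add_argument("-ng", dest="num_gpus", type=int,
                        help="number of GPUs to run on")
    return parser


def parse_args(argv):
    """Parse argv and apply the fallback described in the table."""
    args = build_parser().parse_args(argv)
    # Per the table: if -ebs is not provided, the training batch size is used.
    if args.eval_batch_size is None:
        args.eval_batch_size = args.batch_size
    return args


if __name__ == "__main__":
    example = ["-c", "configs/a2d_sentences.yaml", "-rm", "eval",
               "-ws", "10", "-bs", "3", "-ng", "2"]
    print(parse_args(example))
```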

Evaluation

The following commands can be used to reproduce the main results of our paper using the supplied checkpoint files.

The commands were tested on RTX 3090 24GB GPUs, but it may be possible to run some of them using GPUs with less memory by decreasing the batch-size -bs parameter.

A2D-Sentences

Window Size	Command	Checkpoint File	mAP Result
10	`python main.py -rm eval -c configs/a2d_sentences.yaml -ws 10 -bs 3 -ckpt CHECKPOINT_PATH -ng 2`	Link	46.1
8	`python main.py -rm eval -c configs/a2d_sentences.yaml -ws 8 -bs 3 -ckpt CHECKPOINT_PATH -ng 2`	Link	44.7

JHMDB-Sentences

The following commands evaluate our A2D-Sentences-pretrained model on JHMDB-Sentences without additional training.

For this purpose, as explained in our paper, we uniformly sample three frames from each video. To ensure proper reproduction of our results on other machines, we include the metadata of the sampled frames under datasets/jhmdb_sentences/jhmdb_sentences_samples_metadata.json. This file is loaded automatically during evaluation with the commands below.

To avoid using this file and force sampling different frames, change the seed and generate_new_samples_metadata parameters under MTTR/configs/jhmdb_sentences.yaml.
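
The role of the seed can be illustrated with a small sketch of seeded frame sampling. This is not the repo's sampling code: it assumes "uniformly sample" means drawing three distinct frame indices uniformly at random, and the function name `build_samples_metadata` and the input format (video id mapped to frame count) are hypothetical.

```python
import json
import random


def build_samples_metadata(video_lengths, num_samples=3, seed=0):
    """Pick num_samples distinct frame indices per video, uniformly at random.

    video_lengths maps a video id to its number of frames (hypothetical format).
    """
    rng = random.Random(seed)  # private generator: output depends only on the seed
    metadata = {}
    for video_id in sorted(video_lengths):  # fixed order keeps the draw reproducible
        num_frames = video_lengths[video_id]
        if num_frames < num_samples:
            raise ValueError(f"{video_id} has fewer than {num_samples} frames")
        metadata[video_id] = sorted(rng.sample(range(num_frames), num_samples))
    return metadata


if __name__ == "__main__":
    lengths = {"video_a": 40, "video_b": 25}  # made-up example lengths
    print(json.dumps(build_samples_metadata(lengths, seed=0), indent=2))
```

Storing the resulting dictionary as JSON, as the supplied metadata file does, means the same frames are evaluated regardless of the machine or random number generator state.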

Window Size	Command	Checkpoint File	mAP Result
10	`python main.py -rm eval -c configs/jhmdb_sentences.yaml -ws 10 -bs 3 -ckpt CHECKPOINT_PATH -ng 2`	Link	39.2
8	`python main.py -rm eval -c configs/jhmdb_sentences.yaml -ws 8 -bs 3 -ckpt CHECKPOINT_PATH -ng 2`	Link	36.6

Refer-YouTube-VOS

The following command evaluates our model on the public validation subset of the Refer-YouTube-VOS dataset. Since annotations are not publicly available for this subset, our code generates a zip file with the predicted masks under MTTR/runs/[RUN_DATE_TIME]/validation_outputs/submission_epoch_0.zip. This zip file needs to be uploaded to the competition server for evaluation. For your convenience, we supply this zip file here as well.
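
Since the run directory name depends on the date and time, a small helper can locate the generated zip files. This is a sketch only: the function name `find_submission_zips` is hypothetical, and the `submission_epoch_*.zip` pattern generalizes the single file name given above, which is an assumption.

```python
from pathlib import Path


def find_submission_zips(repo_root):
    """Return submission zips under runs/*/validation_outputs/, oldest first."""
    # Pattern generalized from the path above (assumption: one zip per epoch).
    pattern = "runs/*/validation_outputs/submission_epoch_*.zip"
    return sorted(Path(repo_root).glob(pattern), key=lambda p: p.stat().st_mtime)


if __name__ == "__main__":
    # "MTTR" is assumed to be the cloned repo directory.
    for zip_path in find_submission_zips("MTTR"):
        print(zip_path)
```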

Window Size	Command	Checkpoint File	Output Zip	J&F Result
12	`python main.py -rm eval -c configs/refer_youtube_vos.yaml -ws 12 -bs 1 -ckpt CHECKPOINT_PATH -ng 8`	Link	Link	55.32

Training

First, download the Kinetics-400 pretrained weights of Video Swin Transformer from this link. Note that these weights were originally published in the Video Swin Transformer repo here.

Place the downloaded file inside your cloned repo directory as MTTR/pretrained_swin_transformer/swin_tiny_patch244_window877_kinetics400_1k.pth.

Next, the following commands can be used to train MTTR as described in our paper.

Note that it may be possible to run some of these commands on GPUs with less memory than the ones mentioned below by decreasing the batch-size -bs or window-size -ws parameters. However, changing these parameters may also affect the final performance of the model.
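
As a rough guide to how these parameters trade off, the number of frames handled in one step can be estimated from the command-line values. The sketch below assumes each batch element is a single window of -ws frames and that each GPU processes one batch per step; the helper names are hypothetical, and actual memory use also depends on resolution and model settings not covered here.

```python
def frames_per_gpu(window_size, batch_size_per_gpu):
    """Frames held on one GPU per step: windows per GPU x frames per window."""
    return window_size * batch_size_per_gpu


def frames_per_step(window_size, batch_size_per_gpu, num_gpus):
    """Frames processed across all GPUs per step."""
    return frames_per_gpu(window_size, batch_size_per_gpu) * num_gpus


if __name__ == "__main__":
    # (-ws, -bs, -ng) values taken from the training commands in this section.
    configs = {
        "a2d_sentences, ws 10": (10, 3, 2),
        "a2d_sentences, ws 8": (8, 2, 3),
        "refer_youtube_vos, ws 12": (12, 1, 4),
        "refer_youtube_vos, ws 8": (8, 1, 8),
    }
    for name, (ws, bs, ng) in configs.items():
        print(name, frames_per_gpu(ws, bs), frames_per_step(ws, bs, ng))
```

Under these assumptions, lowering either -bs or -ws reduces the per-GPU frame count, which is why both can help on GPUs with less memory.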

A2D-Sentences

The command for the following configuration was tested on 2 A6000 48GB GPUs:

Window Size	Command
10	`python main.py -rm train -c configs/a2d_sentences.yaml -ws 10 -bs 3 -ng 2`

The command for the following configuration was tested on 3 RTX 3090 24GB GPUs:

Window Size	Command
8	`python main.py -rm train -c configs/a2d_sentences.yaml -ws 8 -bs 2 -ng 3`

Refer-YouTube-VOS

The command for the following configuration was tested on 4 A6000 48GB GPUs:

Window Size	Command
12	`python main.py -rm train -c configs/refer_youtube_vos.yaml -ws 12 -bs 1 -ng 4`

The command for the following configuration was tested on 8 RTX 3090 24GB GPUs:

Window Size	Command
8	`python main.py -rm train -c configs/refer_youtube_vos.yaml -ws 8 -bs 1 -ng 8`

Note that this last configuration was not mentioned in our paper. However, it is more memory efficient than the original configuration (window size 12) while producing a model which is almost as good (J&F of 54.56 in our experiments, compared with 55.32 for window size 12).

JHMDB-Sentences

As explained in our paper, JHMDB-Sentences is used exclusively for evaluation, so training is not supported at this time for this dataset.
