SeqTR: A Simple yet Universal Network for Visual Grounding

Last update: Dec 24, 2022

Overview

SeqTR

This is the official implementation of SeqTR: A Simple yet Universal Network for Visual Grounding, which simplifies and unifies the modelling for visual grounding tasks under a novel point prediction paradigm.

Installation

Prerequisites

pip install -r requirements.txt
wget https://github.com/explosion/spacy-models/releases/download/en_vectors_web_lg-2.1.0/en_vectors_web_lg-2.1.0.tar.gz -O en_vectors_web_lg-2.1.0.tar.gz
pip install en_vectors_web_lg-2.1.0.tar.gz

Then install SeqTR package in editable mode:

pip install -e .

Data Preparation

Download our preprocessed json files including the merged dataset for pre-training, and DarkNet-53 model weights trained on MS-COCO object detection task.
Download the train2014 images from mscoco or from Joseph Redmon's mscoco mirror, of which the download speed is faster than the official website.
Download original Flickr30K images, ReferItGame images, and Visual Genome images.

The project structure should look like the following:

| -- SeqTR
     | -- data
        | -- annotations
            | -- flickr30k
                | -- instances.json
                | -- ix_to_token.pkl
                | -- token_to_ix.pkl
                | -- word_emb.npz
            | -- referitgame-berkeley
            | -- refcoco-unc
            | -- refcocoplus-unc
            | -- refcocog-umd
            | -- refcocog-google
            | -- pretraining-vg 
        | -- weights
            | -- darknet.weights
            | -- yolov3.weights
        | -- images
            | -- mscoco
                | -- train2014
                    | -- COCO_train2014_000000000072.jpg
                    | -- ...
            | -- saiaprtc12
                | -- 25.jpg
                | -- ...
            | -- flickr30k
                | -- 36979.jpg
                | -- ...
            | -- visual-genome
                | -- 2412112.jpg
                | -- ...
     | -- configs
     | -- seqtr
     | -- tools
     | -- teaser

Note that the darknet.weights excludes val/test images of RefCOCO/+/g datasets while yolov3.weights does not.

Training

Phrase Localization and Referring Expression Comprehension

We train SeqTR to perform grouning at bounding box level on a single V100 GPU. The following script performs the training:

python tools/train.py configs/seqtr/detection/seqtr_det_[DATASET_NAME].py --cfg-options ema=True

[DATASET_NAME] is one of "flickr30k", "referitgame-berkeley", "refcoco-unc", "refcocoplus-unc", "refcocog-umd", and "refcocog-google".

Referring Expression Segmentation

To train SeqTR to generate the target sequence of ground-truth mask, which is then assembled into the predicted mask by connecting the points, run the following script:

python tools/train.py configs/seqtr/segmentation/seqtr_mask_[DATASET_NAME].py --cfg-options ema=True

Note that instead of sampling 18 points and does not shuffle the sequence for RefCOCO dataset, for RefCOCO+ and RefCOCOg, we uniformly sample 12 points on the mask contour and randomly shffle the sequence with 20% percentage. Therefore, to execute the training on RefCOCO+/g datasets, modify num_ray at line 1 to 18 and model.head.shuffle_fraction to 0.2 at line 35, in configs/seqtr/segmentation/seqtr_mask_darknet.py.

Evaluation

python tools/test.py [PATH_TO_CONFIG_FILE] --load-from [PATH_TO_CHECKPOINT_FILE]

Pre-training + fine-tuning

We train SeqTR on 8 V100 GPUs while disabling Large Scale Jittering (LSJ) and Exponential Moving Average (EMA):

bash tools/dist_train.sh configs/seqtr/detection/seqtr_det_pretraining-vg.py 8

Models

	RefCOCO				RefCOCO+				RefCOCOg
	val	testA	testB	model	val	testA	testB	model	val-g	val-u	val-u	model
SeqTR on REC	81.23	85.00	76.08		68.82	75.37	58.78		-	71.35	71.58
SeqTR* on REC	83.72	86.51	81.24		71.45	76.26	64.88		71.50	74.86	74.21
SeqTR pre-trained+finetuned on REC	87.00	90.15	83.59		78.69	84.51	71.87		-	82.69	83.37
SeqTR on RES	67.26	69.79	64.12		54.14	58.93	48.19		-	55.67	55.64

SeqTR* denotes that its visual encoder is initialized with yolov3.weights, while the visual encoder of the rest are initialized with darknet.weights.

Contributing

Our codes are highly modularized and flexible to be extended to new architectures,. For instance, one can register new components such as head, fusion to promote your research ideas, or register new data augmentation techniques just as in mmdetection library. Feel free to play :-).

Citation

@article{zhu2022seqtr,
  title={SeqTR: A Simple yet Universal Network for Visual Grounding},
  author={Zhu, ChaoYang and Zhou, YiYi and Shen, YunHang and Luo, Gen and Pan, XingJia and Lin, MingBao and Chen, Chao and Cao, LiuJuan and Sun, XiaoShuai and Ji, RongRong},
  journal={arXiv preprint arXiv:2203.16265},
  year={2022}
}

Acknowledgement

Our code is built upon the open-sourced mmcv and mmdetection libraries.

SeqTR: A Simple yet Universal Network for Visual Grounding

Related tags

Overview

SeqTR

Installation

Prerequisites

Data Preparation

Training

Phrase Localization and Referring Expression Comprehension

Referring Expression Segmentation

Evaluation

Pre-training + fine-tuning

Models

Contributing

Citation

Acknowledgement

Owner

seanZhuh

Top #1 Submission code for the first https://alphamev.ai MEV competition with best AUC (0.9893) and MSE (0.0982).

A unified framework to jointly model images, text, and human attention traces.

Read number plates with https://platerecognizer.com/

This repository contains a PyTorch implementation of the paper Learning to Assimilate in Chaotic Dynamical Systems.

Official repository for "Orthogonal Projection Loss" (ICCV'21)

Code for MarioNette: Self-Supervised Sprite Learning, in NeurIPS 2021

Tensorforce: a TensorFlow library for applied reinforcement learning

The official implementation of Equalization Loss v1 & v2 (CVPR 2020, 2021) based on MMDetection.

PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation

Implementation of experiments in the paper Clockwork Variational Autoencoders (project website) using JAX and Flax

[NeurIPS'20] Self-supervised Co-Training for Video Representation Learning. Tengda Han, Weidi Xie, Andrew Zisserman.

Deepparse is a state-of-the-art library for parsing multinational street addresses using deep learning

Setup and customize deep learning environment in seconds.

The official repository for "Score Transformer: Generating Musical Scores from Note-level Representation" (MMAsia '21)

PyTorch Implementation of SSTNs for hyperspectral image classifications from the IEEE T-GRS paper "Spectral-Spatial Transformer Network for Hyperspectral Image Classification: A FAS Framework."

Tutorial for the PERFECTING FACTORY 5.0 WITH EDGE-POWERED AI workshop

SIMULEVAL A General Evaluation Toolkit for Simultaneous Translation

Code Repo for the ACL21 paper "Common Sense Beyond English: Evaluating and Improving Multilingual LMs for Commonsense Reasoning"

Source code for the paper: Variance-Aware Machine Translation Test Sets (NeurIPS 2021 Datasets and Benchmarks Track)

Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition in CVPR19