Official implementation of the paper Visual Parser: Representing Part-whole Hierarchies with Transformers

Last update: Dec 11, 2022

Visual Parser (ViP)

This is the official implementation of the paper Visual Parser: Representing Part-whole Hierarchies with Transformers.

Key Features & TLDR

PyTorch implementation of the ViP network; see models/vip.py.
A fast, compact implementation of the relative positional encoding proposed in HaloNet, BoTNet, and AANet.
A transformer-friendly FLOPs and parameter counter that supports FLOPs calculation for einsum and matmul operations.
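To illustrate why einsum and matmul need dedicated handling in a counter, here is a minimal sketch of how their cost can be derived from operand shapes alone. It is not the counter shipped in this repository; the function names `matmul_macs` and `einsum_macs` are hypothetical, and the sketch counts multiply-accumulates (MACs) for a naive contraction, which many tools report as FLOPs.

```python
from math import prod


def matmul_macs(shape_a, shape_b):
    """Multiply-accumulates for A @ B with identical leading batch dims."""
    *batch_a, m, k = shape_a
    *batch_b, k_b, n = shape_b
    if k != k_b or batch_a != batch_b:
        raise ValueError("incompatible shapes for this simple counter")
    return prod(batch_a) * m * k * n


def einsum_macs(equation, *shapes):
    """Multiply-accumulates for an explicit einsum such as 'bhid,bhjd->bhij'.

    Every distinct index letter contributes its size once; the product of
    all sizes is the cost of a naive (unoptimized) contraction.
    """
    inputs, _ = equation.replace(" ", "").split("->")
    terms = inputs.split(",")
    if len(terms) != len(shapes):
        raise ValueError("number of operands does not match the equation")
    sizes = {}
    for term, shape in zip(terms, shapes):
        if len(term) != len(shape):
            raise ValueError(f"term {term!r} does not match shape {shape}")
        for letter, size in zip(term, shape):
            if sizes.setdefault(letter, size) != size:
                raise ValueError(f"inconsistent size for index {letter!r}")
    return prod(sizes.values())


# Attention scores for 1 image, 4 heads, 196 tokens, 32 channels per head.
q_shape = k_shape = (1, 4, 196, 32)
print(einsum_macs("bhid,bhjd->bhij", q_shape, k_shape))  # 4917248
```

The same attention product written as a batched matmul, `matmul_macs((1, 4, 196, 32), (1, 4, 32, 196))`, gives the same count, which is a quick sanity check for either function.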

Prerequisite

Please refer to get_started.md.

Results and Models

All models listed below are evaluated at an input size of 224x224.

Model	Top-1 Acc (%)	#Params	FLOPs	Download
ViP-Tiny	79.0	12.8M	1.7G	Google Drive
ViP-Small	82.1	32.1M	4.5G	Google Drive
ViP-Medium	83.3	49.6M	8.0G	Coming Soon
ViP-Base	83.6	87.8M	15.0G	Coming Soon

To load the pretrained checkpoint, e.g. ViP-Tiny, simply run:

# First download the checkpoint and save it as vip_t_dict.pth
from models.vip import vip_tiny

# Build ViP-Tiny and load the downloaded weights from the given path
model = vip_tiny(pretrained="vip_t_dict.pth")

Evaluation

To evaluate a pre-trained ViP on ImageNet val, run:

python3 main.py <data-root> --model <model-name> -b <batch-size> --eval_checkpoint <path-to-checkpoint>
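For reference, the top-1 accuracy reported in the results table is the percentage of validation images whose highest-scoring class matches the label. The sketch below shows that metric on plain Python lists; `top1_accuracy` is a hypothetical helper, and main.py may compute the metric differently (for example, batched on GPU).

```python
def top1_accuracy(logits, labels):
    """Return top-1 accuracy in percent.

    logits: one sequence of per-class scores per sample.
    labels: the ground-truth class index for each sample.
    """
    if len(logits) != len(labels) or not labels:
        raise ValueError("logits and labels must be non-empty and equal in length")
    correct = 0
    for scores, label in zip(logits, labels):
        # Index of the highest score is the predicted class.
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += predicted == label
    return 100.0 * correct / len(labels)


# Two of the three predictions below match their labels.
scores = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]
targets = [1, 0, 0]
accuracy = top1_accuracy(scores, targets)
```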

Training from scratch

To train a ViP on ImageNet from scratch, run:

bash ./distributed_train.sh <job-name> <config-path> <num-gpus>

For example, to train ViP with 8 GPUs on a single node, run:

ViP-Tiny:

bash ./distributed_train.sh vip-t-001 configs/vip_t_bs1024.yaml 8

ViP-Small:

bash ./distributed_train.sh vip-s-001 configs/vip_s_bs1024.yaml 8

ViP-Medium:

bash ./distributed_train.sh vip-m-001 configs/vip_m_bs1024.yaml 8

ViP-Base:

bash ./distributed_train.sh vip-b-001 configs/vip_b_bs1024.yaml 8

Profiling the model

To measure the throughput, run:

python3 test_throughput.py <model-name>

For example, to measure the throughput of ViP-Tiny on your device, run:

python3 test_throughput.py vip-tiny
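Throughput is usually reported as images processed per second after a few warm-up iterations. The following is a minimal sketch of that measurement loop, not the contents of test_throughput.py; `measure_throughput` and its parameters are hypothetical, and `step` stands in for one forward pass over a batch.

```python
import time


def measure_throughput(step, batch_size, warmup=5, iters=20):
    """Return samples per second for a callable that processes one batch.

    Warm-up iterations are excluded from timing so one-off costs such as
    memory allocation or kernel selection do not skew the result.
    """
    for _ in range(warmup):
        step()
    start = time.perf_counter()
    for _ in range(iters):
        step()
    # With a GPU model, synchronize the device (e.g. torch.cuda.synchronize())
    # before reading the clock, otherwise queued kernels are not counted.
    elapsed = time.perf_counter() - start
    return batch_size * iters / elapsed


# Stand-in workload; replace with a real forward pass.
rate = measure_throughput(lambda: sum(range(10_000)), batch_size=64)
print(f"{rate:.1f} samples/s")
```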

To measure the FLOPs and the number of parameters, run:

python3 test_flops.py <model-name>
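The parameter counts in the results table (for example, 12.8M for ViP-Tiny) are totals over all weight tensors. As a rough sketch of that bookkeeping, assuming the tensor shapes are available as a mapping (in PyTorch they could be collected from `model.named_parameters()`), the hypothetical helpers below sum the element counts and format the total the way the table does. This is not the code in test_flops.py.

```python
from math import prod


def count_params(shapes):
    """Sum the number of elements over a {name: shape} mapping."""
    return sum(prod(shape) for shape in shapes.values())


def human_readable(n):
    """Format a count with a K/M/G suffix, e.g. 12800000 -> '12.8M'."""
    for unit, scale in (("G", 1e9), ("M", 1e6), ("K", 1e3)):
        if n >= scale:
            return f"{n / scale:.1f}{unit}"
    return str(n)


# A single hypothetical linear layer: 10 x 4 weight plus 10 biases.
layer = {"fc.weight": (10, 4), "fc.bias": (10,)}
print(count_params(layer))            # 50
print(human_readable(12_800_000))     # 12.8M
```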

Citing ViP

@article{vip,
  title={Visual Parser: Representing Part-whole Hierarchies with Transformers},
  author={Sun, Shuyang and Yue, Xiaoyu and Bai, Song and Torr, Philip},
  journal={arXiv preprint arXiv:2107.05790},
  year={2021}
}

Contact

If you have any questions, don't hesitate to contact Shuyang (Kevin) Sun. You can easily reach him by sending an email to [email protected].

Owner

Shuyang Sun
