Multi-modal Vision Transformers Excel at Class-agnostic Object Detection

Last update: Jan 04, 2023

Overview

MViTs Excel at Class-agnostic Object Detection

Multi-modal Vision Transformers Excel at Class-agnostic Object Detection

Muhammad Maaz, Hanoona Rasheed, Salman Khan, Fahad Shahbaz Khan, Rao Muhammad Anwer and Ming-Hsuan Yang

Paper: https://arxiv.org/abs/2111.11430

Abstract: What constitutes an object? This has been a long-standing question in computer vision. Towards this goal, numerous learning-free and learning-based approaches have been developed to score objectness. However, they generally do not scale well across new domains and for unseen objects. In this paper, we advocate that existing methods lack a top-down supervision signal governed by human-understandable semantics. To bridge this gap, we explore recent Multi-modal Vision Transformers (MViT) that have been trained with aligned image-text pairs. Our extensive experiments across various domains and novel objects show the state-of-the-art performance of MViTs to localize generic objects in images. Based on these findings, we develop an efficient and flexible MViT architecture using multi-scale feature processing and deformable self-attention that can adaptively generate proposals given a specific language query. We show the significance of MViT proposals in a diverse range of applications including open-world object detection, salient and camouflage object detection, supervised and self-supervised detection tasks. Further, MViTs offer enhanced interactability with intelligible text queries.

Architecture overview of MViTs used in this work

Results

Class-agnostic OD performance of MViTs in comparison with uni-modal detector (RetinaNet) on several datasets. MViTs show consistently good results on all datasets.

Enhanced Interactability: Effect of using different intuitive text queries on the MDef-DETR class-agnostic OD performance. Combining detections from multiple queries captures varying aspects of objectness.

Generalization to Rare/Novel Classes: MDef-DETR class-agnostic OD performance on rarely and frequently occurring categories in the pretraining captions. The numbers on top of the bars indicate occurrences of the corresponding category in the training dataset. The MViT achieves good recall values even for the classes with no or very few occurrences.

Open-world Object Detection: Effect of using class-agnostic OD proposals from MDef-DETR for pseudo labelling of unknowns in Open World Detector (ORE).

Pretraining for Class-aware Object Detection: Effect of using MDef-DETR proposals for pre-training of DETReg instead of Selective Search proposals.

Evaluation

The provided codebase contains the pre-computed detections for all datasets using ours MDef-DETR model. The provided directory structure is as follows,

-> README.md
-> LICENSE
-> get_eval_metrics.py
-> get_multi_dataset_eval_metrics.py
-> data
    -> voc2007
        -> combined.pkl
    -> coco
        -> combined.pkl
    -> kitti
        -> combined.pkl
    -> kitchen
        -> combined.pkl
    -> cliaprt
        -> combined.pkl
    -> comic
        -> combined.pkl
    -> watercolor
        -> combined.pkl
    -> dota
        -> combined.pkl

Where combined.pkl contains the combined detections from multiple intutive text queries for corresponding datasets. (Refer Section 5.1: Enhanced Interactability for more details)

Download the annotations for all datasets and arrange them as shown below. Note that the script expect COCO annotations in standard COCO format & annotations of all other datasets in VOC format.

...
...
-> data
    -> voc2007
        -> combined.pkl
        -> Annotations
    -> coco
        -> combined.pkl
        -> instances_val2017_filt.json
    -> kitti
        -> combined.pkl
        -> Annotations
        ...
    -> kitchen
        -> combined.pkl
        -> Annotations
    -> cliaprt
        -> combined.pkl
        -> Annotations
    -> comic
        -> combined.pkl
        -> Annotations
    -> watercolor
        -> combined.pkl
        -> Annotations
    -> dota
        -> combined.pkl
        -> Annotations

Once the above mentioned directory structure is created, follow the following steps to calculate the metrics.

Install numpy

$ pip install numpy

Calculate metrics

$ python get_multi_dataset_eval_metrics.py

The calculated metrics will be stored in a data.csv file in the same directory.

Citation

If you use our work, please consider citing:

@article{Maaz2021Multimodal,
    title={Multi-modal Transformers Excel at Class-agnostic Object Detection},
    author={Muhammad Maaz and Hanoona Rasheed and Salman Khan and Fahad Shahbaz Khan and Rao Muhammad Anwer and Ming-Hsuan Yang},
    journal={ArXiv 2111.11430},
    year={2021}
}

Contact

Should you have any question, please contact [email protected] or [email protected]

🚀 Note: The repository contains the minimum evaluation code. The complete training and inference scripts along with pretrained models will be released soon. Stay Tuned!

Comments

aligning image text pairs

I have a question on the paper: you train on aligned image-text pairs. How do you create this alignment? is it the same way as in MDeTr? I did not fully understand from the paper, especially for non-natural images like satellite images or medical images.

opened by nikky4D 6
Loading checkpoints for inference

Which checkpoints in drive link you provided will load correctly in default MDefDETR model without any errors? Im getting missing/unexpected keys errors.
documentation

opened by KaleemW 4
Is EMA used in this work?

Hello author, thanks for your great work. I raise a question about the usage of Exponential Moving Average (EMA) in this paper, hoping you can provide me with some clues. It seems that this paper does not detail in this part. As far as I know, MDETR uses it and evaluate use the EMA model. So I wonder is it used in this work? If it is actually used, why should we evaluate by the EMA model rather than the original one?

opened by JacobYuan7 4
one of the variables needed for gradient computation has been modified by an inplace operation

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.LongTensor [2, 20]] is at version 3; expected version 2 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

This error will terminate the training procedure when training mdef_detr using the PyTorch environment as you advise(torch==1.8.0+cu111).

And I found the variables of 'transformer.text_encoder.pooler.dense.weight' does not have grad. This may be the main reason for this error.

opened by xushilin1 2
Loading the Faster RCNN checkpoint

Greetings

The readme states: (Feb 01, 2022) Training codes for MDef-DETR and MDef-DETR minus Language models are released -> training/README.md Instructions to use class-agnostic object detection behavior of MDef-DETR on different applications are released -> applications/README.md All the pretrained models (MDef-DETR, Def-DETR, MDETR, DETReg, Faster-RCNN, RetinaNet, ORE, and others), along with the instructions to reproduce the results are released -> this link

Following the link to the google drive, only provides me with the model weight for the Faster-RCNN, but not with instructions on how to load it and which framework to use. I have tried creating a Faster-RCNN-resnet101 model with pytorch, but when I load the model weight, it states that the layer names does not match. Any guidance would be much appreciated.

Best regards Martin
uni-modal-detectors

opened by MartinPedersenpp 2
Need to understand how to import weights

Hello,

Firstly, I'd like to congratulate you for bringing this amazing work. Class agnostic object detection is much needed currently in the industry and this would be a great way to solve the problem.

I wanted to test your model on some custom data. However, I cannot import pre-trained weights from the link you have provided. I can see the zip file but I couldn't find a way to import them. I'm using OpenCV to import weights. It is asking me to have a config file as well as .weights file.

Could you please help me which library to use to import weights when I'm working on a jupyter notebook?

Thank you,

opened by abhi-vellala 2
pretrain data download

if is it possible to split pretrain data into multiple seperate zip files。 I download data from google drive : https://drive.google.com/drive/folders/1-3kAsyZIVFbNelRXrF93Y5tMgOypv2jV i cannot download this data because of google drive time limit(less than 1 hours) and my limit network bandwidth。
documentation

opened by zhouxingguang 1
Training code release
This pull request adds

Training codes for MDef-DETR and MDef-DETR minus Language models

Instructions to use class-agnostic object detection behavior of MDef-DETR on different applications

All the pre-trained models (MDef-DETR, Def-DETR, MDETR, DETReg, Faster-RCNN, RetinaNet, ORE, and others), along with the instructions to reproduce the results
opened by mmaaz60 0
Questions about your training procedure?

To my understanding, I think you use image-text pairs as inputs and only bbox annotations as supervision signals without any class labels, does it right?

opened by GYslchen 1
Questions about your pretrained model

Does the pre-trained model you provide cover the categories on LVIS data? If I want to do open-world object detection on the LVIS dataset, can I directly use your pre-trained model to generate the proposals or should I need to filter the dataset so that it doesn't contain any object in the LVIS dataset?

opened by chengsilin 1
how to generate 'tokens_positive' ann from detector dataset like object365?

I found 'tokens_positive' was used in your ann file. could you please release the code of how to process detect data like coco to get the 'tokens_positive' ann results?
documentation

opened by zhouxingguang 1

Releases(v1.0)

v1.0(Feb 1, 2022)
Training codes for MDef-DETR and MDef-DETR minus Language models are released -> training/README.md

Instructions to use class-agnostic object detection behavior of MDef-DETR on different applications are released -> applications/README.md

All the pretrained models (MDef-DETR, Def-DETR, MDETR, DETReg, Faster-RCNN, RetinaNet, ORE, and others), along with the instructions to reproduce the results are released -> this link

Source code(tar.gz)
Source code(zip)
v0.1(Nov 25, 2021)
Evaluation Code & Pre-trained Models

Releases evaluation code for MDef-DETR and 'MDef-DETR w/o Language Branch' model

Releases the pre-trained weights for both models

Releases the pre-computed predictions for both the models

Source code(tar.gz)
Source code(zip)

Owner

Muhammad Maaz

An Electrical Engineer with experience in Computer Vision software development. Skilled in Machine Learning, Deep Learning and Computer Vision.

GitHub Repository

Plug-n-Play Reinforcement Learning in Python with OpenAI Gym and JAX

coax is built on top of JAX, but it doesn't have an explicit dependence on the jax python package. The reason is that your version of jaxlib will depend on your CUDA version.

128 Dec 27, 2022

Image morphing without reference points by applying warp maps and optimizing over them.

Differentiable Morphing Image morphing without reference points by applying warp maps and optimizing over them. Differentiable Morphing is machine lea

380 Dec 19, 2022

[ICML 2020] Prediction-Guided Multi-Objective Reinforcement Learning for Continuous Robot Control

PG-MORL This repository contains the implementation for the paper Prediction-Guided Multi-Objective Reinforcement Learning for Continuous Robot Contro

65 Jan 07, 2023

Vehicle speed detection with python

Vehicle-speed-detection In the project simulate the tracker.py first then simulate the SpeedDetector.py. Finally, a new window pops up and the output

3 Dec 15, 2022

A PyTorch Implementation of FaceBoxes

FaceBoxes in PyTorch By Zisian Wong, Shifeng Zhang A PyTorch implementation of FaceBoxes: A CPU Real-time Face Detector with High Accuracy. The offici

797 Dec 17, 2022

10x faster matrix and vector operations

Bolt is an algorithm for compressing vectors of real-valued data and running mathematical operations directly on the compressed representations. If yo

2.3k Jan 09, 2023

[NeurIPS'21] "AugMax: Adversarial Composition of Random Augmentations for Robust Training" by Haotao Wang, Chaowei Xiao, Jean Kossaifi, Zhiding Yu, Animashree Anandkumar, and Zhangyang Wang.

AugMax: Adversarial Composition of Random Augmentations for Robust Training Haotao Wang, Chaowei Xiao, Jean Kossaifi, Zhiding Yu, Anima Anandkumar, an

112 Nov 07, 2022

Simple Pixelbot for Diablo 2 Resurrected written in python and opencv.

Simple Pixelbot for Diablo 2 Resurrected written in python and opencv. Obviously only use it in offline mode as it is against the TOS of Blizzard to use it in online mode!

468 Jan 03, 2023

Deep Learning for Natural Language Processing SS 2021 (TU Darmstadt)

Deep Learning for Natural Language Processing SS 2021 (TU Darmstadt) Task Training huge unsupervised deep neural networks yields to strong progress in

1 Jan 26, 2022

Code for "AutoMTL: A Programming Framework for Automated Multi-Task Learning"

AutoMTL: A Programming Framework for Automated Multi-Task Learning This is the website for our paper "AutoMTL: A Programming Framework for Automated M

40 Dec 04, 2022

A collection of resources and papers on Diffusion Models, a darkhorse in the field of Generative Models

This repository contains a collection of resources and papers on Diffusion Models and Score-based Models. If there are any missing valuable resources

5.1k Jan 08, 2023

FluxTraining.jl gives you an endlessly extensible training loop for deep learning

A flexible neural net training library inspired by fast.ai

86 Dec 31, 2022

FADNet++: Real-Time and Accurate Disparity Estimation with Configurable Networks

6 Nov 18, 2022

Keywords : Streamlit, BertTokenizer, BertForMaskedLM, Pytorch

Next Word Prediction Keywords : Streamlit, BertTokenizer, BertForMaskedLM, Pytorch 🎬 Project Demo ✔ Application is hosted on Streamlit. You can see t

3 Aug 26, 2022

Microscopy Image Cytometry Toolkit

Cytokit Cytokit is a collection of tools for quantifying and analyzing properties of individual cells in large fluorescent microscopy datasets with a

106 Jan 06, 2023

A python library for self-supervised learning on images.

Lightly is a computer vision framework for self-supervised learning. We, at Lightly, are passionate engineers who want to make deep learning more effi

2k Jan 08, 2023

Noise Conditional Score Networks (NeurIPS 2019, Oral)

Generative Modeling by Estimating Gradients of the Data Distribution This repo contains the official implementation for the NeurIPS 2019 paper Generat

451 Dec 26, 2022

Codebase for the self-supervised goal reaching benchmark introduced in the LEXA paper

LEXA Benchmark Codebase for the self-supervised goal reaching benchmark introduced in the LEXA paper (Discovering and Achieving Goals via World Models

36 Dec 22, 2022

Walk with fastai

Shield: This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Walk with fastai What is this p

124 Dec 10, 2022

Some bravo or inspiring research works on the topic of curriculum learning.

Towards Scalable Unpaired Virtual Try-On via Patch-Routed Spatially-Adaptive GAN Official code for NeurIPS 2021 paper "Towards Scalable Unpaired Virtu

131 Jan 07, 2023