Code release for "Detecting Twenty-thousand Classes using Image-level Supervision".

Last update: Jan 04, 2023

Detecting Twenty-thousand Classes using Image-level Supervision

Detic: A Detector with image classes that can use image-level labels to easily train detectors.

Detecting Twenty-thousand Classes using Image-level Supervision,
Xingyi Zhou, Rohit Girdhar, Armand Joulin, Philipp Krähenbühl, Ishan Misra,
arXiv technical report (arXiv 2201.02605)

Features

Detects any class given class names (using CLIP).
We train the detector on the ImageNet-21K dataset, which has 21K classes.
Cross-dataset generalization to OpenImages and Objects365 without finetuning.
State-of-the-art results on Open-vocabulary LVIS and Open-vocabulary COCO.
Works for DETR-style detectors.
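
To illustrate the first feature, here is a toy sketch of the general idea behind classifying a detected region against class names: score the region's feature vector against one text embedding per class name and pick the best match. This is not Detic's code. The 3-dimensional vectors, the `TOY_TEXT_EMBEDDINGS` table, and the `classify` helper are all made up for illustration; in the real system the embeddings would come from the CLIP text encoder.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def classify(region_feature, class_embeddings):
    """Return the best-matching class name and all cosine-similarity scores."""
    region = normalize(region_feature)
    scores = {}
    for name, embedding in class_embeddings.items():
        text = normalize(embedding)
        scores[name] = sum(a * b for a, b in zip(region, text))
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical stand-ins for text embeddings of the class names.
TOY_TEXT_EMBEDDINGS = {
    "headphone": [0.9, 0.1, 0.0],
    "webcam": [0.1, 0.9, 0.1],
    "paper": [0.0, 0.2, 0.9],
}

# Hypothetical feature vector for one detected region.
region_feature = [0.8, 0.2, 0.1]

best, scores = classify(region_feature, TOY_TEXT_EMBEDDINGS)
print(best)
```

Because the classifier is just a table of embeddings keyed by class name, changing the vocabulary only means changing that table, which is why a new class list needs no retraining in this sketch.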

Installation

See installation instructions.

Demo

Integrated into Huggingface Spaces 🤗 using Gradio. Try out the web demo:

Run our demo using Colab (no GPU needed):

We use the default detectron2 demo interface. For example, to run our 21K model on a messy desk image (image credit David Fouhey) with the lvis vocabulary, run

mkdir models
wget https://dl.fbaipublicfiles.com/detic/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.pth -O models/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.pth
wget https://web.eecs.umich.edu/~fouhey/fun/desk/desk.jpg
python demo.py --config-file configs/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.yaml --input desk.jpg --output out.jpg --vocabulary lvis --opts MODEL.WEIGHTS models/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.pth

If everything is set up correctly, the output should look like:

The same model can run with other vocabularies (COCO, OpenImages, or Objects365), or a custom vocabulary. For example:

python demo.py --config-file configs/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.yaml --input desk.jpg --output out2.jpg --vocabulary custom --custom_vocabulary headphone,webcam,paper,coffe --confidence-threshold 0.3 --opts MODEL.WEIGHTS models/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.pth

The output should look like:

Note that headphone, paper and coffe (typo intended) are not LVIS classes. Despite the misspelled class name, our detector can produce a reasonable detection for coffe.
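
When trying several custom vocabularies, it can help to assemble the command from a list of class names. The sketch below is a convenience wrapper and not part of Detic: `build_demo_command` and `MODEL_NAME` are hypothetical names, and it uses only the flags and paths shown in the commands above.

```python
import shlex

# Checkpoint/config stem taken from the demo commands above.
MODEL_NAME = "Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size"


def build_demo_command(image, output, class_names, threshold=0.3):
    """Build the demo.py argument list for a custom vocabulary."""
    return [
        "python", "demo.py",
        "--config-file", f"configs/{MODEL_NAME}.yaml",
        "--input", image,
        "--output", output,
        "--vocabulary", "custom",
        # demo.py takes the class names as one comma-separated string.
        "--custom_vocabulary", ",".join(class_names),
        "--confidence-threshold", str(threshold),
        "--opts", "MODEL.WEIGHTS", f"models/{MODEL_NAME}.pth",
    ]


command = build_demo_command(
    "desk.jpg", "out2.jpg", ["headphone", "webcam", "paper", "coffe"]
)
# Print a copy-pasteable shell command; pass `command` to subprocess.run to execute it.
print(shlex.join(command))
```

Class names that contain spaces would be quoted by `shlex.join` when printed; the list form can be passed to `subprocess.run` unchanged.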

Benchmark evaluation and training

Please first prepare datasets, then check our MODEL ZOO to reproduce results in our paper. We highlight key results below:

Open-vocabulary LVIS

	mask mAP	mask mAP_novel
Box-Supervised	30.2	16.4
Detic	32.4	24.9

Standard LVIS

	Detector/Backbone	mask mAP	mask mAP_rare
Box-Supervised	CenterNet2-ResNet50	31.5	25.6
Detic	CenterNet2-ResNet50	33.2	29.7
Box-Supervised	CenterNet2-SwinB	40.7	35.9
Detic	CenterNet2-SwinB	41.7	41.7

	Detector/Backbone	box mAP	box mAP_rare
Box-Supervised	DeformableDETR-ResNet50	31.7	21.4
Detic	DeformableDETR-ResNet50	32.5	26.2

Cross-dataset generalization

	Backbone	Objects365 box mAP	OpenImages box mAP50
Box-Supervised	SwinB	19.1	46.2
Detic	SwinB	21.4	55.2
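
As a quick reading aid, the following sketch computes the absolute gain of Detic over the box-supervised baseline on the novel, rare, and cross-dataset metrics. The numbers are copied from the tables above; the `RESULTS` list and the `gains` helper are illustrative names, not part of the code release.

```python
# (setting, box-supervised score, Detic score), copied from the tables above.
RESULTS = [
    ("Open-vocabulary LVIS, mask mAP_novel", 16.4, 24.9),
    ("Standard LVIS, CenterNet2-ResNet50, mask mAP_rare", 25.6, 29.7),
    ("Standard LVIS, CenterNet2-SwinB, mask mAP_rare", 35.9, 41.7),
    ("Standard LVIS, DeformableDETR-ResNet50, box mAP_rare", 21.4, 26.2),
    ("Objects365, SwinB, box mAP", 19.1, 21.4),
    ("OpenImages, SwinB, box mAP50", 46.2, 55.2),
]


def gains(results):
    """Map each setting to Detic's absolute gain, rounded to one decimal."""
    return {name: round(detic - base, 1) for name, base, detic in results}


improvements = gains(RESULTS)
for name, delta in improvements.items():
    print(f"{name}: +{delta}")
```

In these tables, the largest gains are on OpenImages box mAP50 (+9.0) and on novel LVIS classes (+8.5).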

License

The majority of Detic is licensed under the Apache 2.0 license; however, portions of the project are available under separate license terms: SWIN-Transformer, CLIP, and TensorFlow Object Detection API are licensed under the MIT license; UniDet is licensed under the Apache 2.0 license; and the LVIS API is licensed under a custom license (https://github.com/lvis-dataset/lvis-api/blob/master/LICENSE). If you later add other third-party code, please keep this license information updated, and please let us know if that component is licensed under something other than CC-BY-NC, MIT, or CC0.

Ethical Considerations

Detic's wide range of detection capabilities may introduce similar challenges to many other visual recognition and open-set recognition methods. As the user can define arbitrary detection classes, class design and semantics may impact the model output.

Citation

If you find this project useful for your research, please use the following BibTeX entry.

@inproceedings{zhou2021detecting,
  title={Detecting Twenty-thousand Classes using Image-level Supervision},
  author={Zhou, Xingyi and Girdhar, Rohit and Joulin, Armand and Kr{\"a}henb{\"u}hl, Philipp and Misra, Ishan},
  booktitle={arXiv preprint arXiv:2201.02605},
  year={2021}
}

Owner

Meta Research
