This code provides a PyTorch implementation for OTTER (Optimal Transport distillation for Efficient zero-shot Recognition), as described in the paper.

Related tags

Deep LearningOTTER
Overview

Data Efficient Language-Supervised Zero-Shot Recognition with Optimal Transport Distillation

This repository contains PyTorch evaluation code, training code and pretrained models for OTTER (Optimal Transport distillation for Efficient zero-shot Recognition). Link to the paper.

Bichen Wu*, Ruizhe Cheng*, Peizhao Zhang, Tianren Gao, Joseph E. Gonzalez, Peter Vajda (* indicates equal contribution)

If you used this code for your experiments, please consider citing our paper:

@inproceedings{otter,
    Author = {Wu, Bichen and Cheng, Ruizhe and Zhang, Peizhao and Vajda, Peter and Gonzalez, Joseph E},
    Title = {Data Efficient Language-supervised Zero-shot Recognition with Optimal Transport Distillation},
    Journal = {arXiv:2112.09445},
    Year = {2021}
}

And our related work:

@inproceedings{cheng2021data,
  title={Data-Efficient Language-Supervised Zero-Shot Learning with Self-Distillation},
  author={Cheng, Ruizhe and Wu, Bichen and Zhang, Peizhao and Vajda, Peter and Gonzalez, Joseph E},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={3119--3124},
  year={2021}
}

Model Zoo

OTTER achieves good zero-shot image recognition results on multi-labeled Google Open Images V6 and ImageNet10K from Tencent Images.

Dataset Method Image Encoder Text Encoder GOI [email protected]=1 GOI [email protected]=5 GOI [email protected]=10 IN10K [email protected]=1 IN10K [email protected]=5 IN10K [email protected]=10 url
CC 3M InfoNCE RN50 DeCLUTR-Sci-base 26.8 55.1 66.4 10.9 29.4 40.5 model
CC 3M LS RN50 DeCLUTR-Sci-base 26.3 55.9 67.5 10.1 29.6 39.8 model
CC 3M KD RN50 DeCLUTR-Sci-base 26.7 55.3 67.1 10.0 27.5 38.5 model
CC 3M OTTER RN50 DeCLUTR-Sci-base 29.1 59.6 70.9 12.0 31.8 42.1 model

Usage

First, git clone the repository

git clone https://github.com/facebookresearch/OTTER.git

Then, install required packkages using pip

conda create --name otter python=3.8
conda activate otter
pip install -r requirements.txt

Try out classifying with a pretrained OTTER or one of its baseline models.

import torch
from PIL import Image
import otter

device = "cuda" if torch.cuda.is_available() else "cpu"
temperature = 60

model, preprocess = otter.load("OTTER") # KD, LS, InfoNCE
model = model.to(device)

image = Image.open("doge.jpg")
image = preprocess(image).unsqueeze(0).to(device)
texts = ['photo of a dog', 'photo of a sofa', 'photo of a flower']

with torch.no_grad():
    features = model.forward_features(image, texts)
    image_logits, text_logits = model.compute_logits(features)
    image_logits *= temperature

    probs = image_logits.softmax(dim=-1).cpu().numpy()

print("Probs:", probs)  # Probs: [[0.92657197 0.00180788 0.07162025]]

Evaluation

You can evaluate a pretrained model with launch_scripts/eval.sh.

Note that for faster evaluation, we used FAISS for knn lookup. The result however will be slightly different from using sklearn knn functions.

Data preparation

Download the Conceptual Caption or YFCC 15M (subset of YFCC100M) dataset for training. Download Google Open Images's or ImageNet 10K's test set for evaluation.

Conceptual Captions

First, download Train-GCC-training.tsv, which contains captions and image urls, from the official CC website. Then, follow the instructions in this repo to efficiently download Conceptual Captions. After the download completes, there should be a downloaded_training_report.tsv. Make sure it's in the same cc root folder as Train-GCC-training.tsv along with the training folder that contains all the images.

Run python data/cc_preprocess.py --cc_root /data/cc to generate a processed_labels.csv, which contains paired image paths and captions. This preprocessing step filters out invalid images that can't be opened by PIL. Note that not all images in the conceptual captions dataset are available. In our case, we had 2911810 valid images from the train set of conceptual captions.

YFCC 15M

Follow the instructions in here to download the 15 million images which were used in training CLIP.

After downloading all the zip files, convert the zip files to datadings format (with compression if necessary). In data/yfcc.py, the YFCC dataset takes in the datadings folder.

Google Open Images

Download the test set of Google Open Images V6 from here. We have provided the class names and label annotations in the dataset_meta_data folder.

ImageNet 10K (from Tencent ML-Images)

You can also evaluate on the validation set of multi-labeled ImageNet 10K from Tencent ML-Images. Download the ImageNet portion of Tencent ML-Images from here. We have also included the class names and label annotations in the dataset_meta_data folder.

The datasets should be placed in the following way:

DATA_ROOT/
  cc/
    processed_labels.csv
    training/
      ... (images)
  open-images/
    test/
      ... (images)
  tencent/
    images/
      ... (images)

Single node training

You can launch training on a single node with scripts in launch_scripts.

Dataset Analysis

You can analyze the prevalence of the noisy matching problem with python3 data_analysis.py --data_root <data_root> --datasets cc --batch 512 --stop 1000. The script uses a pretrained OpenAI CLIP model to estimate the the on-diagonal vs off-diagonal matching scores of an image-caption dataset.

License

This source code is licensed under the MIT license found in the LICENSE file in the root directory of this source tree.

Owner
Meta Research
Meta Research
implementation of paper - You Only Learn One Representation: Unified Network for Multiple Tasks

YOLOR implementation of paper - You Only Learn One Representation: Unified Network for Multiple Tasks To reproduce the results in the paper, please us

Kin-Yiu, Wong 1.8k Jan 04, 2023
Weak-supervised Visual Geo-localization via Attention-based Knowledge Distillation

Weak-supervised Visual Geo-localization via Attention-based Knowledge Distillation Introduction WAKD is a PyTorch implementation for our ICPR-2022 pap

2 Oct 20, 2022
Multiple-criteria decision-making (MCDM) with Electre, Promethee, Weighted Sum and Pareto

EasyMCDM - Quick Installation methods Install with PyPI Once you have created your Python environment (Python 3.6+) you can simply type: pip3 install

Labrak Yanis 6 Nov 22, 2022
Official code for ICCV2021 paper "M3D-VTON: A Monocular-to-3D Virtual Try-on Network"

M3D-VTON: A Monocular-to-3D Virtual Try-On Network Official code for ICCV2021 paper "M3D-VTON: A Monocular-to-3D Virtual Try-on Network" Paper | Suppl

109 Dec 29, 2022
The implementation for "Comprehensive Knowledge Distillation with Causal Intervention".

Comprehensive Knowledge Distillation with Causal Intervention This repository is a PyTorch implementation of "Comprehensive Knowledge Distillation wit

Xiang Deng 10 Nov 03, 2022
[CVPR'21] Projecting Your View Attentively: Monocular Road Scene Layout Estimation via Cross-view Transformation

Projecting Your View Attentively: Monocular Road Scene Layout Estimation via Cross-view Transformation Weixiang Yang, Qi Li, Wenxi Liu, Yuanlong Yu, Y

118 Dec 26, 2022
An experimental technique for efficiently exploring neural architectures.

SMASH: One-Shot Model Architecture Search through HyperNetworks An experimental technique for efficiently exploring neural architectures. This reposit

Andy Brock 478 Aug 04, 2022
The AugNet Python module contains functions for the fast computation of image similarity.

AugNet AugNet: End-to-End Unsupervised Visual Representation Learning with Image Augmentation arxiv link In our work, we propose AugNet, a new deep le

Ming 74 Dec 28, 2022
[2021 MultiMedia] CONQUER: Contextual Query-aware Ranking for Video Corpus Moment Retrieval

CONQUER: Contexutal Query-aware Ranking for Video Corpus Moment Retreival PyTorch implementation of CONQUER: Contexutal Query-aware Ranking for Video

Hou zhijian 23 Dec 26, 2022
[NeurIPS 2021] ORL: Unsupervised Object-Level Representation Learning from Scene Images

Unsupervised Object-Level Representation Learning from Scene Images This repository contains the official PyTorch implementation of the ORL algorithm

Jiahao Xie 55 Dec 03, 2022
Conceptual 12M is a dataset containing (image-URL, caption) pairs collected for vision-and-language pre-training.

Conceptual 12M We introduce the Conceptual 12M (CC12M), a dataset with ~12 million image-text pairs meant to be used for vision-and-language pre-train

Google Research Datasets 226 Dec 07, 2022
Face Recognize System on camera AI OAK1

FRS on OAK1 Face Recognize System on camera OAK1 This project contains our work that deploy on camera OAK1 Features Anti-Spoofing Face detection Face

Tran Anh Tuan 6 Aug 08, 2022
The code uses SegFormer for Semantic Segmentation on Drone Dataset.

SegFormer_Segmentation The code uses SegFormer for Semantic Segmentation on Drone Dataset. The details for the SegFormer can be obtained from the foll

Dr. Sander Ali Khowaja 1 May 08, 2022
Accelerating BERT Inference for Sequence Labeling via Early-Exit

Sequence-Labeling-Early-Exit Code for ACL 2021 paper: Accelerating BERT Inference for Sequence Labeling via Early-Exit Requirement: Please refer to re

李孝男 23 Oct 14, 2022
Multi Task Vision and Language

12-in-1: Multi-Task Vision and Language Representation Learning Please cite the following if you use this code. Code and pre-trained models for 12-in-

Facebook Research 712 Dec 19, 2022
TYolov5: A Temporal Yolov5 Detector Based on Quasi-Recurrent Neural Networks for Real-Time Handgun Detection in Video

TYolov5: A Temporal Yolov5 Detector Based on Quasi-Recurrent Neural Networks for Real-Time Handgun Detection in Video Timely handgun detection is a cr

Mario Duran-Vega 18 Dec 26, 2022
Painting app using Python machine learning and vision technology.

AI Painting App We are making an app that will track our hand and helps us to draw from that. We will be using the advance knowledge of Machine Learni

Badsha Laskar 3 Oct 03, 2022
Implementation of EMNLP 2017 Paper "Natural Language Does Not Emerge 'Naturally' in Multi-Agent Dialog" using PyTorch and ParlAI

Language Emergence in Multi Agent Dialog Code for the Paper Natural Language Does Not Emerge 'Naturally' in Multi-Agent Dialog Satwik Kottur, José M.

Karan Desai 105 Nov 25, 2022
A new test set for ImageNet

ImageNetV2 The ImageNetV2 dataset contains new test data for the ImageNet benchmark. This repository provides associated code for assembling and worki

186 Dec 18, 2022
Spearmint Bayesian optimization codebase

Spearmint Spearmint is a software package to perform Bayesian optimization. The Software is designed to automatically run experiments (thus the code n

Formerly: Harvard Intelligent Probabilistic Systems Group -- Now at Princeton 1.5k Dec 29, 2022