OMNIVORE is a single vision model for many different visual modalities

Last update: Dec 27, 2022

Overview

Omnivore: A Single Model for Many Visual Modalities

OMNIVORE is a single vision model for many different visual modalities. It learns to construct representations that are aligned across visual modalities, without requiring training data that specifies correspondences between those modalities. Using OMNIVORE’s shared visual representation, we successfully identify nearest neighbors of a query image (ImageNet-1K validation set) in vision datasets that contain depth maps (ImageNet-1K training set), single-view 3D images (ImageNet-1K training set), and videos (Kinetics-400 validation set).
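
To make the retrieval idea concrete, here is a minimal sketch of nearest-neighbor search by cosine similarity in a shared embedding space. It is an illustration only and not code from this repo: the function names (`cosine_similarity`, `nearest_neighbors`) and the toy 3-dimensional feature vectors are hypothetical, and in practice the features would come from the model. Cosine similarity is assumed here as the distance measure.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbors(query, gallery, k=1):
    """Return the names of the k gallery items most similar to the query."""
    ranked = sorted(
        gallery,
        key=lambda name: cosine_similarity(query, gallery[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy features (hypothetical values) standing in for the embedding of an image
# query and the embeddings of depth-map and video items from other datasets.
query_image = [0.9, 0.1, 0.0]
gallery = {
    "depth_map_a": [0.8, 0.2, 0.1],
    "video_b": [0.0, 1.0, 0.0],
    "depth_map_c": [0.0, 0.1, 0.9],
}

print(nearest_neighbors(query_image, gallery, k=2))
```

Because all modalities share one representation space, the same similarity function can rank depth maps and videos against an image query without any modality-specific handling.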

This repo contains the code to run inference with a pretrained model on an image, video or RGBD image.

Usage

Setup and Installation

conda create --name omnivore python=3.8
conda activate omnivore
conda install pytorch=1.9.0 torchvision=0.10.0 torchaudio=0.9.0 cudatoolkit=11.1 -c pytorch -c nvidia
conda install -c conda-forge -c pytorch -c defaults apex
conda install pytorchvideo

To run the notebook, you may also need to install the following:

conda install jupyter nb_conda ipykernel
python -m ipykernel install --user --name omnivore

Run Inference

Follow the inference_tutorial.ipynb tutorial locally for step-by-step instructions on how to run inference with an image, video, and RGBD image.

Model Zoo

Name	IN1k Top 1	Kinetics400 Top 1	SUN RGBD Top 1	Model
Omnivore Swin T	81.2	78.9	62.3	weights
Omnivore Swin S	83.4	82.2	64.6	weights
Omnivore Swin B	84.0	83.3	65.4	weights
Omnivore Swin B (IN21k)	85.3	84.0	67.2	weights
Omnivore Swin L (IN21k)	86.0	84.1	67.1	weights

Numbers are based on Table 2 and Table 4 in the Omnivore paper.

Torch Hub

Models can be loaded via Torch Hub, e.g.:

import torch
model = torch.hub.load("facebookresearch/omnivore", model="omnivore_swinB")

The class mappings for the datasets can be downloaded as follows:

wget https://s3.amazonaws.com/deep-learning-models/image-models/imagenet_class_index.json 
wget https://dl.fbaipublicfiles.com/pyslowfast/dataset/class_names/kinetics_classnames.json 
wget https://dl.fbaipublicfiles.com/omnivore/sunrgbd_classnames.json
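
Once downloaded, the three files can be turned into index-to-name lookups. The sketch below rests on assumptions about the file layouts that are not stated above, so check them against the actual files: the ImageNet file is assumed to map an index string to a `[wnid, name]` pair, the Kinetics file to map a (possibly quoted) class name to an integer index, and the SUN RGB-D file to map an index string directly to a name. All helper names are hypothetical, and the sample entries are illustrative only.

```python
import json


def load_json(path):
    """Read a JSON file from disk."""
    with open(path, "r") as f:
        return json.load(f)


def imagenet_id_to_name(raw):
    # Assumed layout: {"0": ["n01440764", "tench"], ...}
    return {int(idx): entry[1] for idx, entry in raw.items()}


def kinetics_id_to_name(raw):
    # Assumed layout: {"\"abseiling\"": 0, ...} (name -> id; strip stray quotes)
    return {int(idx): str(name).replace('"', "") for name, idx in raw.items()}


def sunrgbd_id_to_name(raw):
    # Assumed layout: {"0": "bathroom", ...}
    return {int(idx): name for idx, name in raw.items()}


# Small in-memory samples in the assumed layouts. With the real files, use e.g.
# imagenet_id_to_name(load_json("imagenet_class_index.json")).
imagenet = imagenet_id_to_name(
    {"0": ["n01440764", "tench"], "1": ["n01443537", "goldfish"]}
)
kinetics = kinetics_id_to_name({'"abseiling"': 0, "air drumming": 1})
sunrgbd = sunrgbd_id_to_name({"0": "bathroom"})

print(imagenet[1], kinetics[0], sunrgbd[0])
```

Normalizing every mapping to integer keys means a predicted class index can be looked up the same way regardless of which dataset head produced it.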

Citation

If this work is helpful in your research, please consider starring ⭐ us and citing:

@article{girdhar2022omnivore,
  title={{Omnivore: A Single Model for Many Visual Modalities}},
  author={Girdhar, Rohit and Singh, Mannat and Ravi, Nikhila and van der Maaten, Laurens and Joulin, Armand and Misra, Ishan},
  journal={arXiv preprint arXiv:2201.08377},
  year={2022}
}

Contributing

We welcome your pull requests! Please see CONTRIBUTING and CODE_OF_CONDUCT for more information.

License

Omnivore is released under the CC-BY-NC 4.0 license. See LICENSE for additional details. However, the Swin Transformer implementation is additionally licensed under the Apache 2.0 license (see NOTICE for additional details).

Owner

Meta Research
