Python package to generate image embeddings with CLIP without PyTorch/TensorFlow

Overview

imgbeddings

A Python package to generate embedding vectors from images, using OpenAI's robust CLIP model via Hugging Face transformers. These image embeddings, derived from an image model that has seen the entire internet up to mid-2020, can be used for many things: unsupervised clustering (e.g. via umap), embeddings search (e.g. via faiss), and using downstream for other framework-agnostic ML/AI tasks such as building a classifier or calculating image similarity.

  • The embeddings generation models are ONNX INT8-quantized, meaning they're 20-30% faster on the CPU, much smaller on disk, and doesn't require PyTorch or TensorFlow as a dependency!
  • Works for many different image domains thanks to CLIP's zero-shot performance.
  • Includes utilities for using principal component analysis (PCA) to reduces the dimensionality of generated embeddings without losing much info.

Real-World Demo Notebooks

You can read how to use imgbeddings for real-world use cases in these Jupyter Notebooks:

Installation

aitextgen can be installed from PyPI:

pip3 install imgbeddings

Quick Example

Let's say you want to generate an image embedding for a cute cat photo. First you can download the photo:

import requests
from PIL import Image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

Then, you can load imgbeddings. By default, imgbeddings will load a 88MB model based on the patch32 variant of CLIP, which separates each image into 49 32x32 patches.

from imgbeddings import imgbeddings
ibed = imgbeddings()

You can also load the patch16 model by passing patch_size = 16 to imgbeddings() (more granular embeddings but takes about 3x longer to run), or the "large" patch14 model with patch_size = 14 (3.5x model size, 3x longer than patch16).

Then to generate embeddings, all you have to is pass the image to to_embeddings()!

embedding = ibed.to_embeddings(image)
embedding[0][0:5] # array([ 0.914541, 0.45988417, 0.0350069 , -0.9054574 , 0.08941309], dtype=float32)

This returns a 768D numpy vector for each input, which can be used for pretty much anything in the ML/AI world. You can also pass a list of filename and/or PIL Images for batch embeddings generation.

See the Demo Notebooks above for more advanced parameters and real-world use cases. More formal documentation will be added soon.

Ethics

The official paper for CLIP explicitly notes that there are inherent biases in the finished model, and that CLIP shouldn't be used in production applications as a result. My perspective is that having better tools free-and-open-source to detect such issues and make it more transparent is an overall good for the future of AI, especially since there are less-public ways to create image embeddings that aren't as accessible. At the least, this package doesn't do anything that wasn't already available when CLIP was open-sourced in January 2021.

If you do use imgbeddings for your own project, I recommend doing a strong QA pass along a diverse set of inputs for your application, which is something you should always be doing whenever you work with machine learning, biased models or not.

imgbeddings is not responsible for malicious misuse of image embeddings.

Design Notes

  • Note that CLIP was trained on square images only, and imgbeddings will pad and resize rectangular images into a square (imgbeddings deliberately does not center crop). As a result, images too wide/tall (e.g. more than a 3:1 ratio of largest dimension to smallest) will not generate robust embeddings.
  • This package only works with image data intentionally as opposed to leveraging CLIP's ability to link image and text. For downstream tasks, using your own text in conjunction with an image will likely give better results. (e.g. if training a model on an image embeddings + text embeddings, feed both and let the model determine the relative importance of each for your use case)

For more miscellaneous design notes, see DESIGN.md.

Maintainer/Creator

Max Woolf (@minimaxir)

Max's open-source projects are supported by his Patreon and GitHub Sponsors. If you found this project helpful, any monetary contributions to the Patreon are appreciated and will be put to good creative use.

See Also

License

MIT

You might also like...
Source code for models described in the paper "AudioCLIP: Extending CLIP to Image, Text and Audio" (https://arxiv.org/abs/2106.13043)

AudioCLIP Extending CLIP to Image, Text and Audio This repository contains implementation of the models described in the paper arXiv:2106.13043. This

improvement of CLIP features over the traditional resnet features on the visual question answering, image captioning, navigation and visual entailment tasks.

CLIP-ViL In our paper "How Much Can CLIP Benefit Vision-and-Language Tasks?", we show the improvement of CLIP features over the traditional resnet fea

 Segmentation in Style: Unsupervised Semantic Image Segmentation with Stylegan and CLIP
Segmentation in Style: Unsupervised Semantic Image Segmentation with Stylegan and CLIP

Segmentation in Style: Unsupervised Semantic Image Segmentation with Stylegan and CLIP Abstract: We introduce a method that allows to automatically se

Zero-Shot Text-to-Image Generation VQGAN+CLIP Dockerized
Zero-Shot Text-to-Image Generation VQGAN+CLIP Dockerized

VQGAN-CLIP-Docker About Zero-Shot Text-to-Image Generation VQGAN+CLIP Dockerized This is a stripped and minimal dependency repository for running loca

Simple image captioning model -  CLIP prefix captioning.
Simple image captioning model - CLIP prefix captioning.

Simple image captioning model - CLIP prefix captioning.

A Jupyter notebook to play with NVIDIA's StyleGAN3 and OpenAI's CLIP for a text-based guided image generation.

A Jupyter notebook to play with NVIDIA's StyleGAN3 and OpenAI's CLIP for a text-based guided image generation.

CLIPImageClassifier wraps clip image model from transformers

CLIPImageClassifier CLIPImageClassifier wraps clip image model from transformers. CLIPImageClassifier is initialized with the argument classes, these

CLIP (Contrastive Language–Image Pre-training) trained on Indonesian data

CLIP-Indonesian CLIP (Radford et al., 2021) is a multimodal model that can connect images and text by training a vision encoder and a text encoder joi

Implementation of
Implementation of "GNNAutoScale: Scalable and Expressive Graph Neural Networks via Historical Embeddings" in PyTorch

PyGAS: Auto-Scaling GNNs in PyG PyGAS is the practical realization of our G NN A uto S cale (GAS) framework, which scales arbitrary message-passing GN

Comments
  • multiple classes

    multiple classes

    Excuse me, I'm trying to use the work to clustering 4-classes datasets, while I following the instructions in "cat_dogs.ipynb", when using: umap.plot.points, raise a ValueError: "Plotting is currently only implemented for 2D embeddings", I pretty sure I follow the data structure as the repo given. Does it mean it just support binary classes? Thanks a lot~

    opened by CinKKKyo 3
  • Embeddings vary slightly when done in batches vs. single

    Embeddings vary slightly when done in batches vs. single

    import requests
    from PIL import Image
    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw)
    
    from imgbeddings import imgbeddings
    ibed = imgbeddings()
    
    embedding = ibed.to_embeddings(image)
    embedding[:, 0:5] 
    
    array([[ 0.914541  ,  0.45988417,  0.0350069 , -0.9054574 ,  0.08941309]],
          dtype=float32)
    
    embedding = ibed.to_embeddings([image]*4)
    embedding[:, 0:5] 
    
    array([[ 0.9133097 ,  0.46032238,  0.03528907, -0.90713847,  0.09063635],
           [ 0.9133097 ,  0.46032238,  0.03528907, -0.90713847,  0.09063635],
           [ 0.9133097 ,  0.46032238,  0.03528907, -0.90713847,  0.09063635],
           [ 0.9133097 ,  0.46032238,  0.03528907, -0.90713847,  0.09063635]],
          dtype=float32)
    

    Probably a side effect of ONNX conversion as that's within tolerances. (or a case where intra op is breaking parallelism?)

    bug 
    opened by minimaxir 0
  • Allow imgbeddings to optionally split an image into parts for more robust embeddings

    Allow imgbeddings to optionally split an image into parts for more robust embeddings

    Let's say you want to split the image into quadrants (2 row x 2 col)

    • Run each image as a batch of 4 inputs, with each input representing a quadrant
    • Hstack/contatenate the outputs to create a 768 * 4 vector (3072D)
    • PCA to get it down to a reasonable size to avoid curse-of-dimensionality shenanigans

    This should work since CLIP was trained with center/random cropping so the model should be resilient to subsets.

    Since the outcome of a 2x2 would give a maximum robustness for 448x448 images, which is still low, it may be worth it to scale it up/allow arbitrary segments (e.g. 4x4 for 896x896 images, or rectangular inputs) if the image resolution of the input data is consistent (e.g. 1024x1024 for StyleGAN shenanigans).

    enhancement 
    opened by minimaxir 1
Owner
Max Woolf
Data Scientist @buzzfeed. Plotter of pretty charts.
Max Woolf
Official code for 'Weakly-supervised Video Anomaly Detection with Robust Temporal Feature Magnitude Learning' [ICCV 2021]

RTFM This repo contains the Pytorch implementation of our paper: Weakly-supervised Video Anomaly Detection with Robust Temporal Feature Magnitude Lear

Yu Tian 242 Jan 08, 2023
The VeriNet toolkit for verification of neural networks

VeriNet The VeriNet toolkit is a state-of-the-art sound and complete symbolic interval propagation based toolkit for verification of neural networks.

9 Dec 21, 2022
Code for the paper Language as a Cognitive Tool to Imagine Goals in Curiosity Driven Exploration

IMAGINE: Language as a Cognitive Tool to Imagine Goals in Curiosity Driven Exploration This repo contains the code base of the paper Language as a Cog

Flowers Team 26 Dec 22, 2022
Code for intrusion detection system (IDS) development using CNN models and transfer learning

Intrusion-Detection-System-Using-CNN-and-Transfer-Learning This is the code for the paper entitled "A Transfer Learning and Optimized CNN Based Intrus

Western OC2 Lab 38 Dec 12, 2022
YolactEdge: Real-time Instance Segmentation on the Edge

YolactEdge, the first competitive instance segmentation approach that runs on small edge devices at real-time speeds. Specifically, YolactEdge runs at up to 30.8 FPS on a Jetson AGX Xavier (and 172.7

Haotian Liu 1.1k Jan 06, 2023
[ECCV2020] Content-Consistent Matching for Domain Adaptive Semantic Segmentation

[ECCV20] Content-Consistent Matching for Domain Adaptive Semantic Segmentation This is a PyTorch implementation of CCM. News: GTA-4K list is available

Guangrui Li 88 Aug 25, 2022
Conservative Q Learning for Offline Reinforcement Reinforcement Learning in JAX

CQL-JAX This repository implements Conservative Q Learning for Offline Reinforcement Reinforcement Learning in JAX (FLAX). Implementation is built on

Karush Suri 8 Nov 07, 2022
Real-time analysis of intracranial neurophysiology recordings.

py_neuromodulation Click this button to run the "Tutorial ML with py_neuro" notebooks: The py_neuromodulation toolbox allows for real time capable pro

Interventional Cognitive Neuromodulation - Neumann Lab Berlin 15 Nov 03, 2022
[ICLR 2022] Contact Points Discovery for Soft-Body Manipulations with Differentiable Physics

CPDeform Code and data for paper Contact Points Discovery for Soft-Body Manipulations with Differentiable Physics at ICLR 2022 (Spotlight). @InProceed

(Lester) Sizhe Li 29 Nov 29, 2022
Code for paper "Extract, Denoise and Enforce: Evaluating and Improving Concept Preservation for Text-to-Text Generation" EMNLP 2021

The repo provides the code for paper "Extract, Denoise and Enforce: Evaluating and Improving Concept Preservation for Text-to-Text Generation" EMNLP 2

Yuning Mao 18 May 24, 2022
A diff tool for language models

LMdiff Qualitative comparison of large language models. Demo & Paper: http://lmdiff.net LMdiff is a MIT-IBM Watson AI Lab collaboration between: Hendr

Hendrik Strobelt 27 Dec 29, 2022
Course about deep learning for computer vision and graphics co-developed by YSDA and Skoltech.

Deep Vision and Graphics This repo supplements course "Deep Vision and Graphics" taught at YSDA @fall'21. The course is the successor of "Deep Learnin

Yandex School of Data Analysis 160 Jan 02, 2023
Scripts of Machine Learning Algorithms from Scratch. Implementations of machine learning models and algorithms using nothing but NumPy with a focus on accessibility. Aims to cover everything from basic to advance.

Algo-ScriptML Python implementations of some of the fundamental Machine Learning models and algorithms from scratch. The goal of this project is not t

Algo Phantoms 81 Nov 26, 2022
Transformer Huffman coding - Complete Huffman coding through transformer

Transformer_Huffman_coding Complete Huffman coding through transformer 2022/2/19

3 May 19, 2022
Local-Global Stratified Transformer for Efficient Video Recognition

DualFormer This repo is the implementation of our manuscript entitled "Local-Global Stratified Transformer for Efficient Video Recognition". Our model

Sea AI Lab 19 Dec 07, 2022
sktime companion package for deep learning based on TensorFlow

NOTE: sktime-dl is currently being updated to work correctly with sktime 0.6, and wwill be fully relaunched over the summer. The plan is Refactor and

sktime 573 Jan 05, 2023
Bytedance Inc. 2.5k Jan 06, 2023
[CoRL 2021] A robotics benchmark for cross-embodiment imitation.

x-magical x-magical is a benchmark extension of MAGICAL specifically geared towards cross-embodiment imitation. The tasks still provide the Demo/Test

Kevin Zakka 36 Nov 26, 2022
Simple tutorials using Google's TensorFlow Framework

TensorFlow-Tutorials Introduction to deep learning based on Google's TensorFlow framework. These tutorials are direct ports of Newmu's Theano Tutorial

Nathan Lintz 6k Jan 06, 2023