Generate text captions for images from their CLIP embeddings. Includes PyTorch model code and example training script.

Overview

clip-text-decoder

Example Predictions

Example captions were computed with the pretrained model mentioned below.

"A man riding a wave on top of a surfboard."

A surfer riding a wave

"A baseball player is swinging a bat at a ball."

Baseball player

"A dog running across a field with a frisbee."

Dog with frisbee

Installation

Install for easier access to the following objects/classes:

  • clip_text_decoder.datasets.ClipCocoCaptionsDataset
  • clip_text_decoder.models.ClipDecoder
  • clip_text_decoder.models.ClipDecoderInferenceModel
  • clip_text_decoder.tokenizer.Tokenizer

The train.py script will not be available in the installed package, since it's located in the root directory. To train new models, either clone this repository or recreate train.py locally.

Using pip:

pip install clip-text-decoder

From source:

git clone https://github.com/fkodom/clip-text-decoder.git
cd clip-text-decoder
pip install .

NOTE: You'll also need to install openai/CLIP to encode images with CLIP. This is also required by ClipCocoCaptionsDataset to build the captions dataset the first time (cached for subsequent calls).

pip install "clip @ git+https://github.com/openai/CLIP.git"

For technical reasons, the CLIP dependency can't be included in the PyPI package, since it's not an officially published package.
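Because CLIP is installed separately, it can help to confirm that it is importable before running anything that encodes images. The following is a minimal sketch using only the standard library; the helper name missing_modules is made up for illustration and is not part of clip-text-decoder.

```python
import importlib.util


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# "clip" is the module installed from the openai/CLIP repository.
missing = missing_modules(["clip", "clip_text_decoder"])
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

If "clip" shows up as missing, the pip command above should resolve it.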

Training

Launch your own training session using the provided script (train.py):

python train.py --max-epochs 5

Training CLI arguments, along with their default values:

--max-epochs 5  # (int)
--num-layers 6  # (int)
--dim-feedforward 256  # (int)
--precision 16  # (16 or 32)
--seed 0  # (int)
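For anyone recreating train.py locally, the flags above could be declared with argparse roughly as follows. This is a sketch that mirrors only the documented names and defaults; the function name build_parser is hypothetical, and the real script may define its arguments differently.

```python
import argparse


def build_parser():
    """Build a parser whose flags and defaults mirror the documented list."""
    parser = argparse.ArgumentParser(description="Illustrative training CLI")
    parser.add_argument("--max-epochs", type=int, default=5)
    parser.add_argument("--num-layers", type=int, default=6)
    parser.add_argument("--dim-feedforward", type=int, default=256)
    # Only half (16) or full (32) precision is accepted.
    parser.add_argument("--precision", type=int, choices=(16, 32), default=16)
    parser.add_argument("--seed", type=int, default=0)
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["--max-epochs", "10", "--precision", "32"])
print(vars(args))
```

argparse converts dashes to underscores, so --max-epochs is read back as args.max_epochs.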

Inference

The training script will produce a model.zip archive, containing the Tokenizer and trained model parameters. To perform inference with it:

import clip
from PIL import Image
import torch

from clip_text_decoder.model import ClipDecoderInferenceModel

device = "cuda" if torch.cuda.is_available() else "cpu"
model = ClipDecoderInferenceModel.load("path/to/model.zip").to(device)
clip_model, clip_preprocessor = clip.load("ViT-B/32", device=device, jit=False)

# Create a blank dummy image
dummy_image = Image.new("RGB", (224, 224))
preprocessed = clip_preprocessor(dummy_image).to(device)
# Add a batch dimension using '.unsqueeze(0)'.
# Gradients are not needed for inference, so disable them.
with torch.no_grad():
    encoded = clip_model.encode_image(preprocessed.unsqueeze(0))
    text = model(encoded)

print(text)
# Probably some nonsense, because we used a dummy image.

Pretrained Models

A pretrained CLIP decoder is hosted on my Google Drive, and can be downloaded with:

from clip_text_decoder.model import ClipDecoderInferenceModel

model = ClipDecoderInferenceModel.download_pretrained()

To cache the pretrained model locally, so that it's not re-downloaded each time:

model = ClipDecoderInferenceModel.download_pretrained("/path/to/model.zip")
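One way to keep the cache location consistent is to build the path once and pass it to download_pretrained. The sketch below assumes a per-user directory of ~/.cache/clip-text-decoder, which is an arbitrary choice for illustration and not a location the library prescribes; default_cache_path and load_cached_model are hypothetical helpers.

```python
from pathlib import Path


def default_cache_path(cache_dir=None):
    """Return the path where the pretrained archive should be cached."""
    if cache_dir is not None:
        base = Path(cache_dir)
    else:
        # Assumed default location; change it to suit your setup.
        base = Path.home() / ".cache" / "clip-text-decoder"
    return base / "model.zip"


def load_cached_model(cache_dir=None):
    """Create the cache directory, then load the model via download_pretrained."""
    # Imported lazily so the path helper works without the package installed.
    from clip_text_decoder.model import ClipDecoderInferenceModel

    path = default_cache_path(cache_dir)
    path.parent.mkdir(parents=True, exist_ok=True)
    return ClipDecoderInferenceModel.download_pretrained(str(path))
```

Calling load_cached_model() repeatedly then always points at the same archive.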

Shortcomings

  • Only works well with COCO-style images. If you go outside the distribution of COCO objects, you'll get nonsense text captions.
  • Relatively short training time. Even within the COCO domain, you'll occasionally see incorrect captions. Quite a few captions will have bad grammar, repetitive descriptors, etc.
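
Repetitive descriptors can sometimes be softened with a simple post-processing step. The sketch below collapses immediately repeated words in a caption. It is an illustrative helper (collapse_repeats is a made-up name), not something the package provides, and it does nothing for grammar errors or out-of-distribution images.

```python
def collapse_repeats(caption):
    """Drop words that immediately repeat the previous word (case-insensitive)."""
    kept = []
    for word in caption.split():
        if kept and kept[-1].lower() == word.lower():
            continue  # skip a direct repeat such as "large large"
        kept.append(word)
    return " ".join(kept)


print(collapse_repeats("A large large dog running in a a field."))
# prints: A large dog running in a field.
```
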
Comments
  • Decoding Text Embeddings Coded Using Hugging Face ClipTextModel

    Suppose that I have text embeddings created with Hugging Face's CLIPTextModel as follows:

    import torch
    from transformers import CLIPTokenizer, CLIPTextModel
    
    class_list = ["i love going home and playing with my wife and kids", "i love going home", "playing with my wife and kids", 
    "family", "war", "writing"]
    
    model = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
    tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
    
    inputs = tokenizer(class_list, padding=True, return_tensors="pt")
    outputs = model(**inputs)
    hidden_state = outputs.last_hidden_state
    embeddings = outputs.pooler_output
    

    Questions:

    1. Is it possible to use the clip-text-decoder to convert the embeddings back to text?
    2. If it is indeed possible to do so, could you provide an example of how?

    Looking forward to receiving your feedback.

    opened by mbdzi 6
  • Fix string error when loading clip models.

    error

    The model name string (VIT-xxx) in the check_vision_backbone function does not match the model name string (ViT-xxx) used by the clip repository. This causes at least one error, either in the check_vision_backbone function or when loading the clip model.

    solution

    In this PR, the model name string in the check_vision_backbone function is modified to ViT-xxx to make it compatible with the clip repository.

    opened by Adenialzz 1
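
The case mismatch described in this PR can be illustrated with a small validation helper. This is a sketch only: validate_clip_backbone is a hypothetical name and is not the repository's check_vision_backbone, and the hard-coded tuple is an assumption standing in for the list returned by clip.available_models().

```python
# Model names as published by the openai/CLIP repository (case-sensitive).
# Assumption: in real code, query clip.available_models() instead.
KNOWN_CLIP_MODELS = (
    "RN50", "RN101", "RN50x4", "RN50x16", "RN50x64",
    "ViT-B/32", "ViT-B/16", "ViT-L/14", "ViT-L/14@336px",
)


def validate_clip_backbone(name):
    """Return the name if it matches exactly, else raise with a hint."""
    if name in KNOWN_CLIP_MODELS:
        return name
    # Suggest the correctly cased name when only the case differs.
    by_lower = {known.lower(): known for known in KNOWN_CLIP_MODELS}
    hint = by_lower.get(name.lower())
    if hint is not None:
        raise ValueError(f"Unknown backbone {name!r}; did you mean {hint!r}?")
    raise ValueError(f"Unknown backbone {name!r}")
```

With this check, "VIT-B/32" is rejected with a pointer to "ViT-B/32" rather than failing later during model loading.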
  • BLIP vision backbone

    • Added blip backbone; still cleaning up last pieces
    • Bug fixes for training script, and remove debug code.
    • Fix dependencies in test workflow; update README statistics
    • Fix test issue with CUDA device
    • Update unit tests for newer Python, torch versions
    • Test up to Python 3.10
    • Test up to Python 3.9
    • Install lavis first
    opened by fkodom 0
  • Feature: Beam Search

    • Add beam search, clip dependency to setup.py
    • Fix installation instructions
    • Remove main clause
    • Add '--beam-size' option to 'train.py' script.
    • Update README; propagate the '--beam-size' arg through eval functions
    • Update setup.cfg, add pre-commit hooks
    • Reformat images
    • Remove fixed image width
    • Add detail to README; comments to call method for beam search
    • Updated README headline
    opened by fkodom 0
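
For readers unfamiliar with beam search, here is a generic toy version in plain Python. It is not the implementation added in this PR: step_fn, the token names, and the probability table are all made up for illustration, and a real decoder would score tokens with the trained model instead.

```python
import math


def beam_search(step_fn, start, end, beam_size=3, max_len=10):
    """Return the highest-scoring token sequence found by beam search.

    step_fn(sequence) must return a dict mapping next tokens to probabilities.
    """
    beams = [(0.0, [start])]  # (cumulative log-probability, tokens)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            for token, prob in step_fn(seq).items():
                if prob <= 0.0:
                    continue
                candidates.append((score + math.log(prob), seq + [token]))
        if not candidates:
            break
        # Keep only the best `beam_size` partial sequences.
        candidates.sort(key=lambda item: item[0], reverse=True)
        beams = []
        for score, seq in candidates[:beam_size]:
            (finished if seq[-1] == end else beams).append((score, seq))
        if not beams:
            break
    best = max(finished + beams, key=lambda item: item[0])
    return best[1]


def toy_model(seq):
    """Hypothetical next-token probabilities keyed on the last token."""
    table = {
        "<s>": {"a": 0.6, "the": 0.4},
        "a": {"dog": 0.55, "cat": 0.45},
        "the": {"dog": 0.9, "cat": 0.1},
        "dog": {"</s>": 1.0},
        "cat": {"</s>": 1.0},
    }
    return table[seq[-1]]


print(beam_search(toy_model, "<s>", "</s>", beam_size=1))  # greedy decoding
print(beam_search(toy_model, "<s>", "</s>", beam_size=2))
```

With beam_size=1 the search commits to "a" because it is the best first token, while beam_size=2 keeps "the" alive long enough to find the more probable overall sequence.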
  • Bug Fixes for Broken Tests

    • Cache the old fashioned way :)
    • Fix silly typo in test for image caption model
    • Apply black and isort formatting
    • Install latest version of 'black', reapply formatting
    • Fix flake8 issue (duplicate function definition), and install latest patch version of pytorch for tests.
    • Skip slow tests by default, add 'slow' marker to inference model tests.
    opened by fkodom 0
  • GPT2 Decoder

    • Update model to use DistilGPT2 as a pre-trained decoder.
    • Removed tokenizer (no longer used), fixed bugs in Model source file, and updated model unit tests.
    • Backwards compatibility for 'gdown.download' method.
    • Update installation requirements, caption examples in README
    opened by fkodom 0
  • Upgrade CodeSee workflow to version 2

    CodeSee is a code visibility platform.

    This change updates the CodeSee workflow file to the latest version for security, maintenance, and support improvements (see changelog below).

    That workflow file:

    • runs CodeSee's code analysis on every PR push and merge
    • uploads that analysis to CodeSee.
    • It does not transmit your code.

    The code analysis is used to generate maps and insights about this codebase.

    CodeSee workflow changelog:

    • Improved security: Updates permission to be read-only.
    • Improved future maintenance: Replaces the body of the workflow with a single github action: codesee-action. This makes it significantly easier for CodeSee to introduce future improvements and fixes without requiring another PR like this.
    • Improved Python support: The action now properly supports Python 3.11, and will continue to support new Python versions as they are released.
    opened by codesee-maps[bot] 1
  • Incompatible checksum error

    I see the following error when trying to load the pretrained model.

        tokenizer=pickle.loads(tokenizer_buffer.read()),
      File "stringsource", line 6, in spacy.pipeline.trainable_pipe.__pyx_unpickle_TrainablePipe
    _pickle.PickleError: Incompatible checksums (102742709 vs 0x417ddeb = (cfg, model, name, vocab))
    

    Am I missing something?

    opened by dapurv5 0
Releases(1.4.4)
  • 1.4.4(Nov 7, 2022)

    What's Changed

    • Fix string error when loading clip models. by @Adenialzz in https://github.com/fkodom/clip-text-decoder/pull/12

    New Contributors

    • @Adenialzz made their first contribution in https://github.com/fkodom/clip-text-decoder/pull/12

    Full Changelog: https://github.com/fkodom/clip-text-decoder/compare/1.4.3...1.4.4

  • 1.4.3(Nov 7, 2022)

    What's Changed

    • Refactor Dataset by @fkodom in https://github.com/fkodom/clip-text-decoder/pull/11

    Full Changelog: https://github.com/fkodom/clip-text-decoder/compare/1.4.2...1.4.3

  • 1.4.2(Oct 26, 2022)

    What's Changed

    • Huggingface Evaluate by @fkodom in https://github.com/fkodom/clip-text-decoder/pull/9

    Full Changelog: https://github.com/fkodom/clip-text-decoder/compare/1.4.1...1.4.2

  • 1.4.1(Oct 26, 2022)

    What's Changed

    • Datapipes by @fkodom in https://github.com/fkodom/clip-text-decoder/pull/8

    Full Changelog: https://github.com/fkodom/clip-text-decoder/compare/1.4.0...1.4.1

  • 1.4.0(Oct 23, 2022)

    What's Changed

    • BLIP vision backbone by @fkodom in https://github.com/fkodom/clip-text-decoder/pull/7

    Full Changelog: https://github.com/fkodom/clip-text-decoder/compare/1.3.0...1.4.0

  • 1.3.0(Oct 2, 2022)

    What's Changed

    • Feature: Beam Search by @fkodom in https://github.com/fkodom/clip-text-decoder/pull/5
    • Bug Fix: PyPI Release by @fkodom in https://github.com/fkodom/clip-text-decoder/pull/6

    Full Changelog: https://github.com/fkodom/clip-text-decoder/compare/1.2.0...1.3.0

  • 1.2.0(Jan 29, 2022)

    What's Changed

    • Cache CLIP embeddings for the dataset, rather than recomputing them each time.

    • Reduce model file sizes by storing at lower precision

    • Add an ImageCaptionInferenceModel class for easier out-of-the-box use

    • Fix some broken unit tests

    • Better Data Caching by @fkodom in https://github.com/fkodom/clip-text-decoder/pull/3

    • Bug Fixes for Broken Tests by @fkodom in https://github.com/fkodom/clip-text-decoder/pull/4

    Full Changelog: https://github.com/fkodom/clip-text-decoder/compare/1.1.0...1.2.0

  • 1.1.0(Dec 22, 2021)

    What's Changed

    • GPT2 Decoder by @fkodom in https://github.com/fkodom/clip-text-decoder/pull/2

    New Contributors

    • @fkodom made their first contribution in https://github.com/fkodom/clip-text-decoder/pull/2

    Full Changelog: https://github.com/fkodom/clip-text-decoder/compare/1.0.0...1.1.0

  • 0.1.1(Nov 14, 2021)

  • 0.1.0(Nov 14, 2021)

Owner
Frank Odom
Director of Innovation at Plainsight. I like neural nets, and neural nets like me.