An easy to use, user-friendly and efficient code for extracting OpenAI CLIP (Global/Grid) features from image and text respectively.

Overview

Extracting OpenAI CLIP (Global/Grid) Features from Image and Text

This repo aims at providing an easy to use and efficient code for extracting image & text features using the official OpenAI CLIP models, which is also optimized for multi processing GPU feature extraction.

The official OpenAI CLIP repo only supports extracting global visual features, while the local grid features from CLIP visual models may also contain more detailed semantic information which can benefit multi visual-and-language downstream tasks[1][2]. As an alternative, this repo encapsulates minor-modified CLIP code in order to extract not only global visual features but also local grid visual features from different CLIP visual models. What's more, this repo is designed in a user-friendly object-oriented fashion, allowing users to add their customized visual_extractor classes easily to customize different input and output grid resolution.

To verify the semantic meaning of the extracted visual grid features, we also applied the extracted visual grid features of MSCOCO images from different official CLIP models for standard image captioning task. We got comparable or superior results in transformer baseline easily without hard-tuning hyperparameters, via simply replacing BUTD features with the extracted CLIP gird features. Surprisingly, we got 116.9 CIDEr score in teacher-forcing setting and 129.6 in reinforcement learning setting when using ViT-B/32 CLIP model, which conflicts with the experiment results in CLIP-ViL paper[1] where the authors observed that CLIP-ViT-B with grid features has a large performance degradation compared with other models (58.0 CIDEr score in CLIP-ViT-B_Transformer setting in COCO Captioning).

We provide supported CLIP models, results on MSCOCO image captioning, and other information below. We believe this repo can facilitate the usage of powerful CLIP models.

1. Supported CLIP Models

Currently this repo supports five visual extractor settings, including three standard pipelines used in official OpenAI CLIP repo and two additional customized pipelines supporting larger input resolution. You can refer to this file for more details about customizing your own visual backbones for different input and output resolution. In order to imporve training efficiency in image captioning task, we apply AvgPool2d to the output feature map to reduce grid features size in some settings without large performance degradation. We will support more CLIP models in the future.

Visual Backbone CLIP Model Input Resolution Output Resolution Feature Map Downsample Grid Feature Shape Global Feature Shape
Standard RN101 RN101 224 x 224 7 x 7 None 49 x 2048 1 x 512
ViT-B/32 ViT-B/32 224 x 224 7 x 7 None 49 x 768 1 x 512
ViT-B/16 ViT-B/16 224 x 224 14 x 14 AvgPool2d(kernel_size=(2,2), stride=2) 49 x 768 1 x 512
Customized RN101_448 RN101 448 x 448 14 x 14 AvgPool2d(kernel_size=(2,2), stride=2) 49 x 2048 1 x 512
ViT-B/32_448 ViT-B/32 448 x 448 14 x 14 AvgPool2d(kernel_size=(2,2), stride=2) 49 x 768 1 x 512

2. Results on MSCOCO Image Captioning (Karpathy's Splits)

We ran image captioning experiments on X-modaler with the extracted CLIP grid features. We easily got comparable or superior results in transformer baseline using the default hyperparameters in X-modaler's transformer baseline, except for SOLVER.BASE_LR=2e-4 in ViT-B/16 and ViT-B/32_448 teacher-forcing settings. The performance of transformer baseline using BUTD features is taken from X-modaler's paper.

2.1 Teacher-forcing

Name [email protected] [email protected] [email protected] [email protected] METEOR ROUGE-L CIDEr-D SPICE
BUTD 76.4 60.3 46.5 35.8 28.2 56.7 116.6 21.3
RN101 77.3 61.3 47.7 36.9 28.7 57.5 120.6 21.8
ViT-B/32 76.4 60.3 46.5 35.6 28.1 56.7 116.9 21.2
ViT-B/16 78.0 62.1 48.2 37.2 28.8 57.6 122.3 22.1
RN101_448 78.1 62.3 48.4 37.5 29.0 58.0 122.9 22.2
ViT-B/32_448 75.8 59.6 45.9 35.1 27.8 56.3 114.2 21.0

2.2 Self-critical Reinforcement Learning

Name [email protected] [email protected] [email protected] [email protected] METEOR ROUGE-L CIDEr-D SPICE
BUTD 80.5 65.4 51.1 39.2 29.1 58.7 130.0 23.0
RN101 - - - - - - - -
ViT-B/32 79.9 64.6 50.4 38.5 29.0 58.6 129.6 22.8
ViT-B/16 82.0 67.3 53.1 41.1 29.9 59.8 136.6 23.8
RN101_448 81.7 66.9 52.6 40.5 29.9 59.7 136.1 23.9
ViT-B/32_448 - - - - - - - -

3. Get Started

Note: The extracted feature files are compatible with X-modaler, where you can setup your experiments about cross-modal analytics conveniently.

3.1 Requirements

  • PyTorch ≥ 1.9 and torchvision that matches the PyTorch installation. Install them together at pytorch.org to make sure of this
  • timm ≥ 0.4.5

3.2 Examples

  1. Use CLIP ViT-B/32 model to extract global textual features of MSCOCO sentences from dataset_coco.json in Karpathy's released annotations.
CUDA_VISIBLE_DEVICES=0 python3 clip_textual_feats.py \
    --anno dataset_coco.json \
    --output_dir ${TXT_OUTPUT_DIR} \
    --model_type_or_path 'ViT-B/32'
  1. Use CLIP ViT-B/16 model to extract global and grid visual features of MSCOCO images.
CUDA_VISIBLE_DEVICES=0 python3 clip_visual_feats.py \
    --image_list 'example/MSCOCO/image_list_2017.txt' \
    --image_dir ${IMG_DIR} \
    --output_dir ${IMG_OUTPUT_DIR} \
    --ve_name 'ViT-B/16' \
    --model_type_or_path 'ViT-B/16'
  1. Use CLIP RN101 model to extract global and grid visual features of MSCOCO images.
CUDA_VISIBLE_DEVICES=0 python3 clip_visual_feats.py \
    --image_list 'example/MSCOCO/image_list_2017.txt' \
    --image_dir ${IMG_DIR} \
    --output_dir ${IMG_OUTPUT_DIR} \
    --ve_name 'RN101' \
    --model_type_or_path 'RN101'
  1. Use CLIP RN101 model to extract global and grid visual features of MSCOCO images with 448 x 448 resolution.
CUDA_VISIBLE_DEVICES=0 python3 clip_visual_feats.py \
    --image_list 'example/MSCOCO/image_list_2017.txt' \
    --image_dir ${IMG_DIR} \
    --output_dir ${IMG_OUTPUT_DIR} \
    --ve_name 'RN101_448' \
    --model_type_or_path 'RN101'

3.3 Speeding up feature extraction with Multiple GPUs

You can run the same script with same input list (i.e. --image_list or --anno) on another GPU (that can be from a different machine, provided that the disk to output the features is shared between the machines). The script will create a new feature extraction process that will only focus on processing the items that have not been processed yet, without overlapping with the other extraction process already running.

4. License

MIT

5. Acknowledgement

This repo used resources from OpenAI CLIP, timm, CLIP-ViL, X-modaler. The repo is implemented using PyTorch. We thank the authors for open-sourcing their awesome projects.

6. References

[1] How Much Can CLIP Benefit Vision-and-Language Tasks? Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, Kurt Keutzer. In Arxiv2021.

[2] In Defense of Grid Features for Visual Question Answering. Huaizu Jiang, Ishan Misra, Marcus Rohrbach, Erik Learned-Miller, Xinlei Chen. In CVPR2020.

Owner
Jianjie(JJ) Luo
SYSU & JDAIR Joint-PhD candidate.
Jianjie(JJ) Luo
Ελληνικά νέα (Python script) / Greek News Feed (Python script)

Ελληνικά νέα (Python script) / Greek News Feed (Python script) Ελληνικά English Το 2017 είχα υλοποιήσει ένα Python script για να εμφανίζει τα τωρινά ν

Loren Kociko 1 Jun 14, 2022
Text classification on IMDB dataset using Keras and Bi-LSTM network

Text classification on IMDB dataset using Keras and Bi-LSTM Text classification on IMDB dataset using Keras and Bi-LSTM network. Usage python3 main.py

Hamza Rashid 2 Sep 27, 2022
Code for hyperboloid embeddings for knowledge graph entities

Implementation for the papers: Self-Supervised Hyperboloid Representations from Logical Queries over Knowledge Graphs, Nurendra Choudhary, Nikhil Rao,

30 Dec 10, 2022
A BERT-based reverse-dictionary of Korean proverbs

Wisdomify A BERT-based reverse-dictionary of Korean proverbs. 김유빈 : 모델링 / 데이터 수집 / 프로젝트 설계 / back-end 김종윤 : 데이터 수집 / 프로젝트 설계 / front-end Quick Start C

Eu-Bin KIM 94 Dec 08, 2022
Open-Source Toolkit for End-to-End Speech Recognition leveraging PyTorch-Lightning and Hydra.

🤗 Contributing to OpenSpeech 🤗 OpenSpeech provides reference implementations of various ASR modeling papers and three languages recipe to perform ta

Openspeech TEAM 513 Jan 03, 2023
Meta learning algorithms to train cross-lingual NLI (multi-task) models

Meta learning algorithms to train cross-lingual NLI (multi-task) models

M.Hassan Mojab 4 Nov 20, 2022
Code for our paper "Mask-Align: Self-Supervised Neural Word Alignment" in ACL 2021

Mask-Align: Self-Supervised Neural Word Alignment This is the implementation of our work Mask-Align: Self-Supervised Neural Word Alignment. @inproceed

THUNLP-MT 46 Dec 15, 2022
Wind Speed Prediction using LSTMs in PyTorch

Implementation of Deep-Forecast using PyTorch Deep Forecast: Deep Learning-based Spatio-Temporal Forecasting Adapted from original implementation Setu

Onur Kaplan 151 Dec 14, 2022
AudioCLIP Extending CLIP to Image, Text and Audio

AudioCLIP Extending CLIP to Image, Text and Audio This repository contains implementation of the models described in the paper arXiv:2106.13043. This

458 Jan 02, 2023
Pervasive Attention: 2D Convolutional Networks for Sequence-to-Sequence Prediction

This is a fork of Fairseq(-py) with implementations of the following models: Pervasive Attention - 2D Convolutional Neural Networks for Sequence-to-Se

Maha 490 Dec 15, 2022
Switch spaces for knowledge graph embeddings

SwisE Switch spaces for knowledge graph embeddings. Requirements: python3 pytorch numpy tqdm Reproduce the results To reproduce the reported results,

Shuai Zhang 4 Dec 01, 2021
Smart discord chatbot integrated with Dialogflow to manage different classrooms and assist in teaching!

smart-school-chatbot Smart discord chatbot integrated with Dialogflow to interact with students naturally and manage different classes in a school. De

Tom Huynh 5 Oct 24, 2022
KoBERT - Korean BERT pre-trained cased (KoBERT)

KoBERT KoBERT Korean BERT pre-trained cased (KoBERT) Why'?' Training Environment Requirements How to install How to use Using with PyTorch Using with

SK T-Brain 1k Jan 02, 2023
Beyond the Imitation Game collaborative benchmark for enormous language models

BIG-bench 🪑 The Beyond the Imitation Game Benchmark (BIG-bench) will be a collaborative benchmark intended to probe large language models, and extrap

Google 1.3k Jan 01, 2023
In this Notebook I've build some machine-learning and deep-learning to classify corona virus tweets, in both multi class classification and binary classification.

Hello, This Notebook Contains Example of Corona Virus Tweets Multi Class Classification. - Classes is: Extremely Positive, Positive, Extremely Negativ

Khaled Tofailieh 3 Dec 06, 2022
Princeton NLP's pre-training library based on fairseq with DeepSpeed kernel integration 🚃

This repository provides a library for efficient training of masked language models (MLM), built with fairseq. We fork fairseq to give researchers mor

Princeton Natural Language Processing 92 Dec 27, 2022
Create a machine learning model which will predict if the mortgage will be approved or not based on 5 variables

Mortgage-Application-Analysis Create a machine learning model which will predict if the mortgage will be approved or not based on 5 variables: age, in

1 Jan 29, 2022
Source code for AAAI20 "Generating Persona Consistent Dialogues by Exploiting Natural Language Inference".

Generating Persona Consistent Dialogues by Exploiting Natural Language Inference Source code for RCDG model in AAAI20 Generating Persona Consistent Di

16 Oct 08, 2022
A single model that parses Universal Dependencies across 75 languages.

A single model that parses Universal Dependencies across 75 languages. Given a sentence, jointly predicts part-of-speech tags, morphology tags, lemmas, and dependency trees.

Dan Kondratyuk 189 Nov 29, 2022
Convolutional 2D Knowledge Graph Embeddings resources

ConvE Convolutional 2D Knowledge Graph Embeddings resources. Paper: Convolutional 2D Knowledge Graph Embeddings Used in the paper, but do not use thes

Tim Dettmers 586 Dec 24, 2022