Language Models Can See: Plugging Visual Controls in Text Generation

Authors: Yixuan Su, Tian Lan, Yahui Liu, Fangyu Liu, Dani Yogatama, Yan Wang, Lingpeng Kong, and Nigel Collier

This repository contains code, models, and other related resources of our paper [Language Models Can See: Plugging Visual Controls in Text Generation].

If you are also interested in open-ended text generation and would like to see more details of our contrastive search decoding method, please refer to our SimCTG [paper] and [repo].

Replicate has provided a great web [demo] of MAGIC that is super easy to use and to interact with. Check it out!


1. Introduction:

Generative language models (LMs) such as GPT-2/3 can be prompted to generate text with remarkable quality. While they are designed for text-prompted generation, it remains an open question how the generation process could be guided by modalities beyond text such as images. In this work, we propose a training-free framework, called MAGIC (iMAge-Guided text generatIon with CLIP), for plugging in visual controls in the generation process and enabling LMs to perform multimodal tasks (e.g., image captioning) in a zero-shot manner. MAGIC is a simple yet efficient plug-and-play framework, which directly combines an off-the-shelf LM (i.e., GPT-2) and an image-text matching model (i.e., CLIP) for image-grounded text generation. During decoding, MAGIC influences the generation of the LM by introducing a CLIP-induced score, called magic score, which regularizes the generated result to be semantically related to a given image while being coherent to the previously generated context. Notably, the proposed decoding scheme does not involve any gradient update operation, therefore being computationally efficient. On the challenging task of zero-shot image captioning, MAGIC outperforms the state-of-the-art method by notable margins with a nearly 27 times decoding speedup. MAGIC is a flexible framework and is theoretically compatible with any text generation tasks that incorporate image grounding. In the experiments, we showcase that it is also capable of performing visually grounded story generation given both an image and a text prompt.
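The decoding rule described above can be written compactly. The following is a sketch reconstructed from the description in this section and from contrastive search. The notation is assumed here and should be checked against the paper: $V^{(k)}$ is the set of top-$k$ candidates from the LM, $h_v$ is the representation of token $v$, $s$ is cosine similarity, $I$ is the image, and $\alpha$, $\beta$ are the weights of the degeneration penalty and the magic score.

```latex
\[
x_t = \operatorname*{arg\,max}_{v \in V^{(k)}}
\Big\{ (1-\alpha)\, p_\theta(v \mid x_{<t})
     - \alpha \max_{1 \le j \le t-1} s(h_v, h_{x_j})
     + \beta\, f_{\mathrm{magic}}(v \mid I, x_{<t}, V^{(k)}) \Big\}
\]
\[
f_{\mathrm{magic}}(v \mid I, x_{<t}, V^{(k)})
  = \frac{\exp\big(\mathrm{CLIP}(I, [x_{<t} : v])\big)}
         {\sum_{z \in V^{(k)}} \exp\big(\mathrm{CLIP}(I, [x_{<t} : z])\big)}
\]
```

With $\beta = 0$ the last term vanishes and the rule reduces to contrastive search, which is why no gradient update is needed: the image only re-ranks the candidates.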


2. News:

  • [2022/05/06] MAGIC is publicly released!

3. Citation:

If you find our paper and resources useful, please kindly leave a star and cite our papers. Thanks!

@article{DBLP:journals/corr/abs-2205-02655,
  author    = {Yixuan Su and
               Tian Lan and
               Yahui Liu and
               Fangyu Liu and
               Dani Yogatama and
               Yan Wang and
               Lingpeng Kong and
               Nigel Collier},
  title     = {Language Models Can See: Plugging Visual Controls in Text Generation},
  journal   = {CoRR},
  volume    = {abs/2205.02655},
  year      = {2022},
  url       = {https://doi.org/10.48550/arXiv.2205.02655},
  doi       = {10.48550/arXiv.2205.02655},
  eprinttype = {arXiv},
  eprint    = {2205.02655},
  timestamp = {Wed, 11 May 2022 17:29:40 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2205-02655.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

@article{DBLP:journals/corr/abs-2202-06417,
  author    = {Yixuan Su and
               Tian Lan and
               Yan Wang and
               Dani Yogatama and
               Lingpeng Kong and
               Nigel Collier},
  title     = {A Contrastive Framework for Neural Text Generation},
  journal   = {CoRR},
  volume    = {abs/2202.06417},
  year      = {2022},
  url       = {https://arxiv.org/abs/2202.06417},
  eprinttype = {arXiv},
  eprint    = {2202.06417},
  timestamp = {Fri, 18 Feb 2022 12:23:53 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2202-06417.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

4. Environment Setup:

python version: 3.8
pip3 install -r requirements.txt

5. Zero-Shot Image Captioning:

5.1. Implementation of Experiments:

To ensure the reproducibility of our work, we provide all related resources to implement our experiments on the task of zero-shot image captioning. Please refer to [here] for more details.

5.2. Example Usage of Magic Search:

In the following, we illustrate how to perform zero-shot image captioning with magic search. Specifically, we show how to generate the results as shown in our case study in the paper.

5.2.1. Load Language Model:

We first load the language model as:

import sys
sys.path.append(r'./image_captioning/language_model/')
from simctg import SimCTG
language_model_name = r'cambridgeltl/magic_mscoco'
sos_token, pad_token = r'<-start_of_text->', r'<-pad->'
generation_model = SimCTG(language_model_name, sos_token, pad_token)
generation_model.eval()

5.2.2. Load CLIP:

Then, we load the CLIP model as:

import sys
sys.path.append(r'./image_captioning/clip/')
from clip import CLIP
model_name = "openai/clip-vit-base-patch32"
clip = CLIP(model_name)
clip.eval()

5.2.3. Prepare Start Token:

Note that the language model always starts generation with a start-of-sentence token. Here, we prepare the input ids of the start token.

import torch
sos_token = r'<-start_of_text->'
start_token = generation_model.tokenizer.tokenize(sos_token)
start_token_id = generation_model.tokenizer.convert_tokens_to_ids(start_token)
input_ids = torch.LongTensor(start_token_id).view(1,-1)

5.2.4. Load Image:

To generate the caption of a random image, we need to load the image as:

from PIL import Image             # to load images
from IPython.display import display # to display images
image_name_list = ['COCO_val2014_000000336777.jpg', 'COCO_val2014_000000182784.jpg', 'COCO_val2014_000000299319.jpg', 'COCO_val2014_000000516750.jpg',
                   'COCO_val2014_000000207151.jpg', 'COCO_val2014_000000078707.jpg', 'COCO_val2014_000000027440.jpg', 'COCO_val2014_000000033645.jpg',
                   'COCO_val2014_000000348905.jpg', 'COCO_val2014_000000545385.jpg', 'COCO_val2014_000000210032.jpg', 'COCO_val2014_000000577526.jpg']
index = 1 
'''
   you can easily reproduce all results shown in our case study (index from 0 to 3) 
   and the results in the appendix (index from 4 to 11).
'''

image_path = r'./image_captioning/example_images/' + image_name_list[index]
image_instance = Image.open(image_path)
display(image_instance)

5.2.5. Zero-Shot Image Captioning with Magic Search:

Now, let's generate the image caption with magic search!

'''
   setup the configurations of magic search
      k: the k in magic search
      alpha: the alpha in magic search
      beta: the beta in magic search
      decoding_len: the number of tokens to generate
'''
k, alpha, beta, decoding_len = 45, 0.1, 2.0, 16
eos_token = '<|endoftext|>'
output = generation_model.magic_search(input_ids, k, 
        alpha, decoding_len, beta, image_instance, clip, 60)
print (output)
'''
   A large cow standing in a street stall.
'''
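
To give a feel for what k, alpha, and beta control, here is a minimal sketch of a single magic search step on made-up numbers. It is not the code behind `generation_model.magic_search`; the function `magic_step`, the dictionary keys, and all scores are hypothetical, and the CLIP similarities are assumed to be raw image-text scores that are softmax-normalized over the k candidates.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def magic_step(candidates, alpha, beta):
    """Select the next token among the top-k candidates (hypothetical helper).

    Each candidate is a dict with illustrative keys:
      'token'   : the candidate token string
      'prob'    : LM probability of the token given the prefix
      'penalty' : max cosine similarity to the previous token representations
      'clip'    : assumed raw CLIP similarity between the image and prefix + token
    """
    magic_scores = softmax([c['clip'] for c in candidates])
    best_token, best_score = None, float('-inf')
    for cand, magic in zip(candidates, magic_scores):
        score = (1.0 - alpha) * cand['prob'] - alpha * cand['penalty'] + beta * magic
        if score > best_score:
            best_token, best_score = cand['token'], score
    return best_token


# Toy top-3 candidate set; every number here is invented for illustration.
candidates = [
    {'token': ' dog', 'prob': 0.50, 'penalty': 0.30, 'clip': 0.2},
    {'token': ' cow', 'prob': 0.30, 'penalty': 0.30, 'clip': 2.5},
    {'token': ' car', 'prob': 0.20, 'penalty': 0.30, 'clip': 0.1},
]

without_image = magic_step(candidates, alpha=0.1, beta=0.0)
with_image = magic_step(candidates, alpha=0.1, beta=2.0)
print(repr(without_image), repr(with_image))
```

In this toy setup, beta = 0 follows the language model alone and picks ' dog', while beta = 2.0 lets the image-text score dominate and picks ' cow'. A larger beta pulls the caption toward the image at the cost of fluency.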

5.2.6. Reproduce Our Results in the Paper:

If you would like to reproduce all the results shown in the case study and appendix of our paper, you can run this demo file as

python image_caption_demo.py

6. Visually Grounded Story Generation:

6.1. Implementation of Experiments:

To ensure the reproducibility of our work, we provide all related resources to implement our experiments on the task of visually grounded story generation. Please refer to [here] for more details.

6.2. Example Usage of Magic Search:

In the following, we illustrate how to perform visually grounded story generation with magic search. Specifically, we show how to generate the results as shown in our case study in the paper.

6.2.1. Load Language Model:

We first load the language model and prepare the story title as:

import sys
sys.path.append(r'./story_generation/language_model')
from transformers import AutoTokenizer
from simctg import SimCTG
language_model_name = r'cambridgeltl/simctg_rocstories'
tokenizer = AutoTokenizer.from_pretrained(language_model_name)
generation_model = SimCTG(language_model_name, tokenizer.pad_token_id)
generation_model.eval()

import torch
title = 'Ice Cream Tasting <|endoftext|>'
title_tokens = tokenizer.tokenize(title)
title_id_list = tokenizer.convert_tokens_to_ids(title_tokens)
title_ids = torch.LongTensor(title_id_list).view(1,-1)

6.2.2. Load CLIP:

Then, we load the CLIP model as:

import sys
sys.path.append(r'./story_generation/clip')
from clip import CLIP
model_name = "openai/clip-vit-base-patch32"
clip = CLIP(model_name)
clip.eval()

6.2.3. Get the Related Image:

Next, let's get the image that is related to the story title. We provide two ways of doing it, as shown below:

6.2.3.1. Retrieve from Image Index:

The first way is to retrieve the image from a pre-built image index. Before running the following commands, please make sure you have either built the image index from scratch as described [here] or downloaded our provided image index as described [here].

After the image index is ready, we can load the image index as

# build image index
import sys
sys.path.append(r'./story_generation/image_index')
from imageindex import ImageIndex
index_path = r'./story_generation/data/image_index/images_index_data/index_matrix.txt'
mapping_dict_path = r'./story_generation/data/image_index/images_index_data/mapping_dict.json'
image_folder_prefix_path = r'./story_generation/data/image_index/images/'
index = ImageIndex(index_path, mapping_dict_path, image_folder_prefix_path, clip)

Then, we can retrieve the top-1 image as

image_name_list, image_instance_list = index.search_image(title, top_k=1)
'''
   image_name_list: the list of names of the retrieved images
   image_instance_list: the list of images that we retrieve
'''
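
For intuition, retrieval of this kind can be sketched as a nearest-neighbour search over embedding vectors. The snippet below is only an illustration under assumptions: we assume the index stores one embedding per image and that the title is embedded into the same space, and we rank by cosine similarity. The function `search_top_k`, the toy vectors, and the file names are hypothetical and do not describe the internals of `ImageIndex`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def search_top_k(query_vec, index_matrix, names, top_k=1):
    """Return the names of the top_k images most similar to the query vector."""
    ranked = sorted(
        range(len(names)),
        key=lambda i: cosine(query_vec, index_matrix[i]),
        reverse=True,
    )
    return [names[i] for i in ranked[:top_k]]


# Toy 3-dimensional "embeddings"; real CLIP embeddings are much larger.
index_matrix = [
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
]
names = ['beach.jpg', 'ice_cream.jpg', 'city.jpg']
title_vec = [0.5, 0.9, 0.1]  # stand-in for the embedded story title

print(search_top_k(title_vec, index_matrix, names, top_k=1))
```

Raising `top_k` returns more candidates in descending order of similarity, mirroring the `top_k` argument used above.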

Let's see the retrieved images we got

from IPython.display import display
# display the top-1 image
display(image_instance_list[0])

6.2.3.2. Directly Load Image:

Alternatively, if you have not prepared the image index, we have provided the image in the repo. You can directly load it as

from PIL import Image
image_name_list = ['avopix-284658167.jpg']
image_instance_list = []
for name in image_name_list:
    image_path = r'./story_generation/example_images/' + name
    image_instance = Image.open(image_path)
    image_instance_list.append(image_instance)

6.2.4. Visually Grounded Story Generation with Magic Search:

[Note] Recall that, in this example, our story title is 'Ice Cream Tasting <|endoftext|>'.

Now, let's generate the story conditioned on the retrieved image

from IPython.display import display
k, alpha, beta, decoding_len  = 5, 0.6, 0.15, 100
'''
   The k, alpha, beta correspond to the k, alpha, beta in magic search
'''
image_instance = image_instance_list[0]
eos_token = r'<|endoftext|>'
output, _ = generation_model.magic_search(title_ids, k, alpha, decoding_len, beta, image_instance, 
        clip, 60, eos_token)
_, generated_story = generation_model.parse_generated_result(output, num_of_sentences_to_keep=5)
print (generated_story)
display(image_instance)
'''
   My family went to a ice cream shop. They ordered three flavors of ice cream. The first one was 
   strawberry, the second was chocolate, and the third was orange. I was excited to try all three 
   flavors. It was very good and I had a great time at the ice cream shop.
'''

Then, let's see what we can get using the vanilla contrastive search without the image grounding.

k, alpha, decoding_len  = 5, 0.6, 100
'''
   The k and alpha correspond to the k and alpha in contrastive search
'''
eos_token = r'<|endoftext|>'
output, _ = generation_model.fast_contrastive_search(title_ids, k, alpha, decoding_len, eos_token)
_, generated_story = generation_model.parse_generated_result(output, num_of_sentences_to_keep=5)
print (generated_story)
'''
   My family went to a ice cream shop. We ordered the Ice Cream Truck. It was delicious. The customer 
   service was terrible. We had to leave for another day.
'''
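
The role of alpha in contrastive search can be illustrated with a small sketch of the degeneration penalty, which discourages tokens whose representation is too similar to what has already been generated. This is a toy illustration, not the code behind `fast_contrastive_search`; the helper names, the 2-dimensional "hidden states", and the probabilities are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def degeneration_penalty(candidate_state, previous_states):
    """Largest similarity between the candidate and any previous token state."""
    return max(cosine(candidate_state, h) for h in previous_states)


def contrastive_choice(candidates, previous_states, alpha):
    """Pick the candidate maximizing (1 - alpha) * prob - alpha * penalty."""
    def score(cand):
        penalty = degeneration_penalty(cand['state'], previous_states)
        return (1.0 - alpha) * cand['prob'] - alpha * penalty
    return max(candidates, key=score)['token']


# Invented states for the tokens generated so far.
previous_states = [[1.0, 0.0], [0.0, 1.0]]
candidates = [
    # Likely under the LM, but identical to an earlier token state.
    {'token': 'repeat', 'prob': 0.6, 'state': [1.0, 0.0]},
    # Slightly less likely, but less similar to the context.
    {'token': 'fresh', 'prob': 0.5, 'state': [0.6, 0.8]},
]

print(contrastive_choice(candidates, previous_states, alpha=0.0))
print(contrastive_choice(candidates, previous_states, alpha=0.6))
```

With alpha = 0 the choice is purely greedy and the repetitive candidate wins; with alpha = 0.6, as in the configuration above, the penalty outweighs the small probability gap.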

6.2.5. Reproduce Our Results in the Paper:

If you would like to reproduce all the results shown in the case study and appendix of our paper, you can run this demo file as

python story_generation_demo.py

7. Contact

If you have any questions, feel free to contact me via (ys484 at cam.ac.uk).


8. MAGIC Elsewhere

We thank the community for its efforts in extending MAGIC!

  • Replicate has provided a great [demo] of MAGIC that is super easy to use. Thanks for the effort!