A Persian image captioning model based on the Vision Encoder Decoder Models of 🤗 Transformers.

Last update: Aug 25, 2022

Overview

Persian-Image-Captioning

We fine-tune the Vision Encoder Decoder Model for image captioning on the coco-flickr-farsi dataset. The model is implemented in PyTorch with the 🤗 Transformers library by Hugging Face.

You can combine any pretrained vision model with any pretrained language model in the Vision Encoder Decoder Model. Here we use ViT as the encoder and ParsBERT (v2.0) as the decoder. The encoder and decoder are loaded separately via the from_pretrained() function. Cross-attention layers are randomly initialized and added to the decoder.
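The pairing described above can be sketched as follows. This is a minimal illustration, not the training code of this repository: it builds a deliberately tiny, randomly initialized ViT encoder and BERT decoder from configs (all sizes are made up so it runs without downloading weights) and checks that the decoder receives cross-attention layers. The checkpoint names in the trailing comment are assumptions about which weights would be used in practice.

```python
from transformers import (
    BertConfig,
    ViTConfig,
    VisionEncoderDecoderConfig,
    VisionEncoderDecoderModel,
)

# Hypothetical, deliberately tiny sizes so the model builds quickly offline.
encoder_config = ViTConfig(
    image_size=32, patch_size=16, hidden_size=32,
    num_hidden_layers=2, num_attention_heads=2, intermediate_size=64,
)
decoder_config = BertConfig(
    vocab_size=100, hidden_size=32,
    num_hidden_layers=2, num_attention_heads=2, intermediate_size=64,
)

# Combining the configs marks the decoder with is_decoder=True and
# add_cross_attention=True, so each decoder layer gets a cross-attention block.
config = VisionEncoderDecoderConfig.from_encoder_decoder_configs(
    encoder_config, decoder_config
)
model = VisionEncoderDecoderModel(config=config)

# Collect the decoder parameters that belong to the cross-attention blocks.
cross_attention_params = [
    name for name, _ in model.decoder.named_parameters() if "crossattention" in name
]
print(config.decoder.add_cross_attention, len(cross_attention_params) > 0)

# With pretrained weights, the analogous call would be (checkpoint names assumed):
# model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
#     "google/vit-base-patch16-224-in21k", "HooshvareLab/bert-fa-base-uncased"
# )
```

When pretrained checkpoints are combined this way, only the encoder and decoder weights come from the checkpoints; the cross-attention parameters start from random values and are learned during fine-tuning.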

See the Vision Encoder Decoder Model documentation for more information.

How to use

You can generate a caption for an image with the code below:

import urllib.request

import matplotlib.pyplot as plt
import torch
from PIL import Image
from transformers import AutoTokenizer, ViTFeatureExtractor, VisionEncoderDecoderModel


def show_img(image):
    # Display the image without axes
    plt.axis("off")
    plt.imshow(image)
    plt.show()


# Run on the GPU when one is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Pass the URL of any image to generate a caption for it
urllib.request.urlretrieve("https://images.unsplash.com/photo-1628191011227-522c7c3f0af9?ixlib=rb-1.2.1&ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&auto=format&fit=crop&w=870&q=80", "sample.png")
# Convert to RGB because ViT expects three-channel input
image = Image.open("sample.png").convert("RGB")

# Load the fine-tuned model for inference
model_checkpoint = 'MahsaShahidi/Persian-Image-Captioning'
model = VisionEncoderDecoderModel.from_pretrained(model_checkpoint).to(device)

# Image preprocessor for the ViT encoder and tokenizer for the ParsBERT decoder
feature_extractor = ViTFeatureExtractor.from_pretrained('google/vit-base-patch16-224-in21k')
tokenizer = AutoTokenizer.from_pretrained('HooshvareLab/bert-fa-base-uncased-clf-persiannews')

# Preprocess the image, generate caption token ids, and decode them into text
sample = feature_extractor(image, return_tensors="pt").pixel_values.to(device)
caption_ids = model.generate(sample, max_length=30)[0]
caption_text = tokenizer.decode(caption_ids, skip_special_tokens=True)
print(caption_text)
show_img(image)

Inference

Below are 3 captions generated for free stock photos after 2 epochs of training.

Image	Caption
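	Generated Caption: زنی در آشپزخانه در حال اماده کردن غذا است. (A woman is preparing food in the kitchen.)
	Generated Caption: گروهی از مردم در حال پرواز بادبادک در یک زمین چمنزار. (A group of people flying kites in a grassy field.)
	Generated Caption: مردی در ماشین نشسته و به ماشین نگاه می کند. (A man is sitting in a car and looking at the car.)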

Credits

A huge thanks to Kaggle for providing free GPU access, and to the creators of Hugging Face, ViT, and ParsBERT!

References

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Owner

Hamtech-ai
