A deep learning model that describes the content of an image as speech, using caption generation with an attention mechanism on the Flickr8K dataset.

Last update: Feb 08, 2022

Overview

Eye for the blind

This project aims to build a deep learning model that describes the content of an image as speech, by generating captions with an attention mechanism trained on the Flickr8K dataset. Such a model can help blind people understand an image by hearing its description. The caption generated by a CNN-RNN model is converted to speech using a text-to-speech library.

The problem combines deep learning and natural language processing. A CNN-based encoder extracts features from the image, and an RNN-based decoder turns those features into a caption.

The project extends the paper "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention": https://arxiv.org/abs/1502.03044
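The soft attention step that connects the encoder and decoder can be sketched in plain Python. This is a minimal illustration, assuming the additive scoring form (a tanh of projected features plus projected hidden state) that is commonly used for this kind of model. The function names, weight matrices, and toy values are hypothetical and are not taken from the project code.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def soft_attention(features, hidden, w_feat, w_hid, v):
    """Additive soft attention (illustrative sketch).

    features: L feature vectors of size D, one per image region
    hidden:   decoder hidden state of size H
    w_feat:   A x D projection, w_hid: A x H projection, v: vector of size A
    Returns (context vector of size D, attention weights of size L).
    """
    proj_hidden = matvec(w_hid, hidden)
    scores = []
    for feat in features:
        proj_feat = matvec(w_feat, feat)
        act = [math.tanh(a + b) for a, b in zip(proj_feat, proj_hidden)]
        scores.append(sum(vi * ai for vi, ai in zip(v, act)))
    weights = softmax(scores)
    dim = len(features[0])
    # The context vector is the attention-weighted average of the region features.
    context = [sum(w * feat[d] for w, feat in zip(weights, features))
               for d in range(dim)]
    return context, weights

# Toy values, chosen only to make the example runnable.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hidden = [0.5, -0.5]
w_feat = [[1.0, 0.0], [0.0, 1.0]]
w_hid = [[1.0, 0.0], [0.0, 1.0]]
v = [1.0, 1.0]

context, weights = soft_attention(features, hidden, w_feat, w_hid, v)
print("attention weights:", weights)
print("context vector:", context)
```

At each decoding step the weights form a distribution over image regions, so the decoder can focus on different parts of the image for different words. In a trained model the projections would be learned layers rather than fixed lists.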

The dataset is taken from Kaggle. It contains 8,000 images, each paired with five different captions that clearly describe the salient entities and events in the image.
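Loading the captions usually starts by grouping them per image. The sketch below assumes the captions come as CSV text with an `image,caption` header; that layout, the file names, and the sample captions are assumptions made for illustration, so adjust the parsing to the file you actually download.

```python
import csv
import io
from collections import defaultdict

def load_captions(lines):
    """Group captions by image file name.

    `lines` is any iterable of CSV text lines with an `image,caption` header
    (an assumed layout, not confirmed by the project).
    """
    captions = defaultdict(list)
    for row in csv.DictReader(lines):
        captions[row["image"]].append(row["caption"].strip())
    return dict(captions)

# Hypothetical sample standing in for the real captions file.
sample = io.StringIO(
    "image,caption\n"
    "dog.jpg,A dog runs across the grass .\n"
    '"dog.jpg","A brown dog, mid-stride, on a lawn ."\n'
    "kids.jpg,Two children play on a beach .\n"
)

captions_by_image = load_captions(sample)
for image, caps in captions_by_image.items():
    print(image, len(caps), "captions")
```

With a real file, pass an open file handle instead of the `StringIO` object. Using the `csv` module rather than splitting on commas keeps captions that themselves contain commas intact.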

Project Pipeline

The project pipeline can be briefly summarized in the following five steps:

Data Understanding: Load the data and understand how it is represented.
Data Preprocessing: Process both images and captions into the desired format.
Train/Test Split: Combine the images and captions to create the train and test datasets.
Model Building: Create the image captioning model by building the Encoder, Attention, and Decoder models.
Model Evaluation: Evaluate the model using greedy search and the BLEU score.
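The evaluation step can be sketched as follows. Greedy search picks the most probable word at every step, and BLEU compares the resulting caption with the reference captions by n-gram overlap. This is a simplified sketch: `toy_step` is a hypothetical lookup table standing in for the trained decoder, and `bleu` is a plain sentence-level BLEU with uniform weights and no smoothing, not the implementation of any particular library.

```python
import math
from collections import Counter

def greedy_decode(step_fn, start_token, end_token, max_len=20):
    """Pick the highest-probability word at each step.

    step_fn(tokens) returns a dict mapping candidate next words to probabilities.
    Decoding stops at end_token or after max_len words.
    """
    tokens = [start_token]
    for _ in range(max_len):
        probs = step_fn(tokens)
        next_word = max(probs, key=probs.get)
        if next_word == end_token:
            break
        tokens.append(next_word)
    return tokens[1:]  # drop the start token

def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU with uniform weights and no smoothing."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        total = sum(cand_counts.values())
        if total == 0:
            return 0.0
        # Clip each n-gram count by its maximum count in any single reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram])
                      for gram, count in cand_counts.items())
        if clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty uses the reference length closest to the candidate length.
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(log_precisions) / max_n)

# Hypothetical stand-in for the decoder: next-word probabilities by previous word.
toy_model = {
    "<start>": {"a": 0.7, "the": 0.3},
    "a": {"dog": 0.6, "cat": 0.4},
    "dog": {"runs": 0.8, "<end>": 0.2},
    "runs": {"<end>": 0.9, "fast": 0.1},
}

def toy_step(tokens):
    return toy_model[tokens[-1]]

caption = greedy_decode(toy_step, "<start>", "<end>")
references = [["a", "dog", "runs"], ["a", "dog", "is", "running"]]
score = bleu(caption, references, max_n=2)
print(" ".join(caption), score)
```

Greedy search is fast but can miss captions that only score well as a whole sequence. Because each image has five reference captions, passing all of them as `references` gives a fairer score than comparing against a single caption.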

Owner

Ragesh Hajela
