Fake Shakespearean Text Generator

Overview

Fake Shakespearean Text Generator

This project contains an impelementation of stateful Char-RNN model to generate fake shakespearean texts.

Files and folders of the project.

models folder

This folder contains to zip file, one for stateful model and the other for stateless model (this model files are fully saved model architectures,not just weights).

weights.zip

As you its name implies, this zip file contains the model's weights as checkpoint format (see tensorflow model save formats).

tokenizer.save

This file is an saved and trained (sure on the dataset) instance of Tensorflow Tokenizer (used at inference time).

shakespeare.txt

This file is the dataset and composed of regular texts (see below what does it look like).

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

train.py

Contains codes for training.

inference.py

Contains codes for inference.

How to Train the Model

A more depth look into train.py file


First, it gets the dataset from the specified url (line 11). Then reads the dataset to train the tokenizer object just mentioned above and trains the tokenizer (line 18). After training, encodes the dataset (line 24). Since this is a stateful model, all sequences in batch should be start where the sequences at the same index number in the previous batch left off. Let's say a batch composes of 32 sequences. The 33th sequence (i.e. the first sequence in the second batch) should exactly start where the 1st sequence (i.e. first sequence in the first batch) ended up. The second sequence in the 2nd batch should start where 2nd sequnce in first batch ended up and so on. Codes between line 28 and line 48 do this and result the dataset. Codes between line 53 and line 57 create the stateful model. Note that to be able to adjust recurrent_dropout hyperparameter you have to train the model on a GPU. After creation of model, a callback to reset states at the beginning of each epoch is created. Then the training start with the calling fit method and then model (see tensorflow' entire model save), model's weights and the tokenizer is saved.

Usage of the Model

Where the magic happens (inference.py file)


To be able use the model, it should first converted to a stateless model due to a stateful model expects a batch of inputs instead of just an input. To do this a stateless model with the same architecture of stateful model should be created. Codes between line 44 and line 49 do this. To load weights the model should be builded. After building weight are loaded to the stateless model. This model uses predicted character at time step t as an inputs at time t + 1 to predict character at t + 2 and this operation keep goes until the prediction of last character (in this case it 100 but you can change it whatever you want. Note that the longer sequences end up with more inaccurate results). To predict the next characters, first the provided initial character should be tokenized. preprocess function does this. To prevent repeated characters to be shown in the generated text, the next character should be selected from candidate characters randomly. The next_char function does this. The randomness can be controlled with temperature parameter (to learn usage of it check the comment at line 30). The complete_text function, takes a character as an argument, predicts the next character via next_char function and concatenates the predicted character to the text. It repeats the process until to reach n_chars. Last, the stateless model will be saved also.

Results

Effects of the magic


print(complete_text("a"))

arpet:
like revenge borning and vinged him not.

lady good:
then to know to creat it; his best,--lord


print(complete_text("k"))

ken countents.
we are for free!

first man:
his honour'd in the days ere in any since
and all this ma


print(complete_text("f"))

ford:
hold! we must percy and he was were good.

gabes:
by fair lord, my courters,
sir.

nurse:
well


print(complete_text("h"))

holdred?
what she pass myself in some a queen
and fair little heartom in this trumpet our hands?
the

Owner
Recep YILDIRIM
Software Imagineering
Recep YILDIRIM
Simple and efficient RevNet-Library with DeepSpeed support

RevLib Simple and efficient RevNet-Library with DeepSpeed support Features Half the constant memory usage and faster than RevNet libraries Less memory

Lucas Nestler 112 Dec 05, 2022
An easier way to build neural search on the cloud

An easier way to build neural search on the cloud Jina is a deep learning-powered search framework for building cross-/multi-modal search systems (e.g

Jina AI 17.1k Jan 09, 2023
Applying "Load What You Need: Smaller Versions of Multilingual BERT" to LaBSE

smaller-LaBSE LaBSE(Language-agnostic BERT Sentence Embedding) is a very good method to get sentence embeddings across languages. But it is hard to fi

Jeong Ukjae 13 Sep 02, 2022
A flask application to predict the speech emotion of any .wav file.

This is a speech emotion recognition app. It will allow you to train a modular MLP model with the RAVDESS dataset, and then use that model with a flask application to predict the speech emotion of an

Aryan Vijaywargia 2 Dec 15, 2021
Text-Based zombie apocalyptic decision-making game in Python

Inspiration We shared university first year game coursework.[to gauge previous experience and start brainstorming] Adapted a particular nuclear fallou

Amin Sabbagh 2 Feb 17, 2022
中文生成式预训练模型

T5 PEGASUS 中文生成式预训练模型,以mT5为基础架构和初始权重,通过类似PEGASUS的方式进行预训练。 详情可见:https://kexue.fm/archives/8209 Tokenizer 我们将T5 PEGASUS的Tokenizer换成了BERT的Tokenizer,它对中文更

410 Jan 03, 2023
SimCTG - A Contrastive Framework for Neural Text Generation

A Contrastive Framework for Neural Text Generation Authors: Yixuan Su, Tian Lan,

Yixuan Su 345 Jan 03, 2023
Comprehensive-E2E-TTS - PyTorch Implementation

A Non-Autoregressive End-to-End Text-to-Speech (text-to-wav), supporting a family of SOTA unsupervised duration modelings. This project grows with the research community, aiming to achieve the ultima

Keon Lee 114 Nov 13, 2022
DensePhrases provides answers to your natural language questions from the entire Wikipedia in real-time

DensePhrases provides answers to your natural language questions from the entire Wikipedia in real-time. While it efficiently searches the answers out of 60 billion phrases in Wikipedia, it is also v

Jinhyuk Lee 543 Jan 08, 2023
A Flask Sentiment Analysis API, with visual implementation

The Sentiment Analysis Api was created using python flask module,it allows users to parse a text or sentence throught the (?text) arguement, then view the sentiment analysis of that sentence. It can

Ifechukwudeni Oweh 10 Jul 17, 2022
SimBERT升级版(SimBERTv2)!

RoFormer-Sim RoFormer-Sim,又称SimBERTv2,是我们之前发布的SimBERT模型的升级版。 介绍 https://kexue.fm/archives/8454 训练 tensorflow 1.14 + keras 2.3.1 + bert4keras 0.10.6 下载

317 Dec 23, 2022
CrossNER: Evaluating Cross-Domain Named Entity Recognition (AAAI-2021)

CrossNER is a fully-labeled collected of named entity recognition (NER) data spanning over five diverse domains (Politics, Natural Science, Music, Literature, and Artificial Intelligence) with specia

Zihan Liu 89 Nov 10, 2022
This repository contains the code, models and datasets discussed in our paper "Few-Shot Question Answering by Pretraining Span Selection"

Splinter This repository contains the code, models and datasets discussed in our paper "Few-Shot Question Answering by Pretraining Span Selection", to

Ori Ram 88 Dec 31, 2022
A paper list of pre-trained language models (PLMs).

Large-scale pre-trained language models (PLMs) such as BERT and GPT have achieved great success and become a milestone in NLP.

RUCAIBox 124 Jan 02, 2023
Train 🤗-transformers model with Poutyne.

poutyne-transformers Train 🤗 -transformers models with Poutyne. Installation pip install poutyne-transformers Example import torch from transformers

Lennart Keller 2 Dec 18, 2022
Finally, some decent sample sentences

tts-dataset-prompts This repository aims to be a decent set of sentences for people looking to clone their own voices (e.g. using Tacotron 2). Each se

hecko 19 Dec 13, 2022
Tool to check whether a GCP bucket is public or not.

Tool to check publicly accessible GCP bucket. Blog https://justm0rph3u5.medium.com/gcp-inspector-auditing-publicly-exposed-gcp-bucket-ac6cad55618c Wha

DIVYANSHU SHUKLA 7 Nov 24, 2022
Ukrainian TTS (text-to-speech) using Coqui TTS

title emoji colorFrom colorTo sdk app_file pinned Ukrainian TTS 🐸 green green gradio app.py false Ukrainian TTS 📢 🤖 Ukrainian TTS (text-to-speech)

Yurii Paniv 85 Dec 26, 2022
Simple, Pythonic, text processing--Sentiment analysis, part-of-speech tagging, noun phrase extraction, translation, and more.

TextBlob: Simplified Text Processing Homepage: https://textblob.readthedocs.io/ TextBlob is a Python (2 and 3) library for processing textual data. It

Steven Loria 8.4k Dec 26, 2022
Telegram AI chat bot written in Python using Pyrogram

Aurora_Al Just another Telegram AI chat bot written in Python using Pyrogram. A public running instance can be found on telegram as @AuroraAl. Require

♗CσNϙUҽRσR_MҽSƙEƚҽҽR 1 Oct 31, 2021