Finds snippets in iambic pentameter in English-language text and tries to combine them to a rhyming sonnet.

Overview

Sonnet finder

Finds snippets in iambic pentameter in English-language text and tries to combine them to a rhyming sonnet.

Usage

This is a Python script that should run without a GPU or any other special hardware requirements.

  1. Install the required packages, e.g. via: pip install -r requirements.txt

  2. Prepare a plain text file, say input.txt, with text you want to make a sonnet out of (sonnet-ize? sonnet-ify?). It can have multiple sentences on the same line, but a sentence should not be split across multiple lines.

    For example, I used pandoc --to=plain --wrap=none to generate a text file from my LaTeX papers. You could also start by grabbing some text files from Project Gutenberg.

  3. Run sonnet finder: python sonnet_finder.py input.txt -o output.tsv

    Using -o will save a list of all extracted candidate phrases, sorted by rhyme pattern, so you can generate new sonnets more quickly (see below) or browse and cherry-pick from the candidates to make your own sonnet out of these lines.

    Either way, the script will output a full example sonnet to STDOUT (provided enough rhyming pairs in iambic pentameter were found).

  4. If you've saved an output.tsv file before, you can quickly generate new sonnets via python sonnet_remix.py output.tsv. Since the stress and pronunciation prediction can be slow on larger files, this is much better than re-running sonnet_finder.py if you want more automatically generated suggestions.

Examples

This is a sonnet (with cherry-picked lines) made out of my PhD thesis:

the application of existing tools
describe a mapping to a modern form
applying similar replacement rules
the base ensembles slightly outperform

hungarian, icelandic, portuguese
perform a similar evaluation
contemporary lexemes or morphemes
a single dataset in isolation

historical and modern language stages
the weighted combination of encoder
the german dative ending -e in phrases
predictions fed into the next decoder

in this example from the innsbruck letter
machine translation still remains the better

These stanzas are compiled from a couple of automatically-generated suggestions based on the abstracts of all papers published in 2021 in the ACL Anthology:

effective algorithm that enables
improvements on a wide variety
and training with adjudicated labels
anxiety and test anxiety

obtain remarkable improvements on
decoder architecture, which equips
associated with the lexicon
surprising personal relationships

the impact of the anaphoric one
complexity prediction competition
developed for a laboratory run
existing parsers typically condition

examples, while in practice, most unseen
evaluate translation tasks between

Here's the same using Moby Dick:

among the marble senate of the dead
offensive matters consequent upon
a crawling reptile of the land, instead
fifteen, eighteen, and twenty hours on

the lakeman now patrolled the barricade
egyptian tablets, whose antiquity
the waters seemed a golden finger laid
maintains a permanent obliquity

the pequod with the little negro pippin
and with a frightful roll and vomit, he
increased, besides perhaps improving it in
transparent air into the summer sea

the traces of a simple honest heart
the fishery, and not the thousandth part

(The emjambment in the third stanza here is a lucky coincidence; the script currently doesn't do any kind of syntactic analysis or attempt coherence between lines.)

How it works

This script relies on the grapheme-to-phoneme library g2p_en by Park & Kim to convert the English input text to phoneme sequences (i.e., how the text would be pronounced). I chose this because it's a pip-installable Python library that fulfills two important criteria:

  1. it's not restricted to looking up pronunciations in a dictionary, but can handle arbitrary words through the use of a neural model (although, obviously, this will not always be accurate);

  2. it provides stress information for each vowel (i.e., whether any given vowel should be stressed or unstressed, which is important for determining the poetic meter).

The script then scans the g2p output for occurrences of iambic pentameter, i.e. a 0101010101(0) pattern, additionally checking if they coincide with word boundaries.

For finding snippets that rhyme, I rely mostly on Ghazvininejad et al. (2016), particularly §3 (relaxing the iambic pentameter a bit by allowing words that end in 100) and §5.2 (giving an operational definition of "slant rhyme" that I mostly try to follow).

QNA (Questions Nobody Asked)

  • Why does the script sometimes output lines that don't rhyme or don't fit the iambic meter? This script can only be as good as the grapheme-to-phoneme algorithm that's used. It frequently fails on words it doesn't know (for example, it tries to rhyme datasets with Portuguese?!) and also usually fails on abbreviations. Maybe there's a better g2p library that could be used, or the existing g2p_en could be modified to accept a custom dictionary, so you could manually define pronunciations for commonly used words.

  • Could this script also generate other types of poems? Sure. You could start by changing the regex iambic_pentameter to something else; maybe a sequence of dactyls? There are some further hardcoded assumptions in the code about iambic pentameter in the function get_stress_and_boundaries() that might have to be modified.

  • Could this script generate poems in languages other than English? This would require a suitable replacement for g2p_en that predicts pronunciations and stress patterns for the desired language, as well as re-writing the code that determines whether two phrases can rhyme; see the comments in the script for details. In particular, the code for English uses ARPABET notation for the pronunciation, which won't be suitable for other languages.

  • Can this script generate completely novel phrases in the style of an input text? This script does not "hallucinate" any text or generate anything that wasn't already there in the input; if you want to do that, take a look at Deep-speare maybe.

etc.

Written by Marcel Bollmann, inspired by a tweet, licensed under the MIT License.

I'm not the first one to write a script like this, but it was a fun exercise!

Owner
Marcel Bollmann
Computational linguist, postdoc, programming enthusiast.
Marcel Bollmann
Russian words synonyms and antonyms

ru_synonyms Russian words synonyms and antonyms. Install pip install git+https://github.com/ahmados/rusynonyms.git Usage from ru_synonyms import Anto

sumekenov 7 Dec 14, 2022
Entity Disambiguation as text extraction (ACL 2022)

ExtEnD: Extractive Entity Disambiguation This repository contains the code of ExtEnD: Extractive Entity Disambiguation, a novel approach to Entity Dis

Sapienza NLP group 121 Jan 03, 2023
The PyTorch based implementation of continuous integrate-and-fire (CIF) module.

CIF-PyTorch This is a PyTorch based implementation of continuous integrate-and-fire (CIF) module for end-to-end (E2E) automatic speech recognition (AS

Minglun Han 24 Dec 29, 2022
A very simple framework for state-of-the-art Natural Language Processing (NLP)

A very simple framework for state-of-the-art NLP. Developed by Humboldt University of Berlin and friends. IMPORTANT: (30.08.2020) We moved our models

flair 12.3k Dec 31, 2022
Code for the paper "A Simple but Tough-to-Beat Baseline for Sentence Embeddings".

Code for the paper "A Simple but Tough-to-Beat Baseline for Sentence Embeddings".

1.1k Dec 27, 2022
Repositório da disciplina no semestre 2021-2

Avisos! Nenhum aviso! Compiladores 1 Este é o Git da disciplina Compiladores 1. Aqui ficará o material produzido em sala de aula assim como tarefas, w

6 May 13, 2022
The guide to tackle with the Text Summarization

The guide to tackle with the Text Summarization

Takahiro Kubo 1.2k Dec 30, 2022
Learn meanings behind words is a key element in NLP. This project concentrates on the disambiguation of preposition senses. Therefore, we train a bert-transformer model and surpass the state-of-the-art.

New State-of-the-Art in Preposition Sense Disambiguation Supervisor: Prof. Dr. Alexander Mehler Alexander Henlein Institutions: Goethe University TTLa

Dirk Neuhäuser 4 Apr 06, 2022
pkuseg多领域中文分词工具; The pkuseg toolkit for multi-domain Chinese word segmentation

pkuseg:一个多领域中文分词工具包 (English Version) pkuseg 是基于论文[Luo et. al, 2019]的工具包。其简单易用,支持细分领域分词,有效提升了分词准确度。 目录 主要亮点 编译和安装 各类分词工具包的性能对比 使用方式 论文引用 作者 常见问题及解答 主要

LancoPKU 6k Dec 29, 2022
This is a Prototype of an Ai ChatBot "Tea and Coffee Supplier" using python.

Ai-ChatBot-Python A chatbot is an intelligent system which can hold a conversation with a human using natural language in real time. Due to the rise o

1 Oct 30, 2021
Collection of useful (to me) python scripts for interacting with napari

Napari scripts A collection of napari related tools in various state of disrepair/functionality. Browse_LIF_widget.py This module can be imported, for

5 Aug 15, 2022
BPEmb is a collection of pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) and trained on Wikipedia.

BPEmb is a collection of pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) and trained on Wikipedia. Its intended use is as input for neural models in natural languag

Benjamin Heinzerling 1.1k Jan 03, 2023
FedNLP: A Benchmarking Framework for Federated Learning in Natural Language Processing

FedNLP is a research-oriented benchmarking framework for advancing federated learning (FL) in natural language processing (NLP). It uses FedML repository as the git submodule. In other words, FedNLP

FedML-AI 216 Nov 27, 2022
Calibre recipe to convert latest issue of Analyse & Kritik into an ebook

Calibre Recipe für "Analyse & Kritik" Dies ist ein "Recipe" für die Konvertierung der aktuellen Ausgabe der Zeitung Analyse & Kritik in ein Ebook. Es

Henning 3 Jan 04, 2022
Python port of Google's libphonenumber

phonenumbers Python Library This is a Python port of Google's libphonenumber library It supports Python 2.5-2.7 and Python 3.x (in the same codebase,

David Drysdale 3.1k Dec 29, 2022
This code extends the neural style transfer image processing technique to video by generating smooth transitions between several reference style images

Neural Style Transfer Transition Video Processing By Brycen Westgarth and Tristan Jogminas Description This code extends the neural style transfer ima

Brycen Westgarth 110 Jan 07, 2023
SNCSE: Contrastive Learning for Unsupervised Sentence Embedding with Soft Negative Samples

SNCSE SNCSE: Contrastive Learning for Unsupervised Sentence Embedding with Soft Negative Samples This is the repository for SNCSE. SNCSE aims to allev

Sense-GVT 59 Jan 02, 2023
[ICLR'19] Trellis Networks for Sequence Modeling

TrellisNet for Sequence Modeling This repository contains the experiments done in paper Trellis Networks for Sequence Modeling by Shaojie Bai, J. Zico

CMU Locus Lab 460 Oct 13, 2022
Plugin repository for Macast

Macast-plugins Plugin repository for Macast. How to use third-party player plugin Download Macast from GitHub Release. Download the plugin you want fr

109 Jan 04, 2023
a chinese segment base on crf

Genius Genius是一个开源的python中文分词组件,采用 CRF(Conditional Random Field)条件随机场算法。 Feature 支持python2.x、python3.x以及pypy2.x。 支持简单的pinyin分词 支持用户自定义break 支持用户自定义合并词

duanhongyi 237 Nov 04, 2022