Syllabic Quantity Patterns as Rhythmic Features for Latin Authorship Attribution

Overview

Abstract

It is well known that, in Latin (and ancient Greek) literary production, distinctive metric schemes were followed not only in poetic compositions, but also in many prose works. Such metric patterns were based on syllabic quantity, i.e., on the length of the syllables involved (which can be long or short), and there is ample evidence that certain authors preferred certain rhythmic schemes over others.
In this project, we investigate the possibility of employing syllabic quantity as a basis for deriving rhythmic features for the task of computational Authorship Attribution of Latin prose texts. We test the impact of these features on the attribution task when combined with other topic-agnostic features, employing three datasets and two different learning algorithms.

Syllabic Quantity for Authorship Attribution

Authorship Attribution (AA) is a subtask of the field of Authorship Analysis, which aims to infer various characteristics of the writer of a document, including their identity. In particular, given a set of candidate authors {A1, ..., Am} and a document d, the goal of AA is to find the most probable author of d among the candidates; AA is thus a single-label multi-class classification problem, where the classes are the authors in the set.
In this project, we investigate the possibility of employing features extracted from the quantity of the syllables in a document as discriminative features for AA on Latin prose texts. Syllables are the sound units a single word can be divided into; in particular, a syllable can be thought of as an oscillation of sound in the pronunciation of the word, and is characterized by its quantity (long or short), i.e., the amount of time required to pronounce it.

It is well known that classical Latin (and Greek) poetry followed metric patterns based on sequences of short and long syllables. In particular, syllables were combined into what is called a "foot", and in turn a series of "feet" composed the metre of a verse. Similar metric schemes were also followed in many prose compositions, in order to give a certain cadence to the discourse and to focus the attention on specific parts. In particular, the end of sentences and periods was deemed especially important in this sense, and was known as the clausola.

During the Middle Ages, Latin prosody underwent a gradual but profound change: syllabic quantity lost its relevance as a language discriminator, in favour of the accent, or stress. However, Latin accentuation rules largely depend on syllabic quantity, and medieval writers retained the classical importance of the clausola, which became based on stresses and known as the cursus. An author's preference for certain rhythmic patterns might thus play an important role in the identification of that author's style.

Datasets

In this project, we employ three different datasets:

  • LatinitasAntiqua. The texts can be automatically downloaded with the script in the corresponding code file (src/dataset_prep/LatinitasAntiqua_prep.py). They come from the Corpus Corporum repository, developed by the University of Zurich, and in particular from its sub-section called Latinitas antiqua, which contains various Latin works from the Perseus Digital Library; in total, the corpus is composed of 25 Latin authors and 90 prose texts, spanning the Classical, Imperial and Early-Medieval periods and a variety of genres (mostly epistolary, historical and rhetorical).
  • KabalaCorpusA. The texts can be downloaded from the following [link](https://www.jakubkabala.com/gallus-monk/). In particular, we use Corpus A, which consists of 39 texts by 22 authors from the 11th-12th centuries.
  • MedLatin. The texts can be downloaded from the following link: . Originally, the texts were divided into two datasets, which we combine into one. Note that we exclude the texts from the collection of Petrus de Boateriis, since it is a miscellany of different authors. We delete quotations from other authors and insertions in languages other than Latin, which are marked in the texts.
The documents are automatically pre-processed in order to clean them of external information and noise. In particular, headings, editors' notes and other meta-information are deleted where present. Symbols (such as asterisks or parentheses) and Arabic numerals are deleted as well. Punctuation marks are normalized: every occurrence of question marks, exclamation points, semicolons, colons and suspension points is replaced with a single period, while commas are deleted. The text is finally lower-cased and normalized: the character v is replaced with u, the character j with i, and every stressed vowel is replaced with the corresponding unstressed vowel. As a final step, each text is divided into sentences, where a sentence is made of at least 5 distinct words (shorter sentences are attached to the next sentence in the sequence, or to the previous one if they are the last in the document). This allows us to create the fragments that ultimately form the training, validation and test sets for the algorithms. In particular, each fragment is made of 10 consecutive, non-overlapping sentences.
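
As a rough illustration, below is a minimal Python sketch of the normalization and fragmentation steps described above; it is not the project's actual pre-processing code (which lives in src/dataset_prep/), and the set of accented characters handled is an assumption made for illustration only.

```python
import re

def normalize(text: str) -> str:
    """Sketch of the cleaning steps described above."""
    text = re.sub(r"[*()\[\]0-9]", "", text)        # drop symbols and Arabic numerals
    text = re.sub(r"\.{3}|[?!;:]", ".", text)       # unify strong punctuation into a single period
    text = text.replace(",", "")                    # delete commas
    text = text.lower().replace("v", "u").replace("j", "i")
    # map stressed vowels to the corresponding unstressed ones (assumed, non-exhaustive list)
    for accented, plain in zip("àèìòùáéíóú", "aeiouaeiou"):
        text = text.replace(accented, plain)
    return text

def make_fragments(text: str, min_words: int = 5, sents_per_frag: int = 10):
    """Split into sentences of at least `min_words` distinct words (shorter ones are
    merged with the following sentence), then group them into fragments of
    `sents_per_frag` consecutive, non-overlapping sentences."""
    sentences, buffer = [], []
    for raw in text.split("."):
        buffer.extend(raw.split())
        if len(set(buffer)) >= min_words:
            sentences.append(" ".join(buffer))
            buffer = []
    if buffer:                                      # a trailing short sentence joins the previous one
        if sentences:
            sentences[-1] += " " + " ".join(buffer)
        else:
            sentences.append(" ".join(buffer))
    return [" ".join(sentences[i:i + sents_per_frag])
            for i in range(0, len(sentences), sents_per_frag)]
```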

Experiments

In order to transform the Latin texts into the corresponding syllabic quantity (SQ) encoding, we employ the prosody library available in the [Classical Language ToolKit](http://cltk.org/).
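
As an illustration of how the SQ encoding can be obtained, here is a minimal sketch; the exact CLTK module path and the format of its scansion output vary across versions, so scan_sentence below is a hypothetical wrapper around the toolkit's Latin prosody scanner rather than a documented API.

```python
def scan_sentence(sentence: str) -> str:
    """Hypothetical wrapper around the CLTK Latin prosody scanner: returns the
    scansion of one sentence as a string of quantity marks, e.g. '-' for long,
    'u' for short, 'x' for undetermined syllables."""
    raise NotImplementedError("delegate to the CLTK prosody scanner here")

def sq_encode(fragment: str) -> str:
    """SQ encoding of a fragment: the concatenation of its sentences' scansions.
    Character n-grams are later extracted from this string (see the SVM sketch below)."""
    return " ".join(scan_sentence(s) for s in fragment.split(".") if s.strip())
```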
We also experiment with the four Distortion Views presented by [Stamatatos](https://asistdl.onlinelibrary.wiley.com/doi/full/10.1002/asi.23968?casa_token=oK9_O2SOpa8AAAAA%3ArLsIRzk4IhphR7czaG6BZwLmhh9mk4okCj--kXOJolp1T70XzOXwOw-4vAOP8aLKh-iOTar1mq8nN3B7), which, given a list Fw of function words, are defined as follows (a minimal sketch of these transformations is given after the list):

  • Distorted View – Multiple Asterisks (DVMA): every word not included in Fw is masked by replacing each of its characters with an asterisk.
  • Distorted View – Single Asterisk (DVSA): every word not included in Fw is masked by replacing it with a single asterisk.
  • Distorted View – Exterior Characters (DVEX): every word not included in Fw is masked by replacing each of its characters with an asterisk, except the first and last one.
  • Distorted View – Last 2 (DVL2): every word not included in Fw is masked by replacing each of its characters with an asterisk, except the last two characters.
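
The following is a minimal sketch of the four distortion views defined above, assuming a predefined list of Latin function words (function_words); it only illustrates the transformations and is not the project's actual implementation.

```python
def distort(text: str, function_words, view: str = "DVMA") -> str:
    """Apply one of the four distortion views to a whitespace-tokenized text."""
    out = []
    for w in text.split():
        if w in function_words:
            out.append(w)                                   # function words are kept verbatim
        elif view == "DVMA":
            out.append("*" * len(w))                        # all characters masked
        elif view == "DVSA":
            out.append("*")                                 # whole word -> single asterisk
        elif view == "DVEX":
            out.append(w[0] + "*" * (len(w) - 2) + w[-1] if len(w) > 2 else w)
        elif view == "DVL2":
            out.append("*" * (len(w) - 2) + w[-2:] if len(w) > 2 else w)
        else:
            raise ValueError(f"unknown distortion view: {view}")
    return " ".join(out)
```
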
These encodings are combined with a set of topic-agnostic features, named BaseFeatures (BFs), made of: function words, word lengths, and sentence lengths.
We experiment with two different learning methods: a Support Vector Machine (SVM) and a Neural Network (NN). All the experiments are conducted on the same train-validation-test split.
For the former, we compute the TfIdf of character n-grams in various ranges, extracted from the various encodings of the text, concatenate them with the BaseFeatures, and feed the resulting feature matrix to a LinearSVC implemented in the [scikit-learn package](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html).
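
A hedged sketch of this SVM branch is given below; the n-gram range and the classifier's hyperparameters are illustrative and not necessarily those used in the project.

```python
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

def fit_svm(train_texts, base_features, y_train, ngram_range=(1, 3)):
    """TF-IDF over character n-grams of one encoding, concatenated with the
    BaseFeatures matrix and fed to a LinearSVC."""
    vectorizer = TfidfVectorizer(analyzer="char", ngram_range=ngram_range, sublinear_tf=True)
    X_ngrams = vectorizer.fit_transform(train_texts)
    X = hstack([X_ngrams, csr_matrix(base_features)])   # concatenate n-grams and BFs
    clf = LinearSVC().fit(X, y_train)
    return vectorizer, clf
```
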
For the latter, the network is composed of various parallel, identical branches, each one processing a single encoding or the BFs matrix, whose outputs are finally combined into a single decision layer. The network is implemented with the [PyTorch package](https://pytorch.org/). Each branch outputs a matrix of probabilities; these are stacked together, and an average-pooling operation is applied in order to obtain the average of the decisions of the different branches. The final decision is obtained through a final dense layer applying a softmax (for training) or argmax (for testing) operation over the class probabilities. The training of the network is conducted with the traditional backpropagation method; we employ cross-entropy as the loss function and the Adam optimizer.
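
The sketch below only illustrates this ensemble idea (per-branch probability matrices, stacking, average pooling, final dense layer); the branch internals are simplified stand-ins for the actual CNN branches defined in NN_models/NN_cnn_deep_ensemble.py.

```python
import torch
import torch.nn as nn

class BranchEnsemble(nn.Module):
    """Illustrative multi-branch ensemble: one branch per encoding (or for the
    BFs matrix), average pooling over the branches' probabilities, final dense layer."""

    def __init__(self, input_dims, n_classes):
        super().__init__()
        # each branch here is a single linear layer + softmax, standing in for a CNN branch
        self.branches = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, n_classes), nn.Softmax(dim=1)) for dim in input_dims]
        )
        self.decision = nn.Linear(n_classes, n_classes)   # final dense layer

    def forward(self, inputs):
        # one (batch, n_classes) probability matrix per branch, stacked along a new axis
        probs = torch.stack([branch(x) for branch, x in zip(self.branches, inputs)], dim=2)
        pooled = probs.mean(dim=2)                        # average-pool the branch decisions
        return self.decision(pooled)                      # logits; softmax/argmax applied downstream

# Training would combine these logits with nn.CrossEntropyLoss() (which applies the
# softmax internally) and torch.optim.Adam, as described above.
```
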
We employ macro-F1 and micro-F1 as evaluation measures to assess the performance of the methods. For each method employing SQ-based features, we compute the statistical significance against its baseline (the same method without SQ-based features); to this aim, we employ McNemar's paired non-parametric statistical hypothesis test, taking 0.05 as the significance level.
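
A minimal sketch of this significance check, using the McNemar implementation from statsmodels (the project's own routine is in src/general/significance.py):

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def mcnemar_pvalue(y_true, pred_baseline, pred_sq):
    """Build the 2x2 contingency table of per-fragment correctness for the baseline
    and the SQ-enhanced method, then run McNemar's paired test."""
    y_true = np.asarray(y_true)
    ok_base = np.asarray(pred_baseline) == y_true
    ok_sq = np.asarray(pred_sq) == y_true
    table = [[np.sum(ok_base & ok_sq),  np.sum(ok_base & ~ok_sq)],
             [np.sum(~ok_base & ok_sq), np.sum(~ok_base & ~ok_sq)]]
    return mcnemar(table, exact=True).pvalue   # compare against the 0.05 significance level
```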

NN architecture

Code

The code is organized as follows in the src directory:

  • NN_models: the directory contains the code to build the Neural Networks tried in the project, one file for each architecture. The one finally used in the project is in the file NN_cnn_deep_ensemble.py.
  • dataset_prep: the directory contains the code to preprocess the various datasets employed in the project. The file NN_dataloader.py prepares the data to be processed by the Neural Network.
  • general: the directory contains: a) helpers.py, with various functions useful for the current project; b) significance.py, with the code for the significance test; c) utils.py, with more general utility functions; d) visualization.py, with functions for drawing plots and similar.
  • NN_classification.py: it performs the Neural Networks experiments.
  • SVM_classification.py: it performs the Support Vector Machine experiments.
  • feature_extractor.py: it extracts the features for the SVM experiments.
  • main.py
