Ground truth data for the Optical Character Recognition of Historical Classical Commentaries.

Last update: Sep 08, 2022

Overview

OCR Ground Truth for Historical Commentaries

The OCR Ground Truth for Historical Commentaries dataset (GT4HistComment) was created from the public domain subset of scholarly commentaries on Sophocles' Ajax. Its main goal is to enable the evaluation of OCR quality on printed materials that mix Latin and polytonic Greek scripts. It consists of five 19th-century commentaries written in German, English, and Latin, for a total of 3,356 ground-truth (GT) lines.
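As an illustration of the kind of evaluation the dataset is meant to support, here is a minimal sketch of a character error rate (CER) computation between a GT line and an OCR output line. CER is a common OCR metric, but this README does not state which metric or normalization the authors used; the function names `levenshtein` and `cer` and the NFC normalization step are assumptions made for this example.

```python
import unicodedata


def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance divided by reference length.

    Assumption: both strings are NFC-normalized first, so that precomposed
    and decomposed polytonic Greek characters compare as equal.
    """
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference must not be empty")
    return levenshtein(ref, hyp) / len(ref)
```

Normalizing before comparison matters for polytonic Greek, where the same accented letter can be encoded either as one code point or as a base letter plus combining marks.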

Data

The GT4HistComment data are contained in data/, where each sub-folder corresponds to a different publication (i.e. commentary). For each commentary we provide the following data:

<commentary_id>/GT-pairs: pairs of image/text files for each GT line
<commentary_id>/imgs: original images on which the OCR was performed
<commentary_id>/<commentary_id>_olr.tsv: OLR annotations with image region coordinates and layout type ground truth label
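The layout above can be read with a few lines of Python. The following is a minimal sketch for iterating over the image/text pairs in a GT-pairs folder. It assumes that the two files of a pair share the same name up to the first dot and that transcriptions are UTF-8 `.txt` files; the list of image extensions and the helper name `iter_gt_pairs` are likewise assumptions, since this README does not specify the file naming scheme.

```python
from pathlib import Path

# Assumption: these extensions are guesses; adjust them to the actual files.
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def iter_gt_pairs(gt_dir):
    """Yield (image_path, transcription) for each image with a matching text file."""
    gt_dir = Path(gt_dir)
    texts = {}
    for p in gt_dir.iterdir():
        if p.suffix.lower() == ".txt":
            # "line_001.gt.txt" and "line_001.txt" both map to the key "line_001"
            texts[p.name.split(".")[0]] = p
    for img in sorted(gt_dir.iterdir()):
        if img.suffix.lower() in IMAGE_EXTS:
            key = img.name.split(".")[0]
            if key in texts:
                yield img, texts[key].read_text(encoding="utf-8").strip()
```

A call such as `list(iter_gt_pairs("data/bsb10234118/GT-pairs"))` would then return the pairs for one commentary, provided the naming assumptions hold. Images without a matching text file are skipped silently.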

The OCR output produced by the Kraken + Ciaconna pipeline was manually corrected by a pool of annotators using the Lace platform. In order to ensure the quality of the ground truth datasets, an additional verification of all transcriptions made in Lace was carried out by an annotator on line-by-line pairs of image and corresponding text.

Commentary overview

ID	Commentator	Year	Languages	Image source
bsb10234118	Lobeck [1]	1835	Greek, Latin	BSB
sophokle1v3soph	Schneidewin [2]	1853	Greek, German	Internet Archive
cu31924087948174	Campbell [3]	1881	Greek, English	Internet Archive
sophoclesplaysa05campgoog	Jebb [5]	1896	Greek, English	Internet Archive
Wecklein1894	Wecklein [4]	1894	Greek, German	internal

Stats

Line, word and character counts for each commentary are given in the following table. Detailed counts for each region are provided in a separate linked spreadsheet.

ID	Commentator	Type	lines	words	all chars	greek chars
bsb10234118	Lobeck	training	574	2943	16081	5344
bsb10234118	Lobeck	groundtruth	202	1491	7917	2786
sophokle1v3soph	Schneidewin	training	583	2970	16112	3269
sophokle1v3soph	Schneidewin	groundtruth	382	1599	8436	2191
cu31924087948174	Campbell	groundtruth	464	2987	14291	3566
sophoclesplaysa05campgoog	Jebb	training	561	4102	19141	5314
sophoclesplaysa05campgoog	Jebb	groundtruth	324	2418	10986	2805
Wecklein1894	Wecklein	groundtruth	211	1912	9556	3268
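Counts of this kind can be approximated from the GT transcriptions. The sketch below is not the authors' counting script: it assumes that words are whitespace-separated tokens, that characters exclude whitespace, and that a "greek char" is any code point in the Unicode blocks Greek and Coptic (U+0370 to U+03FF) or Greek Extended (U+1F00 to U+1FFF) after NFC normalization. The names `is_greek` and `count_stats` are hypothetical.

```python
import unicodedata


def is_greek(ch):
    """True if ch lies in the Greek and Coptic or Greek Extended blocks."""
    cp = ord(ch)
    return 0x0370 <= cp <= 0x03FF or 0x1F00 <= cp <= 0x1FFF


def count_stats(lines):
    """Count non-empty lines, words, non-whitespace chars and Greek chars."""
    # NFC keeps accented Greek letters as single precomposed code points.
    lines = [unicodedata.normalize("NFC", ln) for ln in lines if ln.strip()]
    return {
        "lines": len(lines),
        "words": sum(len(ln.split()) for ln in lines),
        "chars": sum(1 for ln in lines for ch in ln if not ch.isspace()),
        "greek_chars": sum(1 for ln in lines for ch in ln if is_greek(ch)),
    }
```

Because the counting rules here are assumptions, the results may differ from the figures in the table.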

Commentary editions used:

[1] Lobeck, Christian August. 1835. Sophoclis Aiax. Leipzig: Weidmann.
[2] Sophokles. 1853. Sophokles Erklaert von F. W. Schneidewin. Erstes Baendchen: Aias. Philoktetes. Edited by Friedrich Wilhelm Schneidewin. Leipzig: Weidmann.
[3] Campbell, Lewis. 1881. Sophocles. Oxford: Clarendon Press.
[4] Wecklein, Nikolaus. 1894. Sophokleus Aias. München: Lindauer.
[5] Jebb, Richard Claverhouse. 1896. Sophocles: The Plays and Fragments. London: Cambridge University Press.

Citation

If you use this dataset in your research, please cite the following publication:

@inproceedings{romanello_optical_2021,
  title = {Optical {{Character Recognition}} of 19th {{Century Classical Commentaries}}: The {{Current State}} of {{Affairs}}},
  booktitle = {The 6th {{International Workshop}} on {{Historical Document Imaging}} and {{Processing}} ({{HIP}} '21)},
  author = {Romanello, Matteo and Najem-Meyer, Sven and Robertson, Bruce},
  year = {2021},
  publisher = {{Association for Computing Machinery}},
  address = {{Lausanne}},
  doi = {10.1145/3476887.3476911}
}

Acknowledgements

Data in this repository were produced in the context of the Ajax Multi-Commentary project, funded by the Swiss National Science Foundation under an Ambizione grant PZ00P1_186033.

Contributors: Carla Amaya (UNIL), Sven Najem-Meyer (EPFL), Matteo Romanello (UNIL), Bruce Robertson (Mount Allison University).

Comments

adds line-, word- and char-counts to README.md

Adds a table to README.md, as suggested by reviewer 1. The table also links to a more complete table, which is a public version of the spreadsheet OCR evaluation and stats!detailed_counts. Note that the public version is an external reference to our private version, so updating the latter will also update the former.

opened by sven-nm

Pages to exclude - OCR

The page contains the metrical schemes of the passages. As a result, the OCR does not recognize them; moreover, the OCR correction has not been completed.

Pages to exclude: sophoclesplaysa05campgoog_0072.png (Jebb, p. 72)

opened by camaya28

Releases

v1.0 (Sep 24, 2021)

Owner

Ajax Multi-Commentary
