A dataset for online Arabic calligraphy

Last update: Dec 28, 2022

Overview

Calliar

Calliar is a dataset for Arabic calligraphy. It consists of 2,500 JSON files containing manually annotated strokes of Arabic calligraphy. This repository hosts the dataset for the following paper:

Calliar: An Online Handwritten Dataset for Arabic Calligraphy
Zaid Alyafeai, Maged S. Al-shaibani, Mustafa Ghaleb, Yousif Ahmed Al-Wajih
https://arxiv.org/abs/2106.10745

Abstract: Calligraphy is an essential part of the Arabic heritage and culture. It has been used in the past for the decoration of houses and mosques. Usually, such calligraphy is designed manually by experts with aesthetic insights. In the past few years, there has been a considerable effort to digitize such type of art by either taking a photo of decorated buildings or drawing them using digital devices. The latter is considered an online form where the drawing is tracked by recording the apparatus movement, an electronic pen for instance, on a screen. In the literature, there are many offline datasets collected with a diversity of Arabic styles for calligraphy. However, there is no available online dataset for Arabic calligraphy. In this paper, we illustrate our approach for the collection and annotation of an online dataset for Arabic calligraphy called Calliar that consists of 2,500 sentences. Calliar is annotated for stroke, character, word and sentence level prediction.

Stats

Dataset	# of Samples	# of Words	# of Chars	# of Strokes
Train	2,000	6,065	24,722	36,561
Valid	250	738	2,946	4,410
Test	250	753	3,052	4,601

Dataset Formats

The dataset is available in two formats.

.json

Each .json file contains a list of strokes. Each stroke is a dictionary that pairs the stroke's character label with its list of points. Each composite character such as ت is mapped to a list of primitive strokes, i.e., ..ٮ. Refer to the paper and to chars.py for more details on the mapping.
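As a rough illustration of the layout described above, the sketch below walks over a small in-memory sample and counts the points in each stroke. The sample data and the helper name summarize_strokes are hypothetical, and the exact layout (a list of single-key dictionaries mapping a stroke label to [x, y] points) is an assumption; check it against a real file and chars.py before relying on it.

```python
import json

# Hypothetical sample mimicking the ASSUMED layout: a list of single-key
# dictionaries, each mapping a stroke label to its list of [x, y] points.
sample_text = json.dumps([
    {"ٮ": [[10, 20], [15, 22], [20, 20]]},
    {".": [[14, 30]]},
    {".": [[17, 30]]},
])


def summarize_strokes(drawing):
    """Return a (label, number_of_points) pair for every stroke."""
    summary = []
    for stroke in drawing:
        for label, points in stroke.items():
            summary.append((label, len(points)))
    return summary


drawing = json.loads(sample_text)
summary = summarize_strokes(drawing)
for label, n_points in summary:
    print(label, n_points)
```

For a real file, the same function could be applied to the result of json.load on the opened file, provided the layout assumption holds.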

.npz

The compressed version of the dataset, dataset.npz, is only 8.6 MB. It uses the Ramer-Douglas-Peucker algorithm to reduce the number of points per stroke; the Python library rdp was used for this task. The .npz format follows the same approach as QuickDraw.
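To give a feel for what Ramer-Douglas-Peucker simplification does to a stroke, here is a minimal pure-Python sketch of the algorithm. It is not the rdp library's implementation and not the script used to build dataset.npz; the function names, the sample stroke, and the epsilon value are all illustrative assumptions.

```python
import math


def perpendicular_distance(point, start, end):
    """Distance from point to the line through start and end."""
    (x, y), (x1, y1), (x2, y2) = point, start, end
    dx, dy = x2 - x1, y2 - y1
    if dx == 0 and dy == 0:
        # Degenerate segment: fall back to point-to-point distance.
        return math.hypot(x - x1, y - y1)
    return abs(dy * (x - x1) - dx * (y - y1)) / math.hypot(dx, dy)


def rdp_simplify(points, epsilon):
    """Drop points that lie within epsilon of the simplified polyline."""
    if len(points) < 3:
        return list(points)
    start, end = points[0], points[-1]
    max_dist, index = 0.0, 0
    for i in range(1, len(points) - 1):
        dist = perpendicular_distance(points[i], start, end)
        if dist > max_dist:
            max_dist, index = dist, i
    if max_dist > epsilon:
        # Keep the farthest point and simplify both halves recursively.
        left = rdp_simplify(points[:index + 1], epsilon)
        right = rdp_simplify(points[index:], epsilon)
        return left[:-1] + right
    return [start, end]


# Illustrative stroke; epsilon=1.0 is an arbitrary tolerance.
stroke = [(0, 0), (1, 0.1), (2, -0.1), (3, 5), (4, 6), (5, 7), (6, 8.1), (7, 9)]
simplified = rdp_simplify(stroke, epsilon=1.0)
print(len(stroke), "->", len(simplified), "points")
```

A larger epsilon removes more points at the cost of fidelity, which is the trade-off behind the small file size mentioned above.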

Visualization

The vis.py file contains Python functions for visualizing the dataset. The two examples below draw a sample JSON file and create an animation.

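import glob
import json

import matplotlib.pyplot as plt
from IPython.display import display, HTML, Video
from vis import *

# NOTE: json_path must be defined beforehand and point to one of the
# dataset's .json files.

## show an image of the strokes
with open(json_path) as f:
    drawing = json.load(f)
print(get_annotation(json_path))  # print the annotation for this sample
data = convert_3d(drawing)        # convert the strokes for drawing
draw_strokes(data, stroke_width=2, crop=True)

## create an animation
create_animation(json_path)
Video("tmp/video.mp4")  # displays the video in a notebook cell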

Samples

Animation

video_twitter.mp4

video_twitter_1.mp4

video_twitter_2.mp4

video_twitter_3.mp4

Citation

@misc{alyafeai2021calliar,
      title={Calliar: An Online Handwritten Dataset for Arabic Calligraphy}, 
      author={Zaid Alyafeai and Maged S. Al-shaibani and Mustafa Ghaleb and Yousif Ahmed Al-Wajih},
      year={2021},
      eprint={2106.10745},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}


Owner

ARBML
