Masader
The first online catalogue for Arabic NLP datasets. This catalogue contains 200 datasets with more than 25 metadata annotations for each dataset. You can view the list of all datasets on the website: https://arbml.github.io/masader/
Title Masader: Metadata Sourcing for Arabic Text and Speech Data Resources
Authors Zaid Alyafeai, Maraim Masoud, Mustafa Ghaleb, Maged S. Al-shaibani
https://arxiv.org/abs/2110.06744
Abstract: The NLP pipeline has evolved dramatically in the last few years. The first step in the pipeline is to find suitable annotated datasets to evaluate the tasks we are trying to solve. Unfortunately, most of the published datasets lack metadata annotations that describe their attributes. Not to mention, the absence of a public catalogue that indexes all the publicly available datasets related to specific regions or languages. When we consider low-resource dialectical languages, for example, this issue becomes more prominent. In this paper we create Masader, the largest public catalogue for Arabic NLP datasets, which consists of 200 datasets annotated with 25 attributes. Furthermore, we develop a metadata annotation strategy that could be extended to other languages. We also make remarks and highlight some issues about the current status of Arabic NLP datasets and suggest recommendations to address them.
Metadata
No.: dataset number
Name: name of the dataset
Subsets: subsets of the dataset
Link: direct link to the dataset or instructions on how to download it
License: license of the dataset
Year: year the dataset or paper was published
Language: ar or multilingual
Dialect: region ar-LEV: (Arabic (Levant)), country ar-EGY: (Arabic (Egypt)) or type ar-MSA: (Arabic (Modern Standard Arabic))
Domain: social media, news articles, reviews, commentary, books, transcribed audio or other
Form: text, audio or sign language
Collection style: crawling, crawling and annotation (translation), crawling and annotation (other), machine translation, human translation, human curation or other
Description: short statement describing the dataset
Volume: the size of the dataset in numbers
Unit: unit of the volume; could be tokens, sentences, documents, MB, GB, TB, hours or other
Provider: company or university providing the dataset
Related Datasets: any datasets related in content to this dataset
Paper Title: title of the paper
Paper Link: direct link to the paper PDF
Script: writing system, either Arab, Latn, Arab-Latn or other
Tokenized: whether the dataset is segmented using morphology: Yes or No
Host: the host website for the data, e.g. GitHub
Access: whether the data is free, upon-request or with-fee
Cost: cost of the data if access is with-fee
Test split: whether the data contains a test split: Yes or No
Tasks: the tasks included in the dataset, separated by commas
Evaluation Set: whether the data is included in the evaluation suite by BigScience
Venue Title: the venue title, e.g. ACL
Citations: the number of citations
Venue Type: conference, workshop, journal or preprint
Venue Name: full name of the venue, e.g. Association for Computational Linguistics
authors: list of the paper authors, separated by commas
affiliations: list of the paper authors' affiliations, separated by commas
abstract: abstract of the paper
Added by: name of the person who added the entry
Notes: any extra notes on the dataset
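Several of the fields above take values from a small fixed set, so an entry can be checked mechanically before it is added. The following is a minimal sketch of such a check, not part of Masader itself: the names `ALLOWED_VALUES`, `validate_entry` and `example_entry` are hypothetical, the entry is made up for illustration, and the rule that Cost must be filled in when Access is with-fee is an assumption drawn from the field description rather than a documented requirement.

```python
# Allowed values are copied from the field descriptions above. All names in
# this sketch are hypothetical and not part of the Masader project.
ALLOWED_VALUES = {
    "Language": {"ar", "multilingual"},
    "Form": {"text", "audio", "sign language"},
    "Script": {"Arab", "Latn", "Arab-Latn", "other"},
    "Tokenized": {"Yes", "No"},
    "Access": {"free", "upon-request", "with-fee"},
    "Test split": {"Yes", "No"},
    "Venue Type": {"conference", "workshop", "journal", "preprint"},
}


def validate_entry(entry):
    """Return a list of human-readable problems found in one catalogue entry."""
    problems = []
    for field, allowed in ALLOWED_VALUES.items():
        value = entry.get(field)
        if value is None:
            problems.append(f"missing field: {field}")
        elif value not in allowed:
            problems.append(f"{field}: unexpected value {value!r}")
    # Assumption: a with-fee dataset should state its cost.
    if entry.get("Access") == "with-fee" and not entry.get("Cost"):
        problems.append("Cost: required when Access is with-fee")
    return problems


# A made-up entry for illustration only; it does not describe a real dataset.
example_entry = {
    "Name": "Example Corpus",
    "Language": "ar",
    "Form": "text",
    "Script": "Arab",
    "Tokenized": "No",
    "Access": "with-fee",
    "Test split": "Maybe",
    "Venue Type": "workshop",
}

for problem in validate_entry(example_entry):
    print(problem)
```

Here the check reports the unexpected `Test split` value and the missing `Cost`. Free-text fields such as Description or Notes are deliberately left unchecked.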
Contribution
If you want to add a new dataset, feel free to update the sheet. Please follow the instructions there when adding an entry.
Citation
@misc{alyafeai2021masader,
title={Masader: Metadata Sourcing for Arabic Text and Speech Data Resources},
author={Zaid Alyafeai and Maraim Masoud and Mustafa Ghaleb and Maged S. Al-shaibani},
year={2021},
eprint={2110.06744},
archivePrefix={arXiv},
primaryClass={cs.CL}
}