darija <-> english dictionary

Last update: Jan 01, 2023

darija-dictionary

Advanced IT solutions that are well adapted to the Moroccan context inevitably depend on understanding the Moroccan dialect. Hence, darija (the Moroccan dialect) should be an active player in the field of Natural Language Processing (NLP).

However, step 0 in any serious NLP work on darija is translating its vocabulary into English, the most widely used and best documented language in the field.

This open-source project aims to be a reference for addressing this issue. We hope the Moroccan IT community will contribute so that we can build the largest dataset of darija-English vocabulary, which will serve as a foundation for future NLP applications that benefit Moroccan people.

How to contribute

We've made a tutorial for you on DODa's website.

Guidelines / Recommendations

3ndk ح dir ح xD (shout-out to this guy 😆). Try to use the following equivalents:

darija	3	7	9	8	2 - 'a' - 'i'	5 - 'kh'
arabic	ع	ح	ق	ه	همزة	خ

Try to use capitalization to differentiate between the following letters:

t	T	s	S	d	D
ت	ط	س	ص	د	ض

Arabic characters with two-letter Latin equivalents:

Arabic alphabet	ش	غ	خ
Latin alphabet	ch	gh	kh

Double a character to mark emphasis, or "الشدة":

darija	7mam	7mmam
english	pigeons	bathroom
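
The alphabet conventions above can be sketched as a small lookup. This is only an illustration, not part of the DODa tooling: the names `tokenize` and `has_shadda` are hypothetical, the mapping covers only the characters listed in the tables, and using the character ء for "همزة" is an assumption.

```python
# Mappings copied from the tables above; anything not listed is left untouched.
# Assumption: the hamza ("همزة") is represented here by the character "ء".
DIGITS = {"3": "ع", "7": "ح", "9": "ق", "8": "ه", "2": "ء", "5": "خ"}
DIGRAPHS = {"ch": "ش", "gh": "غ", "kh": "خ"}
CASE_SENSITIVE = {"t": "ت", "T": "ط", "s": "س", "S": "ص", "d": "د", "D": "ض"}


def tokenize(word):
    """Split a Latin-script darija word into (latin, arabic_or_None) pairs."""
    tokens = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:  # two-letter equivalents take priority
            tokens.append((pair, DIGRAPHS[pair]))
            i += 2
            continue
        char = word[i]
        # None means the tables above give no equivalent for this character.
        tokens.append((char, DIGITS.get(char) or CASE_SENSITIVE.get(char)))
        i += 1
    return tokens


def has_shadda(word):
    """Return True if any character is immediately doubled in the word."""
    return any(a == b for a, b in zip(word, word[1:]))


print(tokenize("sou9"))
print(has_shadda("7mam"), has_shadda("7mmam"))
```

Note that `has_shadda` is a simplification: it flags any doubled character, including vowels, so it is only a rough check of the doubling convention.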

We usually don't add "e" at the end of darija words: louz instead of louze
We usually don't use "Z" or "th" for ظ ، ذ ، ث , because these letters are generally not used in darija (except in northern Morocco; for the sake of simplicity, we focus primarily on standard darija)
We do NOT use apostrophes. Since we are working with CSV files, apostrophes can break words apart
We use spaces as word delimiters, not _ or -: thank you instead of thank_you
Respect the number of columns in every row you add. Use empty quotation marks "" when you have no extra variations
In every row, always start with the form of the word that is (in your opinion, of course) the most used
Because this dataset may later be used to train deep neural networks, try to reserve each row for similar variations of the same word. For instance, "sou9" and "marchi" both translate to "market", yet it's better to separate them into two rows:

"sou9","souk","souq","market"

"marchi","","","market"

verbs.csv: The darija translation is reserved for the past tense of the third-person masculine singular ("he"), whereas the other pronouns and tenses are handled in separate files. The English translation presents the base form (or root) of the English verb.

"ghnna","ghenna","ghanna","","","","sing"

masculine_feminine_plural.csv: The feminine-plural column, when a feminine plural exists, is for nouns. For adjectives, the feminine plural is the same as the feminine form.
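
Before submitting rows, the CSV guidelines above could be checked with a short script along these lines. This is a minimal sketch using Python's standard `csv` module, not an official DODa validator: the helper names `check_row` and `format_row` are hypothetical, and treating every hyphen or underscore in a cell as a violation is an assumption that may be stricter than the project intends.

```python
import csv
import io

# Characters the guidelines ask contributors to avoid (assumed strict reading).
FORBIDDEN = ("'", "_", "-")


def check_row(row, expected_columns):
    """Return a list of problems found in one parsed CSV row."""
    problems = []
    if len(row) != expected_columns:
        problems.append(f"expected {expected_columns} columns, got {len(row)}")
    for cell in row:
        for char in FORBIDDEN:
            if char in cell:
                problems.append(f"{cell!r} contains {char!r}")
    if row and not row[0]:
        problems.append("first column (most used form) is empty")
    return problems


def format_row(cells):
    """Serialize a row with every field quoted, so empty cells become ""."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(cells)
    return buffer.getvalue().rstrip("\n")


# The two example rows from the guidelines, parsed and checked.
sample = '"sou9","souk","souq","market"\n"marchi","","","market"\n'
rows = list(csv.reader(io.StringIO(sample)))
for row in rows:
    print(format_row(row), check_row(row, expected_columns=4))
```

The expected column count differs per file (the verbs.csv example above has seven fields), so it is passed in as a parameter rather than hard-coded.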

Citation

@misc{outchakoucht2021moroccan,
      title={Moroccan Dialect -Darija- Open Dataset},
      author={Aissam Outchakoucht and Hamza Es-Samaali},
      year={2021},
      eprint={2103.09687},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
