TunBERT is the first release of a pre-trained BERT model for the Tunisian dialect using a Tunisian Common-Crawl-based dataset.

Overview

What is TunBERT?

People in Tunisia use the Tunisian dialect in their daily communications, in most of their media (TV, radio, songs, etc), and on the internet (social media, forums). Yet, this dialect is not standardized which means there is no unique way for writing and speaking it. Added to that, it has its proper lexicon, phonetics, and morphological structures. The need for a robust language model for the Tunisian dialect has become crucial in order to develop NLP-based applications (translation, information retrieval, sentiment analysis, etc).

BERT (Bidirectional Encoder Representations from Transformers) is a method to pre-train general purpose natural language models in an unsupervised fashion and then fine-tune them on specific downstream tasks with labelled datasets. This method was first implemented by Google and gives state-of-the-art results on many tasks as it's the first deeply bidirectional NLP pre-training system.

TunBERT is the first release of a pre-trained BERT model for the Tunisian dialect using a Tunisian Common-Crawl-based dataset. TunBERT was applied to three NLP downstream tasks: Sentiment Analysis (SA), Tunisian Dialect Identification (TDI) and Reading Comprehension Question-Answering (RCQA).

What has been released in this repository?

This repository includes the code for fine-tuning TunBERT on the three downstream tasks: Sentiment Analysis (SA), Tunisian Dialect Identification (TDI) and Reading Comprehension Question-Answering (RCQA). This will help the community reproduce our work and collaborate continuously. We also released the two pre-trained new models: TunBERT Pytorch and TunBERT TensorFlow. Finally, we open source the fine-tuning datasets used for Tunisian Dialect Identification (TDI) and Reading Comprehension Question-Answering (RCQA)

About the Pre-trained models

TunBERT Pytorch model is based on BERT’s Pytorch implementation from NVIDIA NeMo. The model was pre-trained using 4 NVIDIA Tesla V100 GPUs on a dataset of 500k Tunisian social media comments written in Arabic letters. The pretrained model consists of 12 layers of self-attention modules. Each module is made with 12 heads of self-attention with 768 hidden-size. Furthermore, an Adam optimizer was used, with a learning rate of 1e-4, a batch size of 128, a maximum sequence length of 128 and a masking probability of 15%. Cosine annealing was used for a learning rate scheduling with a warm-up ratio of 0.01.

Similarly, a second TunBERT TensorFlow model was trained using TensorFlow implementation from Google. We use the same compute power for pre-training this model (4 NVIDIA Tesla V100 GPUs) while keeping the same hyper-parameters: A learning rate of 1e-4, a batch size of 128 and a maximum sequence length of 128.

The two models are available for download through:

For TunBERT PyTorch:

For TunBERT TensorFlow:

About the Finetuning datasets

Tunisian Sentiment Analysis

  • Tunisian Sentiment Analysis Corpus (TSAC) obtained from Facebook comments about popular TV shows. The TSAC dataset contains both Arabic and latin characters. Hence, we used only Arabic comments.

Dataset link: TSAC

Reference : Salima Medhaffar, Fethi Bougares, Yannick Estève and Lamia Hadrich-Belguith. Sentiment analysis of Tunisian dialects: Linguistic Resources and Experiments. WANLP 2017. EACL 2017

  • Tunisian Election Corpus (TEC) obtained from tweets about Tunisian elections in 2014.

Dataset link: TEC

Reference: Karim Sayadi, Marcus Liwicki, Rolf Ingold, Marc Bui. Tunisian Dialect and Modern Standard Arabic Dataset for Sentiment Analysis : Tunisian Election Context, IEEE-CICLing (Computational Linguistics and Intelligent Text Processing) Intl. conference, Konya, Turkey, 7-8 Avril 2016.

Tunisian Dialect Identification

Tunisian Arabic Dialects Identification(TADI): It is a binary classification task consisting of classifying Tunisian dialect and Non Tunisian dialect from an Arabic dialectical dataset.

Tunisian Algerian Dialect(TAD): It is a binary classification task consisting of classifying Tunisian dialect and Algerian dialect from an Arabic dialectical dataset.

The two datasets are available for download for research purposes:

TADI:

TAD:

Reading Comprehension Question-Answering

For this task, we built TRCD (Tunisian Reading Comprehension Dataset) as a Question-Answering dataset for Tunisian dialect. We used a dialectal version of the Tunisian constitution following the guideline in this article. It is composed of 144 documents where each document has exactly 3 paragraphs and three Question-Answer pairs are assigned to each paragraph. Questions were formulated by four Tunisian native speaker annotators and each question should be paired with a paragraph.

We made the dataset publicly available for research purposes:

TRCD:

Install

We use:

  • conda to setup our environment,
  • and python 3.7.9

Setup our environment:

# Clone the repo
git clone https://github.com/instadeepai/tunbert.git
cd tunbert

# Create a conda env
conda env create -f environment_torch.yml #bert-nvidia
conda env create -f environment_tf2.yml #bert-google

# Activate conda env
conda activate tunbert-torch #bert-nvidia
conda activate tf2-google #bert-google

# Install pre-commit hooks
pre-commit install

# Run all pre-commit checks (without committing anything)
pre-commit run --all-files

Project Structure

This is the folder structure of the project:

README.md             # This file :)
.gitlab-ci.yml        # CI with gitlab
.gitlab/              # Gitlab specific 
.pre-commit-config.yml  # The checks to run before every commit
environment_torch.yml       # contains the conda environment definition 
environment_tf2.yml       # contains the conda environment definition for pre-training bert-google
...

dev-data/             # data sample
    sentiment_analysis_tsac/
    dialect_classification_tadi/
    question_answering_trcd/

models/               # contains the different models to used 
    bert-google/
    bert-nvidia/

TunBERT-PyTorch

Fine-tune TunBERT-PyTorch on the Sentiment Analysis (SA) task

To fine-tune TunBERT-PyTorch on the SA task, you need to:

  • Run the following command-line:
python models/bert-nvidia/bert_finetuning_SA_DC.py --config-name "sentiment_analysis_config" model.language_model.lm_checkpoint="/path/to/checkpoints/PretrainingBERTFromText--end.ckpt" model.train_ds.file_path="/path/to/train.tsv" model.validation_ds.file_path="/path/to/valid.tsv" model.test_ds.file_path="/path/to/test.tsv"

Fine-tune TunBERT-PyTorch on the Dialect Classification (DC) task

To fine-tune TunBERT-PyTorch on the DC task, you need to:

  • Run the following command-line:
python models/bert-nvidia/bert_finetuning_SA_DC.py --config-name "dialect_classification_config" model.language_model.lm_checkpoint="/path/to/checkpoints/PretrainingBERTFromText--end.ckpt" model.train_ds.file_path="/path/to/train.tsv" model.validation_ds.file_path="/path/to/valid.tsv" model.test_ds.file_path="/path/to/test.tsv"

Fine-tune TunBERT-PyTorch on the Question Answering (QA) task

To fine-tune TunBERT-PyTorch on the QA task, you need to:

  • Run the following command-line:
python models/bert-nvidia/bert_finetuning_QA.py --config-name "question_answering_config" model.language_model.lm_checkpoint="/path/to/checkpoints/PretrainingBERTFromText--end.ckpt" model.train_ds.file="/path/to/train.json" model.validation_ds.file="/path/to/val.json" model.test_ds.file="/path/to/test.json"

TunBERT-TensorFlow

Fine-tune TunBERT-TensorFlow on the Sentiment Analysis (SA) or Dialect Classification (DC) Task:

To fine-tune TunBERT-TensorFlow for a SA task or, you need to:

  • Specify the BERT_FOLDER_NAME in models/bert-google/finetuning_sa_tdid.sh.

    BERT_FOLDER_NAME should contain the config and vocab files and the checkpoint of your language model

  • Specify the DATA_FOLDER_NAME in models/bert-google/finetuning_sa_tdid.sh

  • Run:

bash models/bert-google/finetuning_sa_tdid.sh

Fine-tune TunBERT-TensorFlow on the Question Answering (QA) Task:

To fine-tune TunBERT-TensorFlow for a QA task, you need to:

  • Specify the BERT_FOLDER_NAME in models/bert-google/finetuning_squad.sh.

    BERT_FOLDER_NAME should contain the config and vocab files and the checkpoint of your language model

  • Specify the DATA_FOLDER_NAME in models/bert-google/finetuning_squad.sh

  • Run:

bash models/bert-google/finetuning_squad.sh

You can view the results, by launching tensorboard from your logging directory.

e.g. tensorboard --logdir=OUTPUT__FOLDER_NAME

Contact information

InstaDeep

iCompass

Owner
InstaDeep Ltd
InstaDeep offers a host of Enterprise AI products, ranging from GPU-accelerated insights to self-learning decision making systems.
InstaDeep Ltd
A simple Speech Emotion Recognition (SER) API created using Flask and running in a Docker container.

keyword_searching Steps to use this Python scripts: (1)Paste this script into the file folder containing the PDF files you need to search from; (2)Thi

2 Nov 11, 2022
A Persian Image Captioning model based on Vision Encoder Decoder Models of the transformers🤗.

Persian-Image-Captioning We fine-tuning the Vision Encoder Decoder Model for the task of image captioning on the coco-flickr-farsi dataset. The implem

Hamtech-ai 15 Aug 25, 2022
DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism (SVS & TTS); AAAI 2022

DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism This repository is the official PyTorch implementation of our AAAI-2022 paper, in

Jinglin Liu 829 Jan 07, 2023
Chatbot with Pytorch, Python & Nextjs

Installation Instructions Make sure that you have Python 3, gcc, venv, and pip installed. Clone the repository $ git clone https://github.com/sahr

Rohit Sah 0 Dec 11, 2022
Mlcode - Continuous ML API Integrations

mlcode Basic APIs for ML applications. Django REST Application Contains REST API

Sujith S 1 Jan 01, 2022
OceanScript is an Esoteric language used to encode and decode text into a formulation of characters

OceanScript is an Esoteric language used to encode and decode text into a formulation of characters - where the final result looks like waves in the ocean.

⚡ Automatically decrypt encryptions without knowing the key or cipher, decode encodings, and crack hashes ⚡

Translations 🇩🇪 DE 🇫🇷 FR 🇭🇺 HU 🇮🇩 ID 🇮🇹 IT 🇳🇱 NL 🇧🇷 PT-BR 🇷🇺 RU 🇨🇳 ZH ➡️ Documentation | Discord | Installation Guide ⬅️ Fully autom

11.2k Jan 05, 2023
SIGIR'22 paper: Axiomatically Regularized Pre-training for Ad hoc Search

Introduction This codebase contains source-code of the Python-based implementation (ARES) of our SIGIR 2022 paper. Chen, Jia, et al. "Axiomatically Re

Jia Chen 17 Nov 09, 2022
Indonesia spellchecker with python

indonesia-spellchecker Ganti kata yang terdapat pada file teks.txt untuk diperiksa kebenaran kata. Run on local machine python3 main.py

Rahmat Agung Julians 1 Sep 14, 2022
Extract rooms type, door, neibour rooms, rooms corners nad bounding boxes, and generate graph from rplan dataset

Housegan-data-reader House-GAN++ (data-reader) Code and instructions for converting rplan dataset (raster images) to housegan++ data format. House-GAN

Sepid Hosseini 13 Nov 24, 2022
Yes it's true :broken_heart:

Information WARNING: No longer hosted If you would like to be on this repo's readme simply fork or star it! Forks 1 - Flowzii 2 - Errorcrafter 3 - vk-

Dropout 66 Dec 31, 2022
Unet-TTS: Improving Unseen Speaker and Style Transfer in One-shot Voice Cloning

Unet-TTS: Improving Unseen Speaker and Style Transfer in One-shot Voice Cloning English | 中文 ❗ Now we provide inferencing code and pre-training models

164 Jan 02, 2023
Big Bird: Transformers for Longer Sequences

BigBird, is a sparse-attention based transformer which extends Transformer based models, such as BERT to much longer sequences. Moreover, BigBird comes along with a theoretical understanding of the c

Google Research 457 Dec 23, 2022
A spaCy wrapper of OpenTapioca for named entity linking on Wikidata

spaCyOpenTapioca A spaCy wrapper of OpenTapioca for named entity linking on Wikidata. Table of contents Installation How to use Local OpenTapioca Vizu

Universitätsbibliothek Mannheim 80 Jan 03, 2023
History Aware Multimodal Transformer for Vision-and-Language Navigation

History Aware Multimodal Transformer for Vision-and-Language Navigation This repository is the official implementation of History Aware Multimodal Tra

Shizhe Chen 46 Nov 23, 2022
BiQE: Code and dataset for the BiQE paper

BiQE: Bidirectional Query Embedding This repository includes code for BiQE and the datasets introduced in Answering Complex Queries in Knowledge Graph

Bhushan Kotnis 1 Oct 20, 2021
A Streamlit web app that generates Rick and Morty stories using GPT2.

Rick and Morty Story Generator This project uses a pre-trained GPT2 model, which was fine-tuned on Rick and Morty transcripts, to generate new stories

₸ornike 33 Oct 13, 2022
Named Entity Recognition API used by TEI Publisher

TEI Publisher Named Entity Recognition API This repository contains the API used by TEI Publisher's web-annotation editor to detect entities in the in

e-editiones.org 14 Nov 15, 2022
Python library for interactive topic model visualization. Port of the R LDAvis package.

pyLDAvis Python library for interactive topic model visualization. This is a port of the fabulous R package by Carson Sievert and Kenny Shirley. pyLDA

Ben Mabey 1.7k Dec 20, 2022
Quantifiers and Negations in RE Documents

Quantifiers-and-Negations-in-RE-Documents This project was part of my work for a

Nicolas Ruscher 1 Feb 01, 2022