A pre-trained language model for social media text in Spanish

Last update: Dec 29, 2022

Related tags

Overview

RoBERTuito

A pre-trained language model for social media text in Spanish

READ THE FULL PAPER Github Repository

RoBERTuito is a pre-trained language model for user-generated content in Spanish, trained following RoBERTa guidelines on 500 million tweets. RoBERTuito comes in 3 flavors: cased, uncased, and uncased+deaccented.

We tested RoBERTuito on a benchmark of tasks involving user-generated text in Spanish. It outperforms other pre-trained language models for this language such as BETO, BERTin and RoBERTa-BNE. The 4 tasks selected for evaluation were: Hate Speech Detection (using SemEval 2019 Task 5, HatEval dataset), Sentiment and Emotion Analysis (using TASS 2020 datasets), and Irony detection (using IrosVa 2019 dataset).

model	hate speech	sentiment analysis	emotion analysis	irony detection	score
robertuito-uncased	0.801 ± 0.010	0.707 ± 0.004	0.551 ± 0.011	0.736 ± 0.008	0.699
robertuito-deacc	0.798 ± 0.008	0.702 ± 0.004	0.543 ± 0.015	0.740 ± 0.006	0.696
robertuito-cased	0.790 ± 0.012	0.701 ± 0.012	0.519 ± 0.032	0.719 ± 0.023	0.682
roberta-bne	0.766 ± 0.015	0.669 ± 0.006	0.533 ± 0.011	0.723 ± 0.017	0.673
bertin	0.767 ± 0.005	0.665 ± 0.003	0.518 ± 0.012	0.716 ± 0.008	0.667
beto-cased	0.768 ± 0.012	0.665 ± 0.004	0.521 ± 0.012	0.706 ± 0.007	0.665
beto-uncased	0.757 ± 0.012	0.649 ± 0.005	0.521 ± 0.006	0.702 ± 0.008	0.657

We release the pre-trained models on huggingface model hub:

Usage

IMPORTANT -- READ THIS FIRST

RoBERTuito is not yet fully-integrated into huggingface/transformers. To use it, first install pysentimiento

pip install pysentimiento

and preprocess text using pysentimiento.preprocessing.preprocess_tweet before feeding it into the tokenizer

','▁Esto','▁es','▁un','▁tweet','▁estoy','▁usando','▁','▁hashtag','▁','▁ro','bert','uito','▁@usuario','▁','▁emoji','▁cara','▁revolviéndose','▁de','▁la','▁risa','▁emoji',''] ">

from transformers import AutoTokenizer
from pysentimiento.preprocessing import preprocess_tweet

tokenizer = AutoTokenizer.from_pretrained('pysentimiento/robertuito-base-cased')

text = "Esto es un tweet estoy usando #Robertuito @pysentimiento 🤣"
preprocessed_text = preprocess_tweet(text, ha)

tokenizer.tokenize(preprocessed_text)
# ['','▁Esto','▁es','▁un','▁tweet','▁estoy','▁usando','▁','▁hashtag','▁','▁ro','bert','uito','▁@usuario','▁','▁emoji','▁cara','▁revolviéndose','▁de','▁la','▁risa','▁emoji','']

We are working on integrating this preprocessing step into a Tokenizer within transformers library

Development

Installing

We use python==3.7 and poetry to manage dependencies.

pip install poetry
poetry install

Benchmarking

To run benchmarks

python bin/run_benchmark.py <model_name> --times 5 --output_path <output_path>

Check RUN_BENCHMARKS for all experiments

Smoke test

Test the benchmark running

./smoke_test.sh

Citation

If you use RoBERTuito, please cite our paper:

@misc{perez2021robertuito,
      title={RoBERTuito: a pre-trained language model for social media text in Spanish},
      author={Juan Manuel Pérez and Damián A. Furman and Laura Alonso Alemany and Franco Luque},
      year={2021},
      eprint={2111.09453},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

A pre-trained language model for social media text in Spanish

Related tags

Overview

RoBERTuito

A pre-trained language model for social media text in Spanish

Usage

Development

Installing

Benchmarking

Smoke test

Citation

Owner

GenshinMapAutoMarkTools - Tools To add/delete/refresh resources mark in Genshin Impact Map

A Python library for unevenly-spaced time series analysis

Aggragrating Nested Transformer Official Jax Implementation

[NeurIPS 2020] Semi-Supervision (Unlabeled Data) & Self-Supervision Improve Class-Imbalanced / Long-Tailed Learning

Official implementation for CVPR 2021 paper: Adaptive Class Suppression Loss for Long-Tail Object Detection

Many Class Activation Map methods implemented in Pytorch for CNNs and Vision Transformers. Including Grad-CAM, Grad-CAM++, Score-CAM, Ablation-CAM and XGrad-CAM

Transformer in Vision

This repository contains the implementation of the paper: "Towards Frequency-Based Explanation for Robust CNN"

Automatic learning-rate scheduler

UMich 500-Level Mobile Robotics Course

Seasonal Contrast: Unsupervised Pre-Training from Uncurated Remote Sensing Data

LTR_CrossEncoder: Legal Text Retrieval Zalo AI Challenge 2021

Unifying Global-Local Representations in Salient Object Detection with Transformer

Pytorch Implementation of DiffSinger: Diffusion Acoustic Model for Singing Voice Synthesis (TTS Extension)

Lightweight plotting to the terminal. 4x resolution via Unicode.

My implementation of Image Inpainting - A deep learning Inpainting model

Jarvis Project is a basic virtual assistant that uses TensorFlow for learning.

Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. It can use GPUs and perform efficient symbolic differentiation.

This repository provides the official code for GeNER (an automated dataset Generation framework for NER).

A PyTorch implementation of "TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?"