Official source for Spanish language models and resources developed at BSC-TeMU within the "Plan de las Tecnologías del Lenguaje" (Plan-TL).

Last update: Dec 20, 2022

Overview

Spanish Language Models 💃🏻

Corpora 📃

Corpus	Number of documents	Size (GB)
BNE	201,080,084	570

Models 🤖

RoBERTa-base BNE: https://huggingface.co/BSC-TeMU/roberta-base-bne
RoBERTa-large BNE: https://huggingface.co/BSC-TeMU/roberta-large-bne
Other models: (WIP)

Word embeddings 🔤

300-dimensional word embeddings trained with FastText:

CBOW Word embeddings: https://zenodo.org/record/5044988
Skip-gram Word embeddings: https://zenodo.org/record/5046525
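
The sketch below shows one way to read such embeddings and compare two words by cosine similarity. It assumes the downloaded archive contains vectors in FastText's text (.vec) layout, where the first line holds the vocabulary size and dimension and each following line holds a word and its values; the Zenodo records should be checked for the actual file format. The function names load_vec and cosine and the three-word sample are illustrative only.

```python
import io
import math


def load_vec(handle, limit=None):
    """Read vectors in FastText's text (.vec) layout from an open text handle."""
    header = handle.readline().split()
    dim = int(header[1])  # header line is "<vocabulary size> <dimension>"
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if len(values) != dim:
            continue  # skip malformed lines
        vectors[word] = [float(v) for v in values]
        if limit is not None and len(vectors) >= limit:
            break
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Tiny made-up 3-dimensional sample standing in for a real 300d file.
sample = io.StringIO("3 3\nhola 1.0 0.0 0.0\nadiós 0.8 0.6 0.0\ngato 0.0 0.0 1.0\n")
vectors = load_vec(sample)
print(cosine(vectors["hola"], vectors["adiós"]))
print(cosine(vectors["hola"], vectors["gato"]))
```

With a real file, the same function can be called on open(path, encoding="utf-8"), where path is wherever the archive was extracted; passing limit keeps only the first entries and reduces memory use. If the files turn out to be in FastText's binary format, this reader does not apply.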

Evaluation ✅

Dataset	Metric	RoBERTa-b	RoBERTa-l	BETO	mBERT	BERTIN
UD-POS	F1	0.9907	0.9901	0.9900	0.9886	0.9904
CoNLL-NER	F1	0.8851	0.8772	0.8759	0.8691	0.8627
Capitel-POS	F1	0.9846	0.9851	0.9836	0.9839	0.9826
Capitel-NER	F1	0.8959	0.8998	0.8771	0.8810	0.8741
STS	Combined	0.8423	0.8420	0.8216	0.8249	0.7822
MLDoc	Accuracy	0.9595	0.9600	0.9650	0.9560	0.9673
PAWS-X	F1	0.9035	0.9000	0.8915	0.9020	0.8820
XNLI	Accuracy	0.8016	WIP	0.8130	0.7876	WIP
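
As a reading aid, the following sketch picks the highest-scoring model per dataset from the table above. The scores are copied from the table; the names MODELS, SCORES, and best_model are illustrative, and work-in-progress entries are represented as None and skipped.

```python
# Scores copied from the evaluation table; None marks a work-in-progress entry.
MODELS = ["RoBERTa-b", "RoBERTa-l", "BETO", "mBERT", "BERTIN"]
SCORES = {
    "UD-POS": [0.9907, 0.9901, 0.9900, 0.9886, 0.9904],
    "CoNLL-NER": [0.8851, 0.8772, 0.8759, 0.8691, 0.8627],
    "Capitel-POS": [0.9846, 0.9851, 0.9836, 0.9839, 0.9826],
    "Capitel-NER": [0.8959, 0.8998, 0.8771, 0.8810, 0.8741],
    "STS": [0.8423, 0.8420, 0.8216, 0.8249, 0.7822],
    "MLDoc": [0.9595, 0.9600, 0.9650, 0.9560, 0.9673],
    "PAWS-X": [0.9035, 0.9000, 0.8915, 0.9020, 0.8820],
    "XNLI": [0.8016, None, 0.8130, 0.7876, None],
}


def best_model(row):
    """Return the name of the model with the highest available score."""
    candidates = [(score, name) for score, name in zip(row, MODELS) if score is not None]
    return max(candidates)[1]


best = {dataset: best_model(row) for dataset, row in SCORES.items()}
for dataset, name in best.items():
    print(f"{dataset}: {name}")
```

Many of the margins are in the third or fourth decimal place, so the ranking is best read as indicative rather than as a statement of significant differences.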

Usage example ⚗️

For RoBERTa-base:

from pprint import pprint

from transformers import AutoModelForMaskedLM, AutoTokenizer, FillMaskPipeline

# Load the tokenizer and masked-language model from the Hugging Face Hub.
tokenizer_hf = AutoTokenizer.from_pretrained('BSC-TeMU/roberta-base-bne')
model = AutoModelForMaskedLM.from_pretrained('BSC-TeMU/roberta-base-bne')
model.eval()  # inference mode: disables dropout

pipeline = FillMaskPipeline(model, tokenizer_hf)
text = "¡Hola <mask>!"
res_hf = pipeline(text)
# Print the candidate tokens predicted for the <mask> position.
pprint([r['token_str'] for r in res_hf])

For RoBERTa-large:

from pprint import pprint

from transformers import AutoModelForMaskedLM, AutoTokenizer, FillMaskPipeline

# Load the tokenizer and masked-language model from the Hugging Face Hub.
tokenizer_hf = AutoTokenizer.from_pretrained('BSC-TeMU/roberta-large-bne')
model = AutoModelForMaskedLM.from_pretrained('BSC-TeMU/roberta-large-bne')
model.eval()  # inference mode: disables dropout

pipeline = FillMaskPipeline(model, tokenizer_hf)
text = "¡Hola <mask>!"
res_hf = pipeline(text)
# Print the candidate tokens predicted for the <mask> position.
pprint([r['token_str'] for r in res_hf])
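
Since the two snippets above differ only in the model name, a small wrapper can cover both. This is a minimal sketch, not part of the original repository: load_fill_mask and top_tokens are hypothetical helper names, and the sample results use made-up scores purely to show the shape of fill-mask output (a list of dicts with score, token_str, and sequence keys).

```python
def load_fill_mask(model_name):
    """Build a fill-mask pipeline for the given Hub model (downloads weights)."""
    from transformers import pipeline  # imported lazily: heavy dependency
    return pipeline("fill-mask", model=model_name)


def top_tokens(results, k=3):
    """Return the k highest-scoring (token, score) pairs from fill-mask output."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [(r["token_str"].strip(), round(r["score"], 4)) for r in ranked[:k]]


# Placeholder results with made-up scores, shaped like fill-mask output;
# they are NOT real predictions from either model.
fake_results = [
    {"score": 0.10, "token_str": " amigo", "sequence": "¡Hola amigo!"},
    {"score": 0.25, "token_str": " mundo", "sequence": "¡Hola mundo!"},
]
print(top_tokens(fake_results, k=1))
```

With the models available, top_tokens(load_fill_mask('BSC-TeMU/roberta-base-bne')("¡Hola <mask>!")) should return the leading candidates for the base model; swapping in 'BSC-TeMU/roberta-large-bne' does the same for the large one.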

Other Spanish Language Models 👩‍👧‍👦

We are developing domain-specific language models:

Legal Language Model

Cite 📣

@misc{gutierrezfandino2021spanish,
      title={Spanish Language Models}, 
      author={Asier Gutiérrez-Fandiño and Jordi Armengol-Estapé and Marc Pàmies and Joan Llop-Palao and Joaquín Silveira-Ocampo and Casimiro Pio Carrino and Aitor Gonzalez-Agirre and Carme Armentano-Oller and Carlos Rodriguez-Penagos and Marta Villegas},
      year={2021},
      eprint={2107.07253},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contact 📧

📋 We are interested in (1) extending our corpora to train larger models and (2) training and evaluating the models on other tasks.

For questions regarding this work, contact Asier Gutiérrez-Fandiño ([email protected])


