Language Models for the legal domain in Spanish done @ BSC-TEMU within the "Plan de las Tecnologías del Lenguaje" (Plan-TL).

Last update: Nov 14, 2022

Overview

Spanish legal domain Language Model ⚖️

This repository hosts the landing page for three resources for the Spanish legal domain:

A RoBERTa model: https://huggingface.co/PlanTL-GOB-ES/RoBERTalex
FastText embeddings: https://zenodo.org/record/5036147
Legal corpora: https://zenodo.org/record/5495529

The repository and the pre-print will be updated with larger models, evaluations, and other additions.

Why ❓

Few models have been trained for the Spanish language, and some of them were trained on low-resource, unclean corpora. The models derived from the Spanish National Plan for Language Technologies perform well on several tasks and were trained on large-scale, clean corpora. However, Spanish legal language is distinct enough that it can be thought of as a language of its own. We therefore trained a Spanish legal model from scratch, exclusively on legal corpora.

Evaluation ✅

Work in progress.

Corpora 📃

Corpus name	Size (GB)	Tokens (M)
Procesos Penales	0.625	0.119
JRC Acquis	0.345	59.359
Códigos Electrónicos Universitarios	0.077	11.835
Códigos Electrónicos	0.080	12.237
Doctrina de la Fiscalía General del Estado	0.017	2.669
Legislación BOE	3.600	578.685
Abogacía del Estado BOE	0.037	6.123
Consejo de Estado: Dictámenes	0.827	135.348
Spanish EURLEX	0.001	0.072
UN Resolutions	0.023	3.539
Spanish DOGC	0.826	132.569
Spanish MultiUN	2.200	352.653
Consultas Tributarias Generales y Vinculantes	0.466	77.691
Constitución Española	0.002	0.018
COPPA Patents Corpus	0.002	-
Biomedical Patents	0.083	-
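
The table above can be aggregated with a few lines of Python. This is only a sketch for readers who want totals: the rows are copied from the table, the variable names are illustrative, and the two patents corpora are left out of the token total because the table gives no token count for them.

```python
# Rows copied from the corpora table: (name, size in GB, tokens in millions).
# None marks corpora for which the table lists no token count ("-").
CORPORA = [
    ("Procesos Penales", 0.625, 0.119),
    ("JRC Acquis", 0.345, 59.359),
    ("Códigos Electrónicos Universitarios", 0.077, 11.835),
    ("Códigos Electrónicos", 0.080, 12.237),
    ("Doctrina de la Fiscalía General del Estado", 0.017, 2.669),
    ("Legislación BOE", 3.600, 578.685),
    ("Abogacía del Estado BOE", 0.037, 6.123),
    ("Consejo de Estado: Dictámenes", 0.827, 135.348),
    ("Spanish EURLEX", 0.001, 0.072),
    ("UN Resolutions", 0.023, 3.539),
    ("Spanish DOGC", 0.826, 132.569),
    ("Spanish MultiUN", 2.200, 352.653),
    ("Consultas Tributarias Generales y Vinculantes", 0.466, 77.691),
    ("Constitución Española", 0.002, 0.018),
    ("COPPA Patents Corpus", 0.002, None),
    ("Biomedical Patents", 0.083, None),
]

# Total size over all corpora.
total_gb = sum(size for _, size, _ in CORPORA)

# Total tokens over the corpora that report a token count.
total_mtokens = sum(tokens for _, _, tokens in CORPORA if tokens is not None)

# Largest corpus by size on disk.
largest = max(CORPORA, key=lambda row: row[1])

print(f"Total size: {total_gb:.3f} GB")
print(f"Total tokens (where reported): {total_mtokens:.3f} M")
print(f"Largest corpus: {largest[0]} ({largest[1]:.3f} GB)")
```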

Usage example ⚗️

You can fine-tune the model for different downstream tasks using the scripts that Hugging Face provides (Named Entity Recognition, GLUE tasks, and others). The example below loads the model and fills in a masked token:

from pprint import pprint

from transformers import AutoModelForMaskedLM, AutoTokenizer, FillMaskPipeline

# Load the tokenizer and the masked-language model from the Hugging Face Hub.
tokenizer_hf = AutoTokenizer.from_pretrained('PlanTL-GOB-ES/RoBERTalex')
model = AutoModelForMaskedLM.from_pretrained('PlanTL-GOB-ES/RoBERTalex')
model.eval()  # inference mode: disables dropout

# Build a fill-mask pipeline and predict the masked token.
pipeline = FillMaskPipeline(model, tokenizer_hf)
text = "¡Hola <mask>!"
res_hf = pipeline(text)

# Print the predicted candidate tokens.
pprint([r['token_str'] for r in res_hf])
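
For downstream fine-tuning, the same checkpoint can be loaded with a task-specific head. The following is a minimal sketch for token classification (for example, Named Entity Recognition), not the authors' training setup: the tag set `EXAMPLE_LABELS` and the helper names are hypothetical, and the classification head is newly initialized, so it has to be trained on labelled data before its predictions mean anything.

```python
from typing import Dict, List, Tuple

MODEL_ID = "PlanTL-GOB-ES/RoBERTalex"

# Hypothetical BIO tag set, for illustration only; use the labels of your dataset.
EXAMPLE_LABELS = ["O", "B-LAW", "I-LAW", "B-ORG", "I-ORG"]


def build_label_maps(labels: List[str]) -> Tuple[Dict[int, str], Dict[str, int]]:
    """Return the id->label and label->id mappings stored in the model config."""
    id2label = dict(enumerate(labels))
    label2id = {label: idx for idx, label in id2label.items()}
    return id2label, label2id


def load_for_token_classification(labels: List[str]):
    """Load the tokenizer and the model with an untrained token-classification head."""
    # Imported lazily so the label helpers work without transformers installed.
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    id2label, label2id = build_label_maps(labels)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForTokenClassification.from_pretrained(
        MODEL_ID,
        num_labels=len(labels),
        id2label=id2label,
        label2id=label2id,
    )
    return tokenizer, model
```

Calling `load_for_token_classification(EXAMPLE_LABELS)` downloads the checkpoint on first use and returns objects that can be passed to the Hugging Face token-classification example script or to a custom training loop.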

Cite 📣

If this work is helpful, please cite it:

@misc{gutierrezfandino2021legal,
      title={Spanish Legalese Language Model and Corpora}, 
      author={Asier Gutiérrez-Fandiño and Jordi Armengol-Estapé and Aitor Gonzalez-Agirre and Marta Villegas},
      year={2021},
      eprint={2110.12201},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contact 📧

📋 We are interested in (1) extending our corpora to build larger models and (2) evaluating and training the model on other tasks.

For questions regarding this work, contact Asier Gutiérrez-Fandiño ([email protected])
