Pre-training BERT Masked Language Models (MLM)

This repository contains code to pre-train a BERT model with a custom vocabulary. It was used to pre-train JuriBERT, presented in [https://arxiv.org/abs/2110.01485].

It also contains the code for the classification task that was used to evaluate JuriBERT.

Our models can be found at [http://master2-bigdata.polytechnique.fr/FrenchLinguisticResources/resources#juribert] and downloaded upon request.

Instructions

To pre-train a new BERT model, you need the path to a dataset containing raw text. You can optionally specify an existing tokenizer for the model. Paths for saving the model and the checkpoints are required.

python pretrain.py \
      --files /path/to/text \
      --model_path /path/to/save/model \
      --checkpoint /path/to/save/checkpoints \
      --epochs 30 \
      --hidden_layers 2 \
      --hidden_size 128 \
      --attention_heads 2 \
      --save_steps 10 \
      --save_limit 0 \
      --min_freq 0
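
For readers new to masked language modelling, the sketch below illustrates the token corruption step that MLM pre-training relies on. It uses the 15% selection rate and the 80/10/10 replacement split described in the original BERT paper. It is not taken from pretrain.py, which may delegate this step to a library; the function name mask_tokens, the toy sentence and the "<mask>" default are assumptions made for illustration.

```python
import random

def mask_tokens(tokens, vocab, mask_token="<mask>", mask_prob=0.15, seed=None):
    """Return (corrupted_tokens, labels) following the BERT masking recipe."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must predict the original token here
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(mask_token)         # 80%: replace with the mask token
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                corrupted.append(tok)                # 10%: keep the token unchanged
        else:
            labels.append(None)  # position is ignored by the loss
            corrupted.append(tok)
    return corrupted, labels

# Toy input: a whitespace-split sentence that also serves as the vocabulary (illustrative only)
tokens = "le tribunal rejette la demande du requérant".split()
corrupted, labels = mask_tokens(tokens, vocab=tokens, seed=0)
print(corrupted)
print(labels)
```

Positions whose label is None are left untouched and do not contribute to the training loss; only the selected positions are predicted.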

To fine-tune on a classification task, you need the path to the pre-trained model and a CSV file containing the classification dataset. You must specify the columns containing the category and the text, as well as the paths for saving the final model and the checkpoints.

python classification.py \
  --model "custom" \
  --pretrained_path /path/to/model.bin \
  --tokenizer_path /path/to/tokenizer.json \
  --data /path/to/data.csv \
  --category "category-column" \
  --text "text-column" \
  --model_path /path/to/save/model \
  --checkpoint /path/to/save/checkpoints
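
As a rough illustration of the expected CSV layout, the sketch below reads a text column and a category column and maps each distinct category to an integer label. It is not the code in classification.py: the helper load_classification_rows is hypothetical, the column names are the placeholders from the command above, and the three rows are invented sample data.

```python
import csv
import io

def load_classification_rows(csv_file, text_column, category_column):
    """Read texts and categories from a CSV and encode categories as integer ids."""
    reader = csv.DictReader(csv_file)
    texts, categories = [], []
    for row in reader:
        texts.append(row[text_column])
        categories.append(row[category_column])
    # Sort the distinct categories so the id assignment is reproducible
    label2id = {name: i for i, name in enumerate(sorted(set(categories)))}
    labels = [label2id[c] for c in categories]
    return texts, labels, label2id

# In-memory stand-in for /path/to/data.csv (invented rows, for illustration only)
sample = io.StringIO(
    "text-column,category-column\n"
    "Pourvoi rejeté,civil\n"
    "Cassation partielle,commercial\n"
    "Demande irrecevable,civil\n"
)
texts, labels, label2id = load_classification_rows(
    sample, "text-column", "category-column"
)
print(label2id)  # {'civil': 0, 'commercial': 1}
print(labels)    # [0, 1, 0]
```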

You can use --help to see all the available options.

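To test the masked language model, use:

from transformers import pipeline, PreTrainedTokenizerFast

# Assumption: the tokenizer was saved as a tokenizer.json file
# and uses "<mask>" as its mask token.
tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="/path/to/tokenizer.json",
    mask_token="<mask>",
)

fill_mask = pipeline(
    "fill-mask",
    model="/path/to/model",
    tokenizer=tokenizer
)

fill_mask("Paris est la capitale de la <mask>.")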
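
For a single masked position, the fill-mask pipeline returns a list of dictionaries with the keys score, token, token_str and sequence. The sketch below shows one way to turn that list into a short ranking. The helper top_predictions is hypothetical, and the example list is a hand-written stand-in with made-up scores, not output from JuriBERT.

```python
def top_predictions(results, k=3):
    """Return the k highest-scoring (token, score) pairs from fill-mask output."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    # token_str may carry a leading space depending on the tokenizer, so strip it
    return [(r["token_str"].strip(), round(r["score"], 3)) for r in ranked[:k]]

# Hand-written placeholder in the shape the pipeline returns (scores are made up)
example = [
    {"score": 0.10, "token": 12, "token_str": " Belgique",
     "sequence": "Paris est la capitale de la Belgique."},
    {"score": 0.80, "token": 7, "token_str": " France",
     "sequence": "Paris est la capitale de la France."},
    {"score": 0.05, "token": 31, "token_str": " Suisse",
     "sequence": "Paris est la capitale de la Suisse."},
]

for token, score in top_predictions(example, k=2):
    print(token, score)
```

With a real model, pass the result of the fill_mask call in place of example.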

Owner

Stella Douka
