使用Mask LM预训练任务来预训练Bert模型。训练垂直领域语料的模型表征,提升下游任务的表现。

Overview

Pretrain_Bert_with_MaskLM

Info

使用Mask LM预训练任务来预训练Bert模型。

基于pytorch框架,训练关于垂直领域语料的预训练语言模型,目的是提升下游任务的表现。

Pretraining Task

Mask Language Model,简称Mask LM,即基于Mask机制的预训练语言模型。

同时支持 原生的MaskLM任务和Whole Words Masking任务。默认使用Whole Words Masking

MaskLM

使用来自于Bert的mask机制,即对于每一个句子中的词(token):

  • 85%的概率,保留原词不变
  • 15%的概率,使用以下方式替换
    • 80%的概率,使用字符[MASK],替换当前token。
    • 10%的概率,使用词表随机抽取的token,替换当前token。
    • 10%的概率,保留原词不变。

Whole Words Masking

与MaskLM类似,但是在mask的步骤有些少不同。

在Bert类模型中,考虑到如果单独使用整个词作为词表的话,那词表就太大了。不利于模型对同类词的不同变种的特征学习,故采用了WordPiece的方式进行分词。

Whole Words Masking的方法在于,在进行mask操作时,对象变为分词前的整个词,而非子词。

Model

使用原生的Bert模型作为基准模型。

Datasets

项目里的数据集来自wikitext,分成两个文件训练集(train.txt)和测试集(test.txt)。

数据以行为单位存储。

若想要替换成自己的数据集,可以使用自己的数据集进行替换。(注意:如果是预训练中文模型,需要修改配置文件Config.py中的self.initial_pretrain_modelself.initial_pretrain_tokenizer,将值修改成 bert-base-chinese

自己的数据集不需要做mask机制处理,代码会处理。

Training Target

本项目目的在于基于现有的预训练模型参数,如google开源的bert-base-uncasedbert-base-chinese等,在垂直领域的数据语料上,再次进行预训练任务,由此提升bert的模型表征能力,换句话说,也就是提升下游任务的表现。

Environment

项目主要使用了Huggingface的datasetstransformers模块,支持CPU、单卡单机、单机多卡三种模式。

可通过以下命令安装依赖包

    pip install -r requirement.txt

主要包含的模块如下:

    python3.6
    torch==1.3.0
    tqdm==4.61.2
    transformers==4.6.1
    datasets==1.10.2
    numpy==1.19.5
    pandas==1.1.3

Get Start

单卡模式

直接运行以下命令

    python train.py

或修改Config.py文件中的变量self.cuda_visible_devices为单卡后,运行

    chmod 755 run.sh
    ./run.sh

多卡模式

如果你足够幸运,拥有了多张GPU卡,那么恭喜你,你可以进入起飞模式。 🚀 🚀

(1)使用torch的nn.parallel.DistributedDataParallel模块进行多卡训练。其中config.py文件中参数如下,默认可以不用修改。

  • self.cuda_visible_devices表示程序可见的GPU卡号,示例:1,2→可在GPU卡号为1和2上跑,亦可以改多张,如0,1,2,3
  • self.device在单卡模式,表示程序运行的卡号;在多卡模式下,表示master的主卡,默认会变成你指定卡号的第一张卡。若只有cpu,那么可修改为cpu
  • self.port表示多卡模式下,进程通信占用的端口号。(无需修改)
  • self.init_method表示多卡模式下进程的通讯地址。(无需修改)
  • self.world_size表示启动的进程数量(无需修改)。在torch==1.3.0版本下,只需指定一个进程。在1.9.0以上,需要与GPU数量相同。

(2)运行程序启动命令

    chmod 755 run.sh
    ./run.sh

Experiment

使用交叉熵(cross-entropy)作为损失函数,困惑度(perplexity)和Loss作为评价指标来进行训练,训练过程如下:

Reference

【Bert】https://arxiv.org/pdf/1810.04805.pdf

【transformers】https://github.com/huggingface/transformers

【datasets】https://huggingface.co/docs/datasets/quicktour.html

Owner
Desmond Ng
NLP Engineer
Desmond Ng
hashily is a Python module that provides a variety of text decoding and encoding operations.

hashily is a python module that performs a variety of text decoding and encoding functions. It also various functions for encrypting and decrypting text using various ciphers.

DevMysT 5 Jul 17, 2022
CCQA A New Web-Scale Question Answering Dataset for Model Pre-Training

CCQA: A New Web-Scale Question Answering Dataset for Model Pre-Training This is the official repository for the code and models of the paper CCQA: A N

Meta Research 29 Nov 30, 2022
LSTM based Sentiment Classification using Tensorflow - Amazon Reviews Rating

LSTM based Sentiment Classification using Tensorflow - Amazon Reviews Rating (Dataset) The dataset is from Amazon Review Data (2018)

Immanuvel Prathap S 1 Jan 16, 2022
Text to speech converter with GUI made in Python.

Text-to-speech-with-GUI Text to speech converter with GUI made in Python. To run this download the zip file and run the main file or clone this repo.

SidTheMiner 1 Nov 15, 2021
Mesh TensorFlow: Model Parallelism Made Easier

Mesh TensorFlow - Model Parallelism Made Easier Introduction Mesh TensorFlow (mtf) is a language for distributed deep learning, capable of specifying

1.3k Dec 26, 2022
Just a basic Telegram AI chat bot written in Python using Pyrogram.

Nikko ChatBot Just a basic Telegram AI chat bot written in Python using Pyrogram. Requirements Python 3.7 or higher. A bot token. Installation $ https

ʀᴇxɪɴᴀᴢᴏʀ 2 Oct 21, 2022
OCR을 이용하여 인원수를 인식 후 줌을 Kill 해줍니다

How To Use killtheZoom-2.0 Windows 0. https://joyhong.tistory.com/79 이 글을 보면서 tesseract를 C:\Program Files\Tesseract-OCR 경로로 설치해주세요(한국어 언어 추가 필요) 상단의 초

김정인 9 Sep 13, 2021
Tokenizer - Module python d'analyse syntaxique et de grammaire, tokenization

Tokenizer Le Tokenizer est un analyseur lexicale, il permet, comme Flex and Yacc par exemple, de tokenizer du code, c'est à dire transformer du code e

Manolo 1 Aug 15, 2022
Pre-Training with Whole Word Masking for Chinese BERT

Pre-Training with Whole Word Masking for Chinese BERT

Yiming Cui 7.7k Dec 31, 2022
We have built a Voice based Personal Assistant for people to access files hands free in their device using natural language processing.

Voice Based Personal Assistant We have built a Voice based Personal Assistant for people to access files hands free in their device using natural lang

Rushabh 2 Nov 13, 2021
Code for our paper "Mask-Align: Self-Supervised Neural Word Alignment" in ACL 2021

Mask-Align: Self-Supervised Neural Word Alignment This is the implementation of our work Mask-Align: Self-Supervised Neural Word Alignment. @inproceed

THUNLP-MT 46 Dec 15, 2022
天池中药说明书实体识别挑战冠军方案;中文命名实体识别;NER; BERT-CRF & BERT-SPAN & BERT-MRC;Pytorch

天池中药说明书实体识别挑战冠军方案;中文命名实体识别;NER; BERT-CRF & BERT-SPAN & BERT-MRC;Pytorch

zxx飞翔的鱼 751 Dec 30, 2022
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context This repository contains the code in both PyTorch and TensorFlow for our paper

Zhilin Yang 3.3k Dec 28, 2022
Open-World Entity Segmentation

Open-World Entity Segmentation Project Website Lu Qi*, Jason Kuen*, Yi Wang, Jiuxiang Gu, Hengshuang Zhao, Zhe Lin, Philip Torr, Jiaya Jia This projec

DV Lab 408 Dec 29, 2022
Repository for fine-tuning Transformers 🤗 based seq2seq speech models in JAX/Flax.

Seq2Seq Speech in JAX A JAX/Flax repository for combining a pre-trained speech encoder model (e.g. Wav2Vec2, HuBERT, WavLM) with a pre-trained text de

Sanchit Gandhi 21 Dec 14, 2022
A single model that parses Universal Dependencies across 75 languages.

A single model that parses Universal Dependencies across 75 languages. Given a sentence, jointly predicts part-of-speech tags, morphology tags, lemmas, and dependency trees.

Dan Kondratyuk 189 Nov 29, 2022
Flaxformer: transformer architectures in JAX/Flax

Flaxformer: transformer architectures in JAX/Flax Flaxformer is a transformer library for primarily NLP and multimodal research at Google. It is used

Google 114 Dec 29, 2022
FactSumm: Factual Consistency Scorer for Abstractive Summarization

FactSumm: Factual Consistency Scorer for Abstractive Summarization FactSumm is a toolkit that scores Factualy Consistency for Abstract Summarization W

devfon 83 Jan 09, 2023
Sploitus - Command line search tool for sploitus.com. Think searchsploit, but with more POCs

Sploitus Command line search tool for sploitus.com. Think searchsploit, but with

watchdog2000 5 Mar 07, 2022
Pytorch implementation of winner from VQA Chllange Workshop in CVPR'17

2017 VQA Challenge Winner (CVPR'17 Workshop) pytorch implementation of Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challeng

Mark Dong 166 Dec 11, 2022