NLPIR tutorial: pre-training for IR. Pre-train on a raw text corpus, then fine-tune on MS MARCO Document Ranking.

Last update: Apr 07, 2022

Overview

pretrain4ir_tutorial

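This repository serves as an introduction to pre-training for IR for members of the NLPIR lab.

The code consists of the following parts:

tasks/: generates the pre-training data
pretrain/: pre-trains on the generated data (MLM + NSP)
finetune/: fine-tunes on MS MARCO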

Preinstallation

First, prepare a Python 3 environment and run the following commands:

  git clone git@github.com:zhengyima/pretrain4ir_tutorial.git pretrain4ir_tutorial
  cd pretrain4ir_tutorial
  pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple

You also need to download a BERT model checkpoint in Hugging Face transformers format and save it in a directory, referred to here as BERT_MODEL_PATH. In our paper, we use bert-base-uncased. You can download it from the official Hugging Face model hub or from the Tsinghua mirror.
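Before moving on, it can help to confirm that the directory really looks like a transformers checkpoint. The following is a minimal sketch, not part of the repository: the helper name missing_checkpoint_files is hypothetical, and the file names checked are the ones a BERT checkpoint saved by transformers usually contains (config.json, vocab.txt, and either pytorch_model.bin or model.safetensors).

```python
from pathlib import Path

# Files a Hugging Face BERT checkpoint directory usually contains.
REQUIRED = ("config.json", "vocab.txt")
# Either weight format is acceptable.
WEIGHTS = ("pytorch_model.bin", "model.safetensors")


def missing_checkpoint_files(model_dir):
    """Return the names of expected files that are absent from model_dir.

    Hypothetical helper for illustration; it is not provided by the repository.
    """
    d = Path(model_dir)
    missing = [name for name in REQUIRED if not (d / name).is_file()]
    if not any((d / name).is_file() for name in WEIGHTS):
        missing.append(" or ".join(WEIGHTS))
    return missing
```

An empty list means the directory is probably usable as BERT_MODEL_PATH; any returned names point at what still needs to be downloaded.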

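Generating pre-training data

The repository provides rand, the simplest and easiest-to-understand pre-training task. It randomly picks 1 to 5 words from a document as the query and serves as a demo of IR-oriented pre-training.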
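The idea behind rand can be sketched in a few lines. This is an illustration only, not the code in tasks/rand: the function name make_rand_query is hypothetical, and it assumes whitespace tokenization and sampling without replacement, neither of which is confirmed by the repository.

```python
import random


def make_rand_query(doc_text, min_len=1, max_len=5, rng=None):
    """Pick between min_len and max_len random words from a document as a pseudo-query.

    Assumptions (not taken from the repository): words are split on whitespace
    and are sampled without replacement.
    """
    rng = rng or random.Random()
    words = doc_text.split()
    if not words:
        return []
    # Never ask for more words than the document has.
    upper = min(max_len, len(words))
    lower = min(min_len, upper)
    k = rng.randint(lower, upper)
    return rng.sample(words, k)
```

For example, make_rand_query("pre training for information retrieval") returns a list of one to five words drawn from that sentence; the document and its sampled query then form one pre-training pair.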

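Command to generate data for the rand pre-training task: cd tasks/rand && bash gen.sh

You can write your own script, modeled on the rand task, to generate data for a pre-training task that you consider reasonable.

Notes: before running the shell script for the rand task, set the msmarco_docs_path parameter in gen.sh to the path of the MS MARCO documents tsv file, and set the bert_model parameter to the directory of the downloaded BERT model.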
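When writing your own data generation script, you will need to read the documents file yourself. A minimal sketch follows, assuming the file uses the standard MS MARCO documents layout of four tab-separated columns (docid, url, title, body); the function name iter_msmarco_docs is hypothetical and the repository may read the file differently.

```python
def iter_msmarco_docs(path):
    """Yield (docid, url, title, body) tuples from an MS MARCO documents tsv file.

    Assumes four tab-separated columns per line; lines with any other
    column count are skipped.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) != 4:
                continue  # skip malformed lines
            docid, url, title, body = parts
            yield docid, url, title, body
```

Reading lazily with a generator keeps memory use flat, which matters because the documents file is large.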

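Model pre-training

The repository provides the code for model pre-training; see pretrain. It pre-trains the model with two tasks, MLM and NSP.

Command for model pre-training: cd pretrain && bash train_bert.sh

Notes: remember to modify the corresponding parameters in train_bert.sh: set bert_model to the directory of the downloaded BERT model, and set train_file to the path of the pre-training data file generated in the previous step.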
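For readers new to MLM, the masking step can be sketched as follows. This follows the standard BERT recipe (select about 15% of tokens; replace 80% of them with [MASK], 10% with a random token, and leave 10% unchanged). Whether the code in pretrain uses exactly these ratios is an assumption, and the function name mask_tokens is hypothetical.

```python
import random


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Apply BERT-style masking and return (masked_tokens, labels).

    labels[i] holds the original token at positions selected for prediction
    and None elsewhere. The 15% / 80-10-10 ratios follow the original BERT
    recipe and are an assumption about this repository.
    """
    rng = rng or random.Random()
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]"):
            continue  # never mask special tokens
        if rng.random() >= mask_prob:
            continue  # position not selected
        labels[i] = tok
        r = rng.random()
        if r < 0.8:
            masked[i] = "[MASK]"
        elif r < 0.9:
            masked[i] = rng.choice(vocab)
        # otherwise keep the original token
    return masked, labels
```

The model is trained to recover labels[i] at the selected positions, while the NSP head is trained on the pair-level label stored in the generated data.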

Model fine-tuning

The repository provides code for fine-tuning on the MS MARCO Document Ranking task; see finetune. It implements point-wise fine-tuning on MS MARCO.

Command for model fine-tuning: cd finetune && bash train_bert.sh
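In point-wise training, each query-document pair is scored on its own and compared with a binary relevance label. A common choice of loss is binary cross-entropy on the model's raw score; the sketch below shows that loss in plain Python. It is an assumption that finetune uses this exact loss, and the function name pointwise_loss is hypothetical.

```python
import math


def pointwise_loss(score, label):
    """Binary cross-entropy on a raw relevance score (logit).

    score: the model's unnormalized score for one query-document pair.
    label: 1 for a relevant pair, 0 for a non-relevant pair.
    Uses the numerically stable form max(x, 0) - x*y + log(1 + exp(-|x|)).
    """
    return max(score, 0.0) - score * label + math.log1p(math.exp(-abs(score)))
```

A relevant pair with a high score yields a small loss, and the same pair with a low score yields a large one, so minimizing the loss pushes relevant documents toward higher scores.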

Leaderboard

Tasks	MRR@100 on dev set
PROP-MARCO	0.4201
PROP-WIKI	0.4188
BERT-Base	0.4184
rand	0.4123
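For reference, a ranking score of this kind can be computed as sketched below, assuming the leaderboard metric is mean reciprocal rank with a cutoff (MS MARCO Document Ranking is officially evaluated with MRR@100). The function name mrr_at_k and the input layout are hypothetical; the official evaluation script should be used for reported numbers.

```python
def mrr_at_k(rankings, relevant, k=100):
    """Mean reciprocal rank with cutoff k.

    rankings: dict mapping query id to a list of doc ids, best first.
    relevant: dict mapping query id to a set of relevant doc ids.
    A query with no relevant document in its top k contributes 0.
    """
    if not rankings:
        return 0.0
    total = 0.0
    for qid, docs in rankings.items():
        rel = relevant.get(qid, set())
        for rank, docid in enumerate(docs[:k], start=1):
            if docid in rel:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(rankings)
```

For instance, if one query has its relevant document at rank 2 and another has none in the top k, the result is (1/2 + 0) / 2 = 0.25.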

Homework

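Design a pre-training task that you consider reasonable, pre-train the BERT model with it, fine-tune on MS MARCO, and update the Leaderboard with your result on the dev set.

What you need to do:

Write your own pre-training data generation script and put it in the tasks/yourtask directory.
Use that script to generate your own pre-training data.
Run the pre-training and fine-tuning scripts provided by the repository, obtain the results, and update the Leaderboard.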

Owner

ZYMa
