SIGIR'22 paper: Axiomatically Regularized Pre-training for Ad hoc Search

Last update: Nov 09, 2022

Overview

Introduction

This codebase contains the source code of ARES, the Python implementation of our SIGIR 2022 paper.

Chen, Jia, et al. "Axiomatically Regularized Pre-training for Ad hoc Search." To Appear in the Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2022.

Requirements

python 3.7
torch==1.9.0
transformers==4.9.2
tqdm, nltk, numpy, boto3
trec_eval for evaluation on TREC DL 2019
anserini for generating "RANK" axiom scores

Why this repo?

In this repo, you can pre-train ARES_simple and Transformer_ICT models, and fine-tune all pre-trained models with the same architecture as BERT. The papers are listed as follows:

You can download the pre-trained ARES checkpoint (ARES_simple) from Google Drive and extract it.

Pre-training Data

Download data

Download the MS MARCO corpus from the official website.
Download the ADORE+STAR Top100 Candidates files from this repo.

Pre-process data

To save memory, we store most files using the numpy memmap or jsonl format in the ./preprocess directory.

Document files:

doc_token_ids.memmap: each line is the token ids for a document
docid2idx.json: {docid: memmap_line_id}
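The two document files could be read back together roughly as follows. This is a minimal sketch, not the repo's loader: it assumes the memmap is a 2-D int32 array of shape (num_docs, max_len) padded with 0, none of which is stated in this README. The helper names load_doc_tokens and get_doc are hypothetical, and the sketch builds a toy two-document file so that it runs end to end.

```python
import json
import os
import tempfile

import numpy as np

# Assumptions (not stated in this README): int32 dtype, fixed row length, 0 as padding.
MAX_DOC_LEN = 8
PAD_ID = 0


def load_doc_tokens(memmap_path, docid2idx_path, num_docs, max_len=MAX_DOC_LEN):
    """Open the token-id memmap read-only and return it with the docid -> row mapping."""
    with open(docid2idx_path) as f:
        docid2idx = json.load(f)
    tokens = np.memmap(memmap_path, dtype=np.int32, mode="r", shape=(num_docs, max_len))
    return tokens, docid2idx


def get_doc(tokens, docid2idx, docid):
    """Return the unpadded token ids of one document as a Python list."""
    row = tokens[docid2idx[docid]]
    return [int(t) for t in row if t != PAD_ID]


# Build a toy two-document file (made-up ids) so the sketch is runnable.
tmp_dir = tempfile.mkdtemp()
memmap_path = os.path.join(tmp_dir, "doc_token_ids.memmap")
idx_path = os.path.join(tmp_dir, "docid2idx.json")

toy = np.memmap(memmap_path, dtype=np.int32, mode="w+", shape=(2, MAX_DOC_LEN))
toy[:] = PAD_ID
toy[0, :3] = [101, 2023, 102]
toy[1, :4] = [101, 2003, 1037, 102]
toy.flush()
with open(idx_path, "w") as f:
    json.dump({"D1": 0, "D2": 1}, f)

tokens, docid2idx = load_doc_tokens(memmap_path, idx_path, num_docs=2)
print(get_doc(tokens, docid2idx, "D2"))
```

Because the memmap is opened with mode="r", only the rows that are actually indexed get paged into memory, which is the point of storing the corpus this way.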

Query files:

queries.doctrain.jsonl: MS MARCO training queries, {"id": qid, "ids": token_ids} for each line
queries.docdev.jsonl: MS MARCO validation queries, {"id": qid, "ids": token_ids} for each line
queries.dl2019.jsonl: TREC DL 2019 queries, {"id": qid, "ids": token_ids} for each line
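A query file in this JSON-lines layout could be parsed as in the sketch below. The function name read_queries is hypothetical, and the toy records use made-up ids in place of a real file.

```python
import io
import json


def read_queries(lines):
    """Parse JSON-lines query records into a {qid: token_ids} dict."""
    queries = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        queries[record["id"]] = record["ids"]
    return queries


# Toy stand-in for an open queries.*.jsonl file; the ids below are made up.
toy_file = io.StringIO(
    '{"id": "q1", "ids": [101, 2054, 102]}\n'
    '{"id": "q2", "ids": [101, 2129, 2000, 102]}\n'
)
queries = read_queries(toy_file)
print(len(queries), queries["q2"])
```

For a real file, pass an open file handle instead of the StringIO object, for example read_queries(open(path)).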

Human label files:

msmarco-doctrain-qrels.tsv: qid 0 docid 1 for training set
dev-qrels.txt: qid relevant_docid for validation set
2019qrels-docs.txt: qid relevant_docid for TREC DL 2019 set
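Since the label files come in two layouts (four columns for training, two columns for the other sets, as described above), a single reader could handle both. The sketch below is an assumption about how one might do this, not the repo's code; read_qrels is a hypothetical name and the toy lines are made up.

```python
from collections import defaultdict


def read_qrels(lines):
    """Return {qid: set of relevant docids} from 4-column or 2-column qrels lines."""
    qrels = defaultdict(set)
    for line in lines:
        fields = line.split()
        if len(fields) == 4:      # qid 0 docid label
            qid, _, docid, label = fields
            if int(label) <= 0:
                continue          # keep only positively labelled pairs
        elif len(fields) == 2:    # qid relevant_docid
            qid, docid = fields
        else:
            continue              # skip blank or malformed lines
        qrels[qid].add(docid)
    return dict(qrels)


train_qrels = read_qrels(["q1 0 D1 1", "q1 0 D4 1", "q3 0 D2 0"])
dev_qrels = read_qrels(["q2 D7", ""])
print(train_qrels, dev_qrels)
```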

Top 100 candidate files:

train.rank.tsv, dev.rank.tsv, test.rank.tsv: qid docid rank for each line
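The candidate files can be grouped into one ranked list per query. A minimal sketch, assuming whitespace-separated columns and that a smaller rank value means a better position; read_candidates is a hypothetical helper and the toy lines are made up.

```python
def read_candidates(lines, top_k=100):
    """Return {qid: [docid, ...]} ordered by rank, keeping at most top_k per query."""
    grouped = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        qid, docid, rank = fields[0], fields[1], int(fields[2])
        grouped.setdefault(qid, []).append((rank, docid))
    # Sort each query's candidates by rank, then drop the rank values.
    return {
        qid: [docid for _, docid in sorted(pairs)[:top_k]]
        for qid, pairs in grouped.items()
    }


toy_lines = ["q1\tD5\t2", "q1\tD3\t1", "q2\tD9\t1"]
candidates = read_candidates(toy_lines)
print(candidates)
```

Sorting before truncating means the sketch does not depend on whether ranks in the files start at 0 or 1.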

Pseudo queries and axiomatic features:

doc2qs.jsonl: {"docid": docid, "queries": [qids]} for each line
sample_qs_token_ids.memmap: each line is the token ids for a pseudo query
sample_qid2id.json: {qid: memmap_line_id}
axiom.memmap: where axiom is one of ['rank', 'prox-1', 'prox-2', 'rep-ql', 'rep-tfidf', 'reg', 'stm-1', 'stm-2', 'stm-3']; each line is an axiomatic score for a query
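To illustrate how these files relate, the sketch below looks up one axiomatic score for each pseudo query of a document and orders the queries by it. It rests on several assumptions not confirmed here: that a score file is a 1-D float array indexed by the same line id as sample_qid2id.json, and that a higher score is better. A plain NumPy array stands in for the memmap, and rank_pseudo_queries is a hypothetical name.

```python
import json

import numpy as np


def rank_pseudo_queries(doc_record, qid2row, scores):
    """Sort a document's pseudo queries by one axiomatic score, highest first."""
    scored = [(float(scores[qid2row[qid]]), qid) for qid in doc_record["queries"]]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [qid for _, qid in scored]


# Toy data in the layouts described above (all values are made up).
doc_record = json.loads('{"docid": "D1", "queries": ["pq0", "pq1", "pq2"]}')
qid2row = {"pq0": 0, "pq1": 1, "pq2": 2}            # stands in for sample_qid2id.json
scores = np.array([0.2, 0.9, 0.5], dtype=np.float32)  # stands in for one axiom memmap

print(rank_pseudo_queries(doc_record, qid2row, scores))
```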

Quick Start

To accelerate training, we use distributed (multi-GPU) parallel training. The scripts for pre-training and fine-tuning are as follows:

Pre-training

export BERT_DIR=/path/to/bert-base/
export XGB_DIR=/path/to/xgboost.model

cd pretrain

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 NCCL_BLOCKING_WAIT=1 \
python -m torch.distributed.launch --nproc_per_node=6 --nnodes=1 train.py \
        --model_type ARES \
        --PRE_TRAINED_MODEL_NAME "$BERT_DIR" \
        --gpu_num 6 --world_size 6 \
        --MLM --axiom REP RANK REG PROX STM \
        --clf_model "$XGB_DIR"

Here, --model_type can be ARES or ICT.

Zero-shot evaluation (based on AS top100)

export MODEL_DIR=/path/to/ares-simple/
export CKPT_NAME=ares.ckpt

cd finetune

CUDA_VISIBLE_DEVICES=0 python train.py \
        --test \
        --PRE_TRAINED_MODEL_NAME "$MODEL_DIR" \
        --model_type ARES \
        --model_name ARES_simple \
        --load_ckpt \
        --model_path "$CKPT_NAME"

You can get:

#####################
<----- MS Dev ----->
MRR @10: 0.2991
MRR @100: 0.3130
QueriesRanked: 5193
#####################

on the MS MARCO dev set, and:

#############################
<--------- DL 2019 --------->
QueriesRanked: 43
nDCG @10: 0.5955
nDCG @100: 0.4863
#############################

on the DL 2019 set.
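For readers unfamiliar with the MRR @k figures reported above, the metric can be computed as in the following sketch. This is the standard definition of mean reciprocal rank with a cutoff, not necessarily the exact evaluation code used in this repo; mrr_at_k is a hypothetical name, the toy run and labels are made up, and the mean is taken over the queries present in the run.

```python
def mrr_at_k(run, qrels, k=10):
    """Mean reciprocal rank of the first relevant document within the top k.

    run:   {qid: [docid, ...]} ranked lists, best first
    qrels: {qid: set of relevant docids}
    """
    total = 0.0
    for qid, ranked in run.items():
        relevant = qrels.get(qid, set())
        for position, docid in enumerate(ranked[:k], start=1):
            if docid in relevant:
                total += 1.0 / position  # only the first hit counts
                break
    return total / len(run) if run else 0.0


toy_run = {"q1": ["D3", "D1", "D2"], "q2": ["D9", "D8", "D7"]}
toy_qrels = {"q1": {"D1"}, "q2": {"D5"}}

# q1 contributes 1/2 (first relevant at rank 2), q2 contributes 0.
print(mrr_at_k(toy_run, toy_qrels, k=10))
```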

Fine-tuning

export MODEL_DIR=/path/to/ares-simple/

cd finetune

CUDA_VISIBLE_DEVICES=0,1,2,3 NCCL_BLOCKING_WAIT=1 \
python -m torch.distributed.launch --nproc_per_node=4 --nnodes=1 train.py \
        --model_type ARES \
        --distributed_train \
        --PRE_TRAINED_MODEL_NAME "$MODEL_DIR" \
        --gpu_num 4 --world_size 4 \
        --model_name ARES_simple

Visualization

export MODEL_DIR=/path/to/ares-simple/
export SAVE_DIR=/path/to/output/
export CKPT_NAME=ares.ckpt

cd visualization

CUDA_VISIBLE_DEVICES=0 python visual.py \
    --PRE_TRAINED_MODEL_NAME "$MODEL_DIR" \
    --model_name ARES_simple \
    --visual_q_num 1 \
    --visual_d_num 5 \
    --save_path "$SAVE_DIR" \
    --model_path "$CKPT_NAME"

Results

Zero-shot performance:

Model Name	MS MARCO MRR@10	MS MARCO MRR@100	DL nDCG@10	DL nDCG@100	COVID	EQ
BM25	0.2962	0.3107	0.5776	0.4795	0.4857	0.6690
BERT	0.1820	0.2012	0.4059	0.4198	0.4314	0.6055
PROP_wiki	0.2429	0.2596	0.5088	0.4525	0.4857	0.5991
PROP_marco	0.2763	0.2914	0.5317	0.4623	0.4829	0.6454
ARES_strict	0.2630	0.2785	0.4942	0.4504	0.4786	0.6923
ARES_hard	0.2627	0.2780	0.5189	0.4613	0.4943	0.6822
ARES_simple	0.2991	0.3130	0.5955	0.4863	0.4957	0.6916

Few-shot performance:

Visualization (attribution values have been normalized within a document):

Citation

If you find our work useful, please consider starring this repo and citing our work:

@inproceedings{chen2022axiomatically,
  title={Axiomatically Regularized Pre-training for Ad hoc Search},
  author={Chen, Jia and Liu, Yiqun and Fang, Yan and Mao, Jiaxin and Fang, Hui and Yang, Shenghao and Xie, Xiaohui and Zhang, Min and Ma, Shaoping},
  booktitle={Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  year={2022}
}

Notice

Please make sure that all the pre-trained model parameters have been loaded correctly, or the zero-shot and the fine-tuning performance will be greatly impacted.
We welcome anyone who would like to contribute to this repo. 🤗
If you have any other questions, please feel free to contact me by email or open an issue.
Code for data preprocessing will come soon. Please stay tuned~

Owner

Jia Chen
