[ACL 2022] LinkBERT: A Knowledgeable Language Model ๐Ÿ˜Ž Pretrained with Document Links

Overview

LinkBERT: A Knowledgeable Language Model Pretrained with Document Links

PRs Welcome arXiv PWC PWC

This repo provides the model, code & data of our paper: LinkBERT: Pretraining Language Models with Document Links (ACL 2022). [PDF] [HuggingFace Models]

Overview

LinkBERT is a new pretrained language model (improvement of BERT) that captures document links such as hyperlinks and citation links to include knowledge that spans across multiple documents. Specifically, it was pretrained by feeding linked documents into the same language model context, besides using a single document as in BERT.

LinkBERT can be used as a drop-in replacement for BERT. It achieves better performance for general language understanding tasks (e.g. text classification), and is also particularly effective for knowledge-intensive tasks (e.g. question answering) and cross-document tasks (e.g. reading comprehension, document retrieval).

1. Pretrained Models

We release the pretrained LinkBERT (-base and -large sizes) for both the general domain and biomedical domain. These models have the same format as the HuggingFace BERT models, and you can easily switch them with LinkBERT models.

Model Size Domain Pretraining Corpus Download Link ( ๐Ÿค— HuggingFace)
LinkBERT-base 110M parameters General Wikipedia with hyperlinks michiyasunaga/LinkBERT-base
LinkBERT-large 340M parameters General Wikipedia with hyperlinks michiyasunaga/LinkBERT-large
BioLinkBERT-base 110M parameters Biomedicine PubMed with citation links michiyasunaga/BioLinkBERT-base
BioLinkBERT-large 340M parameters Biomedicine PubMed with citation links michiyasunaga/BioLinkBERT-large

To use these models in ๐Ÿค— Transformers:

from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('michiyasunaga/LinkBERT-large')
model = AutoModel.from_pretrained('michiyasunaga/LinkBERT-large')
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)

To fine-tune the models, see Section 2 & 3 below. When fine-tuned on downstream tasks, LinkBERT achieves the following results.
General benchmarks (MRQA and GLUE):

HotpotQA TriviaQA SearchQA NaturalQ NewsQA SQuAD GLUE
F1 F1 F1 F1 F1 F1 Avg score
BERT-base 76.0 70.3 74.2 76.5 65.7 88.7 79.2
LinkBERT-base 78.2 73.9 76.8 78.3 69.3 90.1 79.6
BERT-large 78.1 73.7 78.3 79.0 70.9 91.1 80.7
LinkBERT-large 80.8 78.2 80.5 81.0 72.6 92.7 81.1

Biomedical benchmarks (BLURB, MedQA, MMLU, etc): BioLinkBERT attains new state-of-the-art ๐Ÿ˜Š

BLURB score PubMedQA BioASQ MedQA-USMLE
PubmedBERT-base 81.10 55.8 87.5 38.1
BioLinkBERT-base 83.39 70.2 91.4 40.0
BioLinkBERT-large 84.30 72.2 94.8 44.6
MMLU-professional medicine
GPT-3 (175 params) 38.7
UnifiedQA (11B params) 43.2
BioLinkBERT-large (340M params) 50.7

2. Set up environment and data

Environment

Run the following commands to create a conda environment:

conda create -n linkbert python=3.8
source activate linkbert
pip install torch==1.10.1+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
pip install transformers==4.9.1 datasets==1.11.0 fairscale==0.4.0 wandb sklearn seqeval

Data

You can download the preprocessed datasets on which we evaluated LinkBERT from [here]. Simply download this zip file and unzip it. This includes:

  • MRQA question answering datasets (HotpotQA, TriviaQA, NaturalQuestions, SearchQA, NewsQA, SQuAD)
  • BLURB biomedical NLP datasets (PubMedQA, BioASQ, HoC, Chemprot, PICO, etc.)
  • MedQA-USMLE biomedical reasoning dataset.
  • MMLU-professional medicine reasoning dataset.

They are all preprocessed in the HuggingFace dataset format.

If you would like to preprocess the raw data from scratch, you can take the following steps:

  • First download the raw datasets from the original sources by following instructions in scripts/download_raw_data.sh
  • Then run the preprocessing scripts scripts/preprocess_{mrqa,blurb,medqa,mmlu}.py.

3. Fine-tune LinkBERT

Change the working directory to src/, and follow the instructions below for each dataset.

MRQA

To fine-tune for the MRQA datasets (HotpotQA, TriviaQA, NaturalQuestions, SearchQA, NewsQA, SQuAD), run commands listed in run_examples_mrqa_linkbert-{base,large}.sh.

BLURB

To fine-tune for the BLURB biomedial datasets (PubMedQA, BioASQ, HoC, Chemprot, PICO, etc.), run commands listed in run_examples_blurb_biolinkbert-{base,large}.sh.

MedQA & MMLU

To fine-tune for the MedQA-USMLE dataset, run commands listed in run_examples_medqa_biolinkbert-{base,large}.sh.

To evaluate the fine-tuned model additionally on MMLU-professional medicine, run the commands listed at the bottom of run_examples_medqa_biolinkbert-large.sh.

Reproducibility

We also provide Codalab worksheet, on which we record our experiments. You may find it useful for replicating the experiments using the same model, code, data, and environment.

Citation

If you find our work helpful, please cite the following:

@InProceedings{yasunaga2022linkbert,
  author =  {Michihiro Yasunaga and Jure Leskovec and Percy Liang},
  title =   {LinkBERT: Pretraining Language Models with Document Links},
  year =    {2022},  
  booktitle = {Association for Computational Linguistics (ACL)},  
}
Owner
Michihiro Yasunaga
PhD Student in Computer Science
Michihiro Yasunaga
Code for our ACL 2021 paper - ConSERT: A Contrastive Framework for Self-Supervised Sentence Representation Transfer

ConSERT Code for our ACL 2021 paper - ConSERT: A Contrastive Framework for Self-Supervised Sentence Representation Transfer Requirements torch==1.6.0

Yan Yuanmeng 478 Dec 25, 2022
Segmentation models with pretrained backbones. PyTorch.

Python library with Neural Networks for Image Segmentation based on PyTorch. The main features of this library are: High level API (just two lines to

Pavel Yakubovskiy 6.6k Jan 06, 2023
[NeurIPS 2021] Large Scale Learning on Non-Homophilous Graphs: New Benchmarks and Strong Simple Methods

Large Scale Learning on Non-Homophilous Graphs: New Benchmarks and Strong Simple Methods Large Scale Learning on Non-Homophilous Graphs: New Benchmark

60 Jan 03, 2023
Husein pet projects in here!

project-suka-suka Husein pet projects in here! List of projects mysejahtera-density. Generate resolution points using meshgrid and request each points

HUSEIN ZOLKEPLI 47 Dec 09, 2022
A Python implementation of the Locality Preserving Matching (LPM) method for pruning outliers in image matching.

LPM_Python A Python implementation of the Locality Preserving Matching (LPM) method for pruning outliers in image matching. The code is established ac

AoxiangFan 11 Nov 07, 2022
[ICCV 2021] Encoder-decoder with Multi-level Attention for 3D Human Shape and Pose Estimation

MAED: Encoder-decoder with Multi-level Attention for 3D Human Shape and Pose Estimation Getting Started Our codes are implemented and tested with pyth

ZiNiU WaN 176 Dec 15, 2022
Official PyTorch Implementation of Convolutional Hough Matching Networks, CVPR 2021 (oral)

Convolutional Hough Matching Networks This is the implementation of the paper "Convolutional Hough Matching Network" by J. Min and M. Cho. Implemented

Juhong Min 70 Nov 22, 2022
Deduplicating Training Data Makes Language Models Better

Deduplicating Training Data Makes Language Models Better This repository contains code to deduplicate language model datasets as descrbed in the paper

Google Research 431 Dec 27, 2022
Deep Neural Networks Improve Radiologists' Performance in Breast Cancer Screening

Deep Neural Networks Improve Radiologists' Performance in Breast Cancer Screening Introduction This is an implementation of the model used for breast

757 Dec 30, 2022
Automatic tool focused on deriving metallicities of open clusters

metalcode Automatic tool focused on deriving metallicities of open clusters. Based on the method described in Pรถhnl & Paunzen (2010, https://ui.adsabs

2 Dec 13, 2021
Repository for training material for the 2022 SDSC HPC/CI User Training Course

hpc-training-2022 Repository for training material for the 2022 SDSC HPC/CI Training Series HPC/CI Training Series home https://www.sdsc.edu/event_ite

sdsc-hpc-training-org 21 Jul 27, 2022
Learning Super-Features for Image Retrieval

Learning Super-Features for Image Retrieval This repository contains the code for running our FIRe model presented in our ICLR'22 paper: @inproceeding

NAVER 101 Dec 28, 2022
Easily benchmark PyTorch model FLOPs, latency, throughput, max allocated memory and energy consumption

โฑ pytorch-benchmark Easily benchmark model inference FLOPs, latency, throughput, max allocated memory and energy consumption Install pip install pytor

Lukas Hedegaard 21 Dec 22, 2022
Code and datasets for the paper "Combining Events and Frames using Recurrent Asynchronous Multimodal Networks for Monocular Depth Prediction" (RA-L, 2021)

Combining Events and Frames using Recurrent Asynchronous Multimodal Networks for Monocular Depth Prediction This is the code for the paper Combining E

Robotics and Perception Group 69 Dec 26, 2022
PyTorch implementation of "Dataset Knowledge Transfer for Class-Incremental Learning Without Memory" (WACV2022)

Dataset Knowledge Transfer for Class-Incremental Learning Without Memory [Paper] [Slides] Summary Introduction Installation Reproducing results Citati

Habib Slim 5 Dec 05, 2022
JAXMAPP: JAX-based Library for Multi-Agent Path Planning in Continuous Spaces

JAXMAPP: JAX-based Library for Multi-Agent Path Planning in Continuous Spaces JAXMAPP is a JAX-based library for multi-agent path planning (MAPP) in c

OMRON SINIC X 24 Dec 28, 2022
Deep learning (neural network) based remote photoplethysmography: how to extract pulse signal from video using deep learning tools

Deep-rPPG: Camera-based pulse estimation using deep learning tools Deep learning (neural network) based remote photoplethysmography: how to extract pu

Terbe Dรกniel 138 Dec 17, 2022
PyTorch implementation for Graph Contrastive Learning with Augmentations

Graph Contrastive Learning with Augmentations PyTorch implementation for Graph Contrastive Learning with Augmentations [poster] [appendix] Yuning You*

Shen Lab at Texas A&M University 382 Dec 15, 2022
Code for Parameter Prediction for Unseen Deep Architectures (NeurIPS 2021)

Parameter Prediction for Unseen Deep Architectures (NeurIPS 2021) authors: Boris Knyazev, Michal Drozdzal, Graham Taylor, Adriana Romero-Soriano Overv

Facebook Research 462 Jan 03, 2023
Classifying audio using Wavelet transform and deep learning

Audio Classification using Wavelet Transform and Deep Learning A step-by-step tutorial to classify audio signals using continuous wavelet transform (C

Aditya Dutt 17 Nov 29, 2022