An Open-Source Package for Information Retrieval.

Overview

OpenMatch

An Open-Source Package for Information Retrieval.

😃 What's New

  • Top Spot on TREC-COVID Challenge (May 2020, Round2)

    The twin goals of the challenge are to evaluate search algorithms and systems for helping scientists, clinicians, policy makers, and others manage the existing and rapidly growing corpus of scientific literature related to COVID-19, and to discover methods that will assist with managing scientific information in future global biomedical crises.
    >> Reproduce Our Submit >> About COVID-19 Dataset >> Our Paper

Overview

OpenMatch integrates excellent neural methods and technologies to provide a complete solution for deep text matching and understanding. The documentation and tutorial of OpenMatch are available at here.

1/ Document Retrieval

Document Retrieval refers to extracting a set of related documents from large-scale document-level data based on user queries.

* Sparse Retrieval

Sparse Retriever is defined as a sparse bag-of-words retrieval model.

* Dense Retrieval

Dense Retriever performs retrieval by encoding documents and queries into dense low-dimensional vectors, and selecting the document that has the highest inner product with the query

2/ Document Reranking

Document reranking aims to further match user query and documents retrieved by the previous step with the purpose of obtaining a ranked list of relevant documents.

* Neural Ranker

Neural Ranker uses neural network as ranker to reorder documents.

* Feature Ensemble

Feature Ensemble can fuse neural features learned by neural ranker with the features of non-neural methods to obtain more robust performance

3/ Domain Transfer Learning

Domain Transfer Learning can leverages external knowledge graphs or weak supervision data to guide and help ranker to overcome data scarcity.

* Knowledge Enhancemnet

Knowledge Enhancement incorporates entity semantics of external knowledge graphs to enhance neural ranker.

* Data Augmentation

Data Augmentation leverages weak supervision data to improve the ranking accuracy in certain areas that lacks large scale relevance labels.

Stage Model Paper
1/ Sparse Retrieval BM25 Best Match25 ~Tool
1/ Dense Retrieval ANN Approximate nearest neighbor ~Tool
2/ Neural Ranker K-NRM End-to-End Neural Ad-hoc Ranking with Kernel Pooling ~Paper
2/ Neural Ranker Conv-KNRM Convolutional Neural Networks for Soft-Matching N-Grams in Ad-hoc Search ~Paper
2/ Neural Ranker TK Interpretable & Time-Budget-Constrained Contextualization for Re-Ranking ~Paper
2/ Neural Ranker BERT BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding ~Paper
2/ Feature Ensemble Coordinate Ascent Linear feature-based models for information retrieval. Information Retrieval ~Paper
3/ Knowledge Enhancement EDRM Entity-Duet Neural Ranking: Understanding the Role of Knowledge Graph Semantics in Neural Information Retrieval ~Paper
3/ Data Augmentation ReInfoSelect Selective Weak Supervision for Neural Information Retrieval ~Paper

Note that the BERT model is following huggingface's implementation - transformers, so other bert-like models are also available in our toolkit, e.g. electra, scibert.

Installation

* From PyPI

pip install git+https://github.com/thunlp/OpenMatch.git

* From Source

git clone https://github.com/thunlp/OpenMatch.git
cd OpenMatch
python setup.py install

* From Docker

To build an OpenMatch docker image from Dockerfile

docker build -t <image_name> .

To run your docker image just built above as a container

docker run --gpus all --name=<container_name> -it -v /:/all/ --rm <image_name>:<TAG>

Quick Start

* Detailed examples are available here.

import torch
import OpenMatch as om

query = "Classification treatment COVID-19"
doc = "By retrospectively tracking the dynamic changes of LYM% in death cases and cured cases, this study suggests that lymphocyte count is an effective and reliable indicator for disease classification and prognosis in COVID-19 patients."

* For bert-like models:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
input_ids = tokenizer.encode(query, doc)
model = om.models.Bert("allenai/scibert_scivocab_uncased")
ranking_score, ranking_features = model(torch.tensor(input_ids).unsqueeze(0))

* For other models:

tokenizer = om.data.tokenizers.WordTokenizer(pretrained="./data/glove.6B.300d.txt")
query_ids, query_masks = tokenizer.process(query, max_len=16)
doc_ids, doc_masks = tokenizer.process(doc, max_len=128)
model = om.models.KNRM(vocab_size=tokenizer.get_vocab_size(),
                       embed_dim=tokenizer.get_embed_dim(),
                       embed_matrix=tokenizer.get_embed_matrix())
ranking_score, ranking_features = model(torch.tensor(query_ids).unsqueeze(0),
                                        torch.tensor(query_masks).unsqueeze(0),
                                        torch.tensor(doc_ids).unsqueeze(0),
                                        torch.tensor(doc_masks).unsqueeze(0))

* The GloVe can be downloaded using:

wget http://nlp.stanford.edu/data/glove.6B.zip -P ./data
unzip ./data/glove.6B.zip -d ./data

* Evaluation

metric = om.Metric()
res = metric.get_metric(qrels, ranking_list, 'ndcg_cut_20')
res = metric.get_mrr(qrels, ranking_list, 'mrr_cut_10')

Experiments

* Ad-hoc Search

Retriever Reranker Coor-Ascent ClueWeb09 Robust04 ClueWeb12
SDM KNRM - 0.1880 0.3016 0.0968
SDM Conv-KNRM - 0.1894 0.2907 0.0896
SDM EDRM - 0.2015 0.2993 0.0937
SDM TK - 0.2306 0.2822 0.0966
SDM BERT Base - 0.2701 0.4168 0.1183
SDM ELECTRA Base - 0.2861 0.4668 0.1078

* MS MARCO Passage Ranking

Retriever Reranker Coor-Ascent dev eval
BM25 BERT Base - 0.349 0.345
BM25 ELECTRA Base - 0.352 0.344
BM25 RoBERTa Large - 0.386 0.375
BM25 ELECTRA Large - 0.388 0.376

* MS MARCO Document Ranking

Retriever Reranker Coor-Ascent dev eval
ANCE FirstP - - 0.373 0.334
ANCE MaxP - - 0.383 0.342
ANCE FirstP+BM25 BERT Base FirstP + 0.431 0.380
ANCE MaxP BERT Base MaxP + 0.432 0.391

* Classic Features

Methods ClueWeb09-B Robust04 TREC-COVID
[email protected] [email protected] [email protected] [email protected] [email protected] [email protected]
BM25 (Anserini) 0.2773 0.1426 0.4129 0.1117 0.6979 0.7670
RankSVM (Dai et al.) 0.289 n.a. 0.420 n.a. n.a. n.a.
RankSVM (OpenMatch) 0.2825 0.1476 0.4309 0.1173 0.6995 0.7570
Coor-Ascent (Dai et al.) 0.295 n.a. 0.427 n.a. n.a. n.a.
Coor-Ascent (OpenMatch) 0.2969 0.1581 0.4340 0.1171 0.7041 0.7770

Contribution

Thanks to all the people who contributed to OpenMatch!

Kaitao Zhang, Si Sun, Zhenghao Liu, Aowei Lu

Project Organizers

  • Zhiyuan Liu
  • Chenyan Xiong
  • Maosong Sun

Citation

@inproceedings{openmatch,
  author = {Liu, Zhenghao and Zhang, Kaitao and Xiong, Chenyan and Liu, Zhiyuan and Sun, Maosong},
  title = {OpenMatch: An Open Source Library for Neu-IR Research},
  booktitle = {Proceedings of SIGIR},
  year = {2021},
  url = {https://doi.org/10.1145/3404835.3462789},
  pages = {2531–2535}
}
Owner
THUNLP
Natural Language Processing Lab at Tsinghua University
THUNLP
Numerical-computing-is-fun - Learning numerical computing with notebooks for all ages.

As much as this series is to educate aspiring computer programmers and data scientists of all ages and all backgrounds, it is also a reminder to mysel

EKA foundation 758 Dec 25, 2022
Supporting code for the paper "Dangers of Bayesian Model Averaging under Covariate Shift"

Dangers of Bayesian Model Averaging under Covariate Shift This repository contains the code to reproduce the experiments in the paper Dangers of Bayes

Pavel Izmailov 25 Sep 21, 2022
Keyword-BERT: Keyword-Attentive Deep Semantic Matching

project discription An implementation of the Keyword-BERT model mentioned in my paper Keyword-Attentive Deep Semantic Matching (Plz cite this github r

1 Nov 14, 2021
PyTea: PyTorch Tensor shape error analyzer

PyTea: PyTorch Tensor Shape Error Analyzer paper project page Requirements node.js = 12.x python = 3.8 z3-solver = 4.8 How to install and use # ins

ROPAS Lab. 240 Jan 02, 2023
PyTorch implementation of Barlow Twins.

Barlow Twins: Self-Supervised Learning via Redundancy Reduction PyTorch implementation of Barlow Twins. @article{zbontar2021barlow, title={Barlow Tw

Facebook Research 839 Dec 29, 2022
Learning Facial Representations from the Cycle-consistency of Face (ICCV 2021)

Learning Facial Representations from the Cycle-consistency of Face (ICCV 2021) This repository contains the code for our ICCV2021 paper by Jia-Ren Cha

Jia-Ren Chang 40 Dec 27, 2022
Minimal fastai code needed for working with pytorch

fastai_minima A mimal version of fastai with the barebones needed to work with Pytorch #all_slow Install pip install fastai_minima How to use This lib

Zachary Mueller 14 Oct 21, 2022
Histocartography is a framework bringing together AI and Digital Pathology

Documentation | Paper Welcome to the histocartography repository! histocartography is a python-based library designed to facilitate the development of

155 Nov 23, 2022
Code for "Modeling Indirect Illumination for Inverse Rendering", CVPR 2022

Modeling Indirect Illumination for Inverse Rendering Project Page | Paper | Data Preparation Set up the python environment conda create -n invrender p

ZJU3DV 116 Jan 03, 2023
Deep Reinforcement Learning for mobile robot navigation in ROS Gazebo simulator

DRL-robot-navigation Deep Reinforcement Learning for mobile robot navigation in ROS Gazebo simulator. Using Twin Delayed Deep Deterministic Policy Gra

87 Jan 07, 2023
Pytorch implementation for Patient Knowledge Distillation for BERT Model Compression

Patient Knowledge Distillation for BERT Model Compression Knowledge distillation for BERT model Installation Run command below to install the environm

Siqi 180 Dec 19, 2022
Transfer Learning library for Deep Neural Networks.

Transfer and meta-learning in Python Each folder in this repository corresponds to a method or tool for transfer/meta-learning. xfer-ml is a standalon

Amazon 245 Dec 08, 2022
Low-code/No-code approach for deep learning inference on devices

EzEdgeAI A concept project that uses a low-code/no-code approach to implement deep learning inference on devices. It provides a componentized framewor

On-Device AI Co., Ltd. 7 Apr 05, 2022
This repository contains the code for the paper "Hierarchical Motion Understanding via Motion Programs"

Hierarchical Motion Understanding via Motion Programs (CVPR 2021) This repository contains the official implementation of: Hierarchical Motion Underst

Sumith Kulal 40 Dec 05, 2022
This repository allows the user to automatically scale a 3D model/mesh/point cloud on Agisoft Metashape

Metashape-Utils This repository allows the user to automatically scale a 3D model/mesh/point cloud on Agisoft Metashape, given a set of 2D coordinates

INSCRIBE 4 Nov 07, 2022
Knowledge Management for Humans using Machine Learning & Tags

HyperTag HyperTag helps humans intuitively express how they think about their files using tags and machine learning.

Ravn Tech, Inc. 165 Nov 04, 2022
This repo is the code release of EMNLP 2021 conference paper "Connect-the-Dots: Bridging Semantics between Words and Definitions via Aligning Word Sense Inventories".

Connect-the-Dots: Bridging Semantics between Words and Definitions via Aligning Word Sense Inventories This repo is the code release of EMNLP 2021 con

12 Nov 22, 2022
"Learning Free Gait Transition for Quadruped Robots vis Phase-Guided Controller"

PhaseGuidedControl The current version is developed based on the old version of RaiSim series, and possibly requires further modification. It will be

X-Mechanics 12 Oct 21, 2022
Fast and scalable uncertainty quantification for neural molecular property prediction, accelerated optimization, and guided virtual screening.

Evidential Deep Learning for Guided Molecular Property Prediction and Discovery Ava Soleimany*, Alexander Amini*, Samuel Goldman*, Daniela Rus, Sangee

Alexander Amini 75 Dec 15, 2022
Simultaneous Demand Prediction and Planning

Simultaneous Demand Prediction and Planning Dependencies Python packages: Pytorch, scikit-learn, Pandas, Numpy, PyYAML Data POI: data/poi Road network

Yizong Wang 1 Sep 01, 2022