Self-Supervised Document-to-Document Similarity Ranking via Contextualized Language Models and Hierarchical Inference

Last update: Nov 28, 2022

Overview

Self-Supervised Document Similarity Ranking (SDR) via Contextualized Language Models and Hierarchical Inference

This repo is the implementation for SDR.

Tested environment

Python 3.7
PyTorch 1.7
CUDA 11.0

Lower CUDA and PyTorch versions should work as well.
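To compare a local setup against the tested environment above, a small check like the following may help. It is only a sketch: `parse_version` and `describe_environment` are hypothetical helper names, not part of this repository, and the script simply reports what it finds rather than enforcing any minimum version.

```python
import itertools
import platform


def parse_version(text):
    """Turn a string such as '1.7.1+cu110' into a tuple such as (1, 7, 1)."""
    parts = []
    # Drop any local build suffix ('+cu110'), then read leading digits per field.
    for piece in text.split("+")[0].split("."):
        digits = "".join(itertools.takewhile(str.isdigit, piece))
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def describe_environment():
    """Report Python, PyTorch and CUDA versions; missing pieces are None."""
    report = {"python": parse_version(platform.python_version())}
    try:
        import torch  # optional: the check still works if PyTorch is absent
    except ImportError:
        report["torch"] = None
        report["cuda"] = None
        return report
    report["torch"] = parse_version(torch.__version__)
    # torch.version.cuda is None on CPU-only builds.
    cuda = torch.version.cuda
    report["cuda"] = parse_version(cuda) if cuda else None
    return report


if __name__ == "__main__":
    for name, version in describe_environment().items():
        print(name, version)
```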

Installation
Datasets
Train with our datasets
Hierarchical Inference
Cite

The license, security, support, and code of conduct specifications are under the instructions directory.

Installation

Run

bash instructions/installation.sh

Datasets

The published datasets are:

Video games
- 21,935 articles
- Expert-annotated test set: 90 articles with 12 ground-truth recommendations.
- Examples:
  - Grand Theft Auto - Mafia
  - Burnout Paradise - Forza Horizon 3
Wines
- 1,635 articles
- Test set crafted by a human sommelier: 92 articles with ~10 ground-truth recommendations.
- Examples:
  - Pinot Meunier - Chardonnay
  - Dom Pérignon - Moët & Chandon

For more details and direct download see Wines and Video Games.

Training

The training process downloads the datasets automatically.

python sdr_main.py --dataset_name video_games

The code is based on PyTorch-Lightning, so all PL hyperparameters are supported (limit_train/val/test_batches, check_val_every_n_epoch, etc.).
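As an illustration of passing PL hyperparameters, the sketch below assembles a training command with two of the options named above. It assumes the options are given as `--name value` command-line flags, which is how PyTorch-Lightning Trainer arguments are commonly exposed but is not confirmed here. `build_train_command` is a hypothetical helper, and the values 0.1 and 5 are arbitrary examples.

```python
import shlex


def build_train_command(dataset_name, **trainer_flags):
    """Return the argument list for a training run (nothing is executed)."""
    command = ["python", "sdr_main.py", "--dataset_name", dataset_name]
    for name, value in trainer_flags.items():
        # Assumption: each PL hyperparameter maps to a '--name value' flag.
        command += ["--" + name, str(value)]
    return command


def as_shell(command):
    """Quote the arguments so the line can be pasted into a shell."""
    return " ".join(shlex.quote(arg) for arg in command)


if __name__ == "__main__":
    cmd = build_train_command(
        "video_games",
        limit_train_batches=0.1,    # use a fraction of the training batches
        check_val_every_n_epoch=5,  # validate every fifth epoch
    )
    print(as_shell(cmd))
```

Printing the command instead of running it makes it easy to inspect the flags before starting a long training job.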

Tensorboard support

All metrics are logged automatically and stored in

SDR/output/document_similarity/SDR/arch_SDR/dataset_name_<dataset>/<time_of_run>

Run tensorboard --logdir=<path> to see the logs.
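To locate the most recent run directory for a dataset, something along these lines could be used. This is a sketch: `run_root` and `latest_run` are hypothetical helpers, the base path is copied from the pattern above, and because the format of `<time_of_run>` is not specified here the newest run is chosen by modification time rather than by parsing the directory name.

```python
from pathlib import Path

# Base of the log directory pattern quoted above.
DEFAULT_BASE = "SDR/output/document_similarity/SDR/arch_SDR"


def run_root(dataset_name, base=DEFAULT_BASE):
    """Directory that holds one sub-directory per run of a dataset."""
    return Path(base) / ("dataset_name_" + dataset_name)


def latest_run(dataset_name, base=DEFAULT_BASE):
    """Most recently modified run directory, or None if there is none."""
    root = run_root(dataset_name, base)
    if not root.is_dir():
        return None
    runs = [path for path in root.iterdir() if path.is_dir()]
    if not runs:
        return None
    return max(runs, key=lambda path: path.stat().st_mtime)


if __name__ == "__main__":
    run = latest_run("video_games")
    if run is None:
        print("No runs found yet.")
    else:
        print("tensorboard --logdir=" + str(run))
```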

Inference

The hierarchical inference described in the paper is implemented as a stand-alone service and can be used with any backbone algorithm (models/reco/hierarchical_reco.py).

python sdr_main.py --dataset_name <name> --resume_from_checkpoint <checkpoint> --test_only
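For intuition about what a hierarchical, backbone-agnostic scoring step can look like, here is a simplified sketch. It is not the code in models/reco/hierarchical_reco.py: it assumes each document is already a list of paragraphs, each paragraph a list of sentence embedding vectors from any backbone, and it assumes a max-then-mean aggregation at both levels with no normalization. All function names are hypothetical, and the exact procedure should be checked against the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def paragraph_similarity(src_par, cand_par):
    """Mean, over source sentences, of the best-matching candidate sentence."""
    best = [max(cosine(s, c) for c in cand_par) for s in src_par]
    return sum(best) / len(best)


def document_similarity(src_doc, cand_doc):
    """Mean, over source paragraphs, of the best-matching candidate paragraph."""
    best = [max(paragraph_similarity(p, q) for q in cand_doc) for p in src_doc]
    return sum(best) / len(best)


def rank(src_doc, candidates):
    """Candidate names sorted from most to least similar to the source."""
    scores = {name: document_similarity(src_doc, doc)
              for name, doc in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # Toy 2-d "sentence embeddings": one paragraph with two sentences each.
    source = [[[1.0, 0.0], [0.0, 1.0]]]
    candidates = {
        "near": [[[1.0, 0.0], [0.0, 1.0]]],
        "far": [[[-1.0, 0.0], [0.0, -1.0]]],
    }
    print(rank(source, candidates))
```

Because the scoring only consumes sentence vectors, any model that produces them could be plugged in ahead of it, which is the sense in which such a step is independent of the backbone.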

Results

Citing & Authors

If you find this repository or the annotated datasets helpful, feel free to cite our publication -

SDR: Self-Supervised Document-to-Document Similarity Ranking via Contextualized Language Models and Hierarchical Inference

@misc{ginzburg2021selfsupervised,
     title={Self-Supervised Document Similarity Ranking via Contextualized Language Models and Hierarchical Inference}, 
     author={Dvir Ginzburg and Itzik Malkiel and Oren Barkan and Avi Caciularu and Noam Koenigstein},
     year={2021},
     eprint={2106.01186},
     archivePrefix={arXiv},
     primaryClass={cs.CL}
}

Contact: Dvir Ginzburg, Itzik Malkiel.

Owner

Microsoft
