Supervised Contrastive Learning for Product Matching

Overview

Contrastive Product Matching

This repository contains the code and data download links to reproduce the experiments of the paper "Supervised Contrastive Learning for Product Matching" by Ralph Peeters and Christian Bizer. ArXiv link. A comparison of the results to other systems using different benchmark datasets is found at Papers with Code - Entity Resolution.

  • Requirements

    Anaconda3

    Please keep in mind that the code is not optimized for portable or even non-workstation devices. Some of the scripts may require large amounts of RAM (64GB+) and GPUs. It is advised to use a powerful workstation or server when experimenting with some of the larger files.

    The code has only been used and tested on Linux (CentOS) servers.

  • Building the conda environment

    To build the exact conda environment used for the experiments, navigate to the project root folder where the file contrastive-product-matching.yml is located and run conda env create -f contrastive-product-matching.yml

    Furthermore you need to install the project as a package. To do this, activate the environment with conda activate contrastive-product-matching, navigate to the root folder of the project, and run pip install -e .

  • Downloading the raw data files

    Navigate to the src/data/ folder and run python download_datasets.py to automatically download the files into the correct locations. You can find the data at data/raw/

    If you are only interested in the separate datasets, you can download the WDC LSPC datasets and the deepmatcher splits for the abt-buy and amazon-google datasets on the respective websites.

  • Processing the data

    To prepare the data for the experiments, run the following scripts in that order. Make sure to navigate to the respective folders first.

    1. src/processing/preprocess/preprocess_corpus.py
    2. src/processing/preprocess/preprocess_ts_gs.py
    3. src/processing/preprocess/preprocess_deepmatcher_datasets.py
    4. src/processing/contrastive/prepare_data.py
    5. src/processing/contrastive/prepare_data_deepmatcher.py
  • Running the Contrastive Pre-training and Cross-entropy Fine-tuning

    Navigate to src/contrastive/

    You can find respective scripts for running the experiments of the paper in the subfolders lspc/ abtbuy/ and amazongoogle/. Note that you need to adjust the file path in these scripts for your system (replace your_path with path/to/repo).

    • Contrastive Pre-training

      To run contrastive pre-training for the abtbuy dataset for example use

      bash abtbuy/run_pretraining_clean_roberta.sh BATCH_SIZE LEARNING_RATE TEMPERATURE (AUG)

      You need to specify batch site, learning rate and temperature as arguments here. Optionally you can also apply data augmentation by passing an augmentation method as last argument (use all- for the augmentation used in the paper).

      For the WDC Computers data you need to also supply the size of the training set, e.g.

      bash lspc/run_pretraining_roberta.sh BATCH_SIZE LEARNING_RATE TEMPERATURE TRAIN_SIZE (AUG)

    • Cross-entropy Fine-tuning

      Finally, to use the pre-trained models for fine-tuning, run any of the fine-tuning scripts in the respective folders, e.g.

      bash abtbuy/run_finetune_siamese_frozen_roberta.sh BATCH_SIZE LEARNING_RATE TEMPERATURE (AUG)

      Please note, that BATCH_SIZE refers to the batch size used in pre-training. The fine-tuning batch size is locked to 64 but can be adjusted in the bash scripts if needed.

      Analogously for fine-tuning WDC computers, add the train size:

      bash lspc/run_finetune_siamese_frozen_roberta.sh BATCH_SIZE LEARNING_RATE TEMPERATURE TRAIN_SIZE (AUG)


Project based on the cookiecutter data science project template. #cookiecutterdatascience

Owner
Web-based Systems Group @ University of Mannheim
We explore technical and empirical questions concerning the development of global, decentralized information environments.
Web-based Systems Group @ University of Mannheim
🏃‍♀️ A curated list about human motion capture, analysis and synthesis.

Awesome Human Motion 🏃‍♀️ A curated list about human motion capture, analysis and synthesis. Contents Introduction Human Models Datasets Data Process

Dennis Wittchen 274 Dec 14, 2022
🔎 Monitor deep learning model training and hardware usage from your mobile phone 📱

Monitor deep learning model training and hardware usage from mobile. 🔥 Features Monitor running experiments from mobile phone (or laptop) Monitor har

labml.ai 1.2k Dec 25, 2022
Unofficial implementation of "TTNet: Real-time temporal and spatial video analysis of table tennis" (CVPR 2020)

TTNet-Pytorch The implementation for the paper "TTNet: Real-time temporal and spatial video analysis of table tennis" An introduction of the project c

Nguyen Mau Dung 438 Dec 29, 2022
Weight initialization schemes for PyTorch nn.Modules

nninit Weight initialization schemes for PyTorch nn.Modules. This is a port of the popular nninit for Torch7 by @kaixhin. ##Update This repo has been

Alykhan Tejani 69 Jan 26, 2021
Like a cowsay but without cows!

Foxsay This is a simple program that generates pictures of a cute fox with a message. It is like a cowsay but without cows! Fox girls are better! Usag

Anastasia Kim 28 Feb 20, 2022
RoadMap and preparation material for Machine Learning and Data Science - From beginner to expert.

ML-and-DataScience-preparation This repository has the goal to create a learning and preparation roadMap for Machine Learning Engineers and Data Scien

33 Dec 29, 2022
Generate high quality pictures. GAN. Generative Adversarial Networks

ESRGAN generate high quality pictures. GAN. Generative Adversarial Networks """ Super-resolution of CelebA using Generative Adversarial Networks. The

Lieon 1 Dec 14, 2021
Safe Model-Based Reinforcement Learning using Robust Control Barrier Functions

README Repository containing the code for the paper "Safe Model-Based Reinforcement Learning using Robust Control Barrier Functions". Specifically, an

Yousef Emam 13 Nov 24, 2022
JAX-based neural network library

Haiku: Sonnet for JAX Overview | Why Haiku? | Quickstart | Installation | Examples | User manual | Documentation | Citing Haiku What is Haiku? Haiku i

DeepMind 2.3k Jan 04, 2023
NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions (CVPR2021)

NExT-QA We reproduce some SOTA VideoQA methods to provide benchmark results for our NExT-QA dataset accepted to CVPR2021 (with 1 'Strong Accept' and 2

Junbin Xiao 50 Nov 24, 2022
MEND: Model Editing Networks using Gradient Decomposition

MEND: Model Editing Networks using Gradient Decomposition Setup Environment This codebase uses Python 3.7.9. Other versions may work as well. Create a

Eric Mitchell 141 Dec 02, 2022
A PyTorch implementation of "Signed Graph Convolutional Network" (ICDM 2018).

SGCN ⠀ A PyTorch implementation of Signed Graph Convolutional Network (ICDM 2018). Abstract Due to the fact much of today's data can be represented as

Benedek Rozemberczki 251 Nov 30, 2022
State-of-the-art data augmentation search algorithms in PyTorch

MuarAugment Description MuarAugment is a package providing the easiest way to a state-of-the-art data augmentation pipeline. How to use You can instal

43 Dec 12, 2022
Incorporating Transformer and LSTM to Kalman Filter with EM algorithm

Deep learning based state estimation: incorporating Transformer and LSTM to Kalman Filter with EM algorithm Overview Kalman Filter requires the true p

zshicode 57 Dec 27, 2022
MAT: Mask-Aware Transformer for Large Hole Image Inpainting

MAT: Mask-Aware Transformer for Large Hole Image Inpainting (CVPR2022, Oral) Wenbo Li, Zhe Lin, Kun Zhou, Lu Qi, Yi Wang, Jiaya Jia [Paper] News This

254 Dec 29, 2022
Official Implementation of Neural Splines

Neural Splines: Fitting 3D Surfaces with Inifinitely-Wide Neural Networks This repository contains the official implementation of the CVPR 2021 (Oral)

Francis Williams 56 Nov 29, 2022
Fully Convolutional Networks for Semantic Segmentation by Jonathan Long*, Evan Shelhamer*, and Trevor Darrell. CVPR 2015 and PAMI 2016.

Fully Convolutional Networks for Semantic Segmentation This is the reference implementation of the models and code for the fully convolutional network

Evan Shelhamer 3.2k Jan 08, 2023
An adaptive hierarchical energy management strategy for hybrid electric vehicles

An adaptive hierarchical energy management strategy This project contains the source code of an adaptive hierarchical EMS combining heuristic equivale

19 Dec 13, 2022
Official code for UnICORNN (ICML 2021)

UnICORNN (Undamped Independent Controlled Oscillatory RNN) [ICML 2021] This repository contains the implementation to reproduce the numerical experime

Konstantin Rusch 21 Dec 22, 2022
A python program to hack instagram

hackinsta a program to hack instagram Yokoback_(instahack) is the file to open, you need libraries write on import. You run that file in the same fold

2 Jan 22, 2022