Cold Brew: Distilling Graph Node Representations with Incomplete or Missing Neighborhoods

Overview

This repository contains the implementation of Cold Brew, a distillation approach for learning node representations on graphs where some nodes have incomplete or entirely missing neighborhoods.

Introduction

Graph Neural Networks (GNNs) have demonstrated superior performance in node classification and regression tasks and have emerged as the state of the art in several applications. However, (inductive) GNNs require the edge connectivity structure of nodes to be known beforehand to work well. This is often not the case in many practical applications, where node degrees follow power-law distributions and nodes with few connections may have noisy edges. An extreme case is the strict cold start (SCS) problem, where no neighborhood information is available at all, forcing prediction models to rely completely on node features. To study the viability of using inductive GNNs to solve the SCS problem, we introduce the feature-contribution ratio (FCR), a metric that quantifies the contributions of a node's features and of its neighborhood to predicting node labels, and use this new metric as a model-selection reward. We then propose Cold Brew, a new method that, via a distillation approach, generalizes GNNs better in the SCS setting than pointwise and graph-based models. We show experimentally how FCR allows us to disentangle the contributions of the various components of graph datasets, and demonstrate the superior performance of Cold Brew on several public benchmarks.

Motivation

Long-tail degree distributions are ubiquitous in large-scale graph mining tasks. In many applications, cold-start nodes have few or no neighbors in the graph, which makes graph-based methods sub-optimal: there are not enough high-quality edges to perform message passing.


Method

We improve the teacher GNN with a Structural Embedding, and propose a student MLP model with a latent neighborhood discovery step. We also propose a metric, FCR, to quantify the difficulty of cold-start generalization.
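The following is a minimal, self-contained sketch of the latent neighborhood discovery idea (class and argument names are illustrative assumptions, not the repository's SEMLP implementation): the student MLP maps raw node features into the teacher GNN's embedding space, retrieves the top-K most similar existing nodes as "virtual neighbors" (cf. --SEMLP_topK_2_replace below), aggregates their teacher embeddings, and classifies from that aggregate.

import torch
import torch.nn as nn

class StudentMLPSketch(nn.Module):
    # Sketch only: regress teacher embeddings from raw features, then discover a
    # "virtual" neighborhood among existing nodes in the teacher embedding space.
    def __init__(self, dim_feature, dim_hidden, num_classes, teacher_embeddings, top_k=3):
        super().__init__()
        dim_teacher = teacher_embeddings.shape[1]
        self.embed = nn.Sequential(
            nn.Linear(dim_feature, dim_hidden), nn.ReLU(),
            nn.Linear(dim_hidden, dim_teacher),
        )
        self.register_buffer("teacher_emb", teacher_embeddings)  # frozen teacher embeddings
        self.top_k = top_k
        self.classify = nn.Linear(dim_teacher, num_classes)

    def forward(self, x):
        z = self.embed(x)                           # predicted teacher embedding
        sim = z @ self.teacher_emb.t()              # similarity to existing nodes
        topk = sim.topk(self.top_k, dim=1).indices  # indices of virtual neighbors
        neigh = self.teacher_emb[topk].mean(dim=1)  # aggregate the virtual neighborhood
        return self.classify(neigh)

In practice the student would also be trained to match the teacher embeddings of training nodes (e.g., with an MSE term) alongside the label loss; the repository's actual SEMLP architecture and losses may differ in detail.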


Installation Guide

The following commands install the key dependencies; the others can be installed directly via pip or conda. A full (redundant) dependency list is in requirements.txt.

pip install dgl
pip3 install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
pip install torch-scatter -f https://pytorch-geometric.com/whl/torch-1.9.0+cu111.html
pip install torch-sparse -f https://pytorch-geometric.com/whl/torch-1.9.0+cu111.html
pip install torch-geometric
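
An optional sanity check (an assumed snippet, not part of the repository) that the key dependencies import correctly after installation:

import torch
import dgl
import torch_geometric
import torch_scatter
import torch_sparse

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("dgl:", dgl.__version__)
print("torch_geometric:", torch_geometric.__version__)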

Training Guide

A full list of usable arguments, together with their default values and candidate options, is defined in options/base_options.py.

Comparison between a traditional GCN (optimized with Initial/Jumping/Dense connections, PairNorm/NodeNorm/GroupNorm, and Dropout) and Cold Brew's GNN (optimized with the Structural Embedding):

Train optimized traditional GNN:

python main.py --dataset='Cora' --train_which='TeacherGNN' --whetherHasSE='000' --want_headtail=1 --num_layers=2 --use_special_split=1 Result: 84.15

python main.py --dataset='Citeseer' --train_which='TeacherGNN' --whetherHasSE='000' --want_headtail=1 --num_layers=2 --use_special_split=1 Result: 71.00

python main.py --dataset='Pubmed' --train_which='TeacherGNN' --whetherHasSE='000' --want_headtail=1 --num_layers=2 --use_special_split=1 Result: 78.2

Training Cold Brew's Teacher GNN:

python main.py --dataset='Cora' --train_which='TeacherGNN' --whetherHasSE='100' --se_reg=32 --want_headtail=1 --num_layers=2 --use_special_split=1 Result: 85.10

python main.py --dataset='Citeseer' --train_which='TeacherGNN' --whetherHasSE='100' --se_reg=0.5 --want_headtail=1 --num_layers=2 --use_special_split=1 Result: 71.40

python main.py --dataset='Pubmed' --train_which='TeacherGNN' --whetherHasSE='111' --se_reg=0.5 --want_headtail=1 --num_layers=2 --use_special_split=1 Result: 78.2

Comparison between MLP models:

Training naive MLP:

python main.py --dataset='Cora' --train_which='StudentBaseMLP' Result on isolation split: 63.92

Training GraphMLP:

python main.py --dataset='Cora' --train_which='GraphMLP' Result on isolation split: 68.63

Training Cold Brew's MLP:

python main.py --dataset='Cora' --train_which="SEMLP" --SEMLP_topK_2_replace=3 --SEMLP_part1_arch="2layer" --dropout_MLP=0.5 --studentMLP__opt_lr='torch.optim.Adam&0.005' Result on isolation split: 69.57

Hyperparameter meanings

--whetherHasSE: whether Cold Brew's TeacherGNN uses the structural embedding. The flag is a three-character string: the first '1' means the structural embedding exists in the first layer, the second '1' means it exists in every middle layer, and the third '1' means it exists in the last layer.

--se_reg: the regularization coefficient for the Cold Brew teacher model's structural embedding.

--SEMLP_topK_2_replace: the number of top-K best virtual neighbor nodes.

--manual_assign_GPU: sets the GPU ID to train on. The default is -9999, which means dynamically choosing the GPU with the most remaining free memory (a minimal illustration of this heuristic is sketched below).
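
A minimal illustration, assuming nvidia-smi is available, of how a "pick the GPU with the most free memory" heuristic can be implemented; the repository's own selection logic may differ:

import subprocess

def pick_gpu_with_most_free_memory():
    # Query per-GPU free memory (MiB) via nvidia-smi and return the index of the
    # GPU with the most free memory; a sketch of the heuristic, not the repo's code.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"]
    ).decode()
    free_mib = [int(line) for line in out.strip().splitlines()]
    return max(range(len(free_mib)), key=lambda i: free_mib[i])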

Adaptation Guide

How to leverage this repo to train on other datasets:

In trainer.py, put any new graph dataset (node classification) under load_data() and return it.

What to load: load_data() should return a dataset namespace called 'data' with the following fields.

data.x: 2D tensor on CPU; shape = [N_nodes, dim_feature] (node features).
data.y: 1D tensor on CPU; shape = [N_nodes]; integer values indicating the class of each node.
data.edge_index: tensor on CPU; shape = [2, N_edge]; the edges contain self-loops.
data.train_mask: boolean tensor; shape = [N_nodes]; indicates the training node set.

Template class for 'data':

import torch_geometric

class MyDataset(torch_geometric.data.data.Data):
    def __init__(self):
        super().__init__()
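
A hedged example of populating this template with the fields described above; the shapes follow the description and the values are random placeholders:

import torch

# Illustrative placeholder sizes; replace with your dataset's real tensors.
N_nodes, dim_feature, num_classes, N_edge = 100, 16, 7, 400

data = MyDataset()
data.x = torch.randn(N_nodes, dim_feature)            # 2D float features, CPU
data.y = torch.randint(0, num_classes, (N_nodes,))    # integer class labels
src = torch.randint(0, N_nodes, (N_edge,))
dst = torch.randint(0, N_nodes, (N_edge,))
self_loops = torch.arange(N_nodes).repeat(2, 1)       # edge_index should contain self-loops
data.edge_index = torch.cat([torch.stack([src, dst]), self_loops], dim=1)
data.train_mask = torch.rand(N_nodes) < 0.8           # boolean mask of training nodes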

Citation

Coming soon.