VarCLR: Variable Semantic Representation Pre-training via Contrastive Learning

Last update: Oct 24, 2022

New: Paper accepted by ICSE 2022. Preprint at arXiv!

This repository contains code and pre-trained models for VarCLR, a contrastive-learning-based approach to learning semantic representations of variable names that effectively captures variable similarity, with state-of-the-art results on the IdBench benchmark.

Step 0: Install

pip install -e .
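
To confirm that the editable install is visible to the interpreter, a quick check along these lines may help. It is only a sketch: it assumes the package's import name is `varclr`, as used in the snippets below, and it reports whether that name can be found without actually loading any model.

```python
import importlib.util

# `varclr` is the import name used by the snippets in this README.
# find_spec returns None when a top-level package cannot be located.
spec = importlib.util.find_spec("varclr")
installed = spec is not None
print("varclr importable:", installed)
```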

Step 1: Load a Pre-trained VarCLR Model

from varclr.models import Encoder
model = Encoder.from_pretrained("varclr-codebert")

Step 2: VarCLR Variable Embeddings

Get embedding of one variable

emb = model.encode("squareslab")
print(emb.shape)
# torch.Size([1, 768])

Get embeddings of a list of variables (supports batching)

emb = model.encode(["squareslab", "strudel"])
print(emb.shape)
# torch.Size([2, 768])
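
Since each embedding is a plain vector, two variables can be compared directly from their rows of `emb`. The following is a minimal sketch, assuming cosine similarity is the comparison of interest; the source does not state which metric the model uses internally. The short vectors are made-up stand-ins for the 768-dimensional rows, and `cosine_similarity` is a hypothetical helper, not part of `varclr`.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Toy 4-dimensional stand-ins for rows of `emb` (real rows have 768 values).
emb_a = [1.0, 0.0, 2.0, 0.0]
emb_b = [2.0, 0.0, 4.0, 0.0]  # same direction as emb_a
emb_c = [0.0, 3.0, 0.0, 0.0]  # orthogonal to emb_a

sim_ab = cosine_similarity(emb_a, emb_b)
sim_ac = cosine_similarity(emb_a, emb_c)
print(round(sim_ab, 6), round(sim_ac, 6))
```

With a loaded model, a row such as `emb[0].tolist()` should provide a vector in this list form.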

Step 2 (continued): Get VarCLR Similarity Scores

Get similarity scores of N variable pairs

print(model.score("squareslab", "strudel"))
# [0.42812108993530273]
print(model.score(["squareslab", "average", "max", "max"], ["strudel", "mean", "min", "maximum"]))
# [0.42812108993530273, 0.8849745988845825, 0.8035818338394165, 0.889922022819519]

Get pairwise (N * M) similarity scores from two lists of variables

variable_list = ["squareslab", "strudel", "neulab"]
print(model.cross_score("squareslab", variable_list))
# [[1.0000007152557373, 0.4281214475631714, 0.7207341194152832]]
print(model.cross_score(variable_list, variable_list))
# [[1.0000007152557373, 0.4281214475631714, 0.7207341194152832],
#  [0.4281214475631714, 1.0000004768371582, 0.549992561340332],
#  [0.7207341194152832, 0.549992561340332, 1.000000238418579]]
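
The two calls differ in shape: `score` compares its inputs element by element, while `cross_score` compares every item of the first list with every item of the second. The sketch below mirrors those shapes with a toy similarity (Jaccard overlap of character bigrams) standing in for the model, then uses the N × M matrix to pick the closest candidate for each query. `toy_similarity`, `pair_scores`, `cross_scores`, and `nearest` are hypothetical names for illustration only; none of them are part of `varclr`, and the toy scores are not VarCLR scores.

```python
def bigrams(name):
    """Set of character bigrams of a lowercased name."""
    s = name.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

def toy_similarity(a, b):
    """Jaccard overlap of character bigrams; a stand-in, NOT the VarCLR model."""
    ga, gb = bigrams(a), bigrams(b)
    if not ga and not gb:
        return 1.0 if a == b else 0.0
    return len(ga & gb) / len(ga | gb)

def as_list(x):
    """Accept a single string or a list of strings."""
    return [x] if isinstance(x, str) else list(x)

def pair_scores(xs, ys):
    """N scores for N aligned pairs (the shape of the `score` output)."""
    xs, ys = as_list(xs), as_list(ys)
    if len(xs) != len(ys):
        raise ValueError("pairwise scoring needs inputs of equal length")
    return [toy_similarity(x, y) for x, y in zip(xs, ys)]

def cross_scores(xs, ys):
    """N x M matrix of scores (the shape of the `cross_score` output)."""
    xs, ys = as_list(xs), as_list(ys)
    return [[toy_similarity(x, y) for y in ys] for x in xs]

def nearest(queries, candidates):
    """For each query, the candidate with the highest score in its matrix row."""
    queries, candidates = as_list(queries), as_list(candidates)
    matrix = cross_scores(queries, candidates)
    result = {}
    for query, row in zip(queries, matrix):
        best = max(range(len(row)), key=row.__getitem__)
        result[query] = candidates[best]
    return result

queries = ["max", "average"]
candidates = ["maximum", "min", "avg_value"]
matrix = cross_scores(queries, candidates)
print(len(matrix), len(matrix[0]))  # 2 rows, 3 columns
print(nearest(queries, candidates))
```

The same row-wise argmax would apply to a matrix returned by `model.cross_score`, which is one way to look up the most similar known name for a new variable.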

Step 3: Reproduce IdBench Benchmark Results

Load the IdBench benchmark

from varclr.benchmarks import Benchmark

# Similarity on IdBench-Medium
b1 = Benchmark.build("idbench", variant="medium", metric="similarity")
# Relatedness on IdBench-Large
b2 = Benchmark.build("idbench", variant="large", metric="relatedness")

Compute VarCLR scores and evaluate

id1_list, id2_list = b1.get_inputs()
predicted = model.score(id1_list, id2_list)
print(b1.evaluate(predicted))
# {'spearmanr': 0.5248567181503295, 'pearsonr': 0.5249843473193132}

print(b2.evaluate(model.score(*b2.get_inputs())))
# {'spearmanr': 0.8012168379981921, 'pearsonr': 0.8021791703187449}
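
The dictionaries above report two correlation coefficients between the predicted scores and the benchmark's reference ratings. A minimal sketch of both statistics in plain Python follows, assuming the standard definitions: Pearson's r on the raw values, and Spearman's ρ as Pearson's r on average ranks. The source does not show how `Benchmark.evaluate` computes them, the helper names are hypothetical, and the numbers are made up for illustration rather than taken from IdBench.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Made-up model scores and reference ratings for four identifier pairs.
predicted = [0.9, 0.1, 0.5, 0.7]
reference = [0.8, 0.2, 0.3, 0.9]
report = {
    "spearmanr": spearman(predicted, reference),
    "pearsonr": pearson(predicted, reference),
}
print(report)
```

Spearman's ρ depends only on the ordering of the scores, so it rewards a model that ranks pairs the way the reference ratings do even if the raw score scale differs.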

Let's compare with the original CodeBERT

codebert = Encoder.from_pretrained("codebert")
print(b1.evaluate(codebert.score(*b1.get_inputs())))
# {'spearmanr': 0.2056582946575104, 'pearsonr': 0.1995058696927054}
print(b2.evaluate(codebert.score(*b2.get_inputs())))
# {'spearmanr': 0.3909218857993804, 'pearsonr': 0.3378219622284688}

Results on IdBench benchmarks (Spearman's rank correlation with human ratings; higher is better)

Similarity

Method	Small	Medium	Large
FT-SG	0.30	0.29	0.28
LV	0.32	0.30	0.30
FT-cbow	0.35	0.38	0.38
VarCLR-Avg	0.47	0.45	0.44
VarCLR-LSTM	0.50	0.49	0.49
VarCLR-CodeBERT	0.53	0.53	0.51

Combined-IdBench	0.48	0.59	0.57
Combined-VarCLR	0.66	0.65	0.62

Relatedness

Method	Small	Medium	Large
LV	0.48	0.47	0.48
FT-SG	0.70	0.71	0.68
FT-cbow	0.72	0.74	0.73
VarCLR-Avg	0.67	0.66	0.66
VarCLR-LSTM	0.71	0.70	0.69
VarCLR-CodeBERT	0.79	0.79	0.80

Combined-IdBench	0.71	0.78	0.79
Combined-VarCLR	0.79	0.81	0.85

Pre-train your own VarCLR models

Coming soon.

Cite

If you find VarCLR useful in your research, please cite our ICSE 2022 paper:

@misc{chen2021varclr,
      title={VarCLR: Variable Semantic Representation Pre-training via Contrastive Learning},
      author={Qibin Chen and Jeremy Lacomis and Edward J. Schwartz and Graham Neubig and Bogdan Vasilescu and Claire Le Goues},
      year={2021},
      eprint={2112.02650},
      archivePrefix={arXiv},
      primaryClass={cs.SE}
}

Owner

squaresLab
