Empirical Study of Transformers for Source Code & A Simple Approach for Handling Out-of-Vocabulary Identifiers in Deep Learning for Source Code

Last update: Nov 15, 2022

Overview

Transformers for variable misuse, function naming and code completion tasks

The official PyTorch implementation of:

Empirical Study of Transformers for Source Code [arxiv] (accepted to ESEC/FSE'21)
A Simple Approach for Handling Out-of-Vocabulary Identifiers in Deep Learning for Source Code [arxiv] (accepted to NAACL'21)
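
As a rough illustration of what handling out-of-vocabulary (OOV) identifiers can look like, the sketch below replaces each distinct OOV identifier in a snippet with a placeholder token. This is an illustrative assumption, not the repository's implementation: the function name `anonymize_oov`, the `VAR1`, `VAR2`, ... placeholder scheme, and the use of `str.isidentifier` to detect identifiers are all hypothetical. See the paper and the vm_fn and cc directories for the actual preprocessing.

```python
def anonymize_oov(tokens, vocab, is_identifier=str.isidentifier):
    """Replace each distinct OOV identifier with a per-snippet placeholder.

    tokens: list of string tokens for one code snippet.
    vocab: set of tokens considered in-vocabulary (hypothetical).
    Returns the rewritten token list and the identifier -> placeholder map.
    """
    mapping = {}
    out = []
    for tok in tokens:
        if is_identifier(tok) and tok not in vocab:
            # The same OOV identifier always maps to the same placeholder,
            # so repeated uses of a variable stay linked within the snippet.
            if tok not in mapping:
                mapping[tok] = f"VAR{len(mapping) + 1}"
            out.append(mapping[tok])
        else:
            out.append(tok)
    return out, mapping


tokens = ["def", "foo", "(", "my_arg", ")", ":", "return", "my_arg", "+", "i"]
anonymized, mapping = anonymize_oov(tokens, vocab={"def", "return", "i"})
print(anonymized)
```

In this toy example `foo` and `my_arg` are out of vocabulary, so both occurrences of `my_arg` receive the same placeholder while in-vocabulary tokens are left unchanged.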

The repository also contains code for resplitting the Python150k and JavaScript150k datasets: the new splits are made by repository, duplicates are removed, and the redistributable version of Py150k is used.
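
The idea behind splitting by repository with duplicate removal can be sketched as follows. This is not the code from data_utils: the record layout (dicts with `repo` and `code` keys), the function name `split_by_repo`, the split fractions, and exact-match hashing as the duplicate criterion are all assumptions made for illustration.

```python
import hashlib
import random
from collections import defaultdict


def split_by_repo(files, fractions=(0.8, 0.1, 0.1), seed=0):
    """Deduplicate files, then assign whole repositories to train/val/test.

    files: list of dicts with hypothetical keys "repo" and "code".
    fractions: share of repositories for train and val; the rest go to test.
    """
    # Drop exact duplicates (same content hash), keeping the first occurrence.
    seen, unique = set(), []
    for f in files:
        digest = hashlib.sha256(f["code"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(f)

    # Group by repository so that no repository is shared between splits.
    by_repo = defaultdict(list)
    for f in unique:
        by_repo[f["repo"]].append(f)

    repos = sorted(by_repo)
    random.Random(seed).shuffle(repos)

    n_train = int(fractions[0] * len(repos))
    n_val = int(fractions[1] * len(repos))
    parts = {
        "train": repos[:n_train],
        "val": repos[n_train:n_train + n_val],
        "test": repos[n_train + n_val:],
    }
    return {name: [f for r in rs for f in by_repo[r]] for name, rs in parts.items()}
```

Assigning whole repositories to a split, rather than individual files, keeps code from the same project out of both training and evaluation data.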

Repository structure

data_utils: scripts for downloading the Python150k and JavaScript150k datasets and producing new train / val / test splits (split by repository, with duplicates removed, using the redistributable version of Py150k)
vm_fn: code for the Variable Misuse (VM) and Function Naming (FN) tasks (additional preprocessing, models, training, etc.)
cc: code for the Code Completion (CC) task (additional preprocessing, models, training, etc.)

See the README in each directory for details.

Run

The code was tested on a system with Linux 3.10.0. Experiments were run using a Tesla V100 GPU. Required libraries are listed in requirements.txt in the vm_fn and cc directories. The implementation is based on PyTorch>=1.5.

Running experiments:

Download and resplit the data (see data_utils for details);
Preprocess the data for the task you are interested in (VM, FN or CC; see vm_fn or cc for details);
Run the experiment you are interested in (see vm_fn or cc for details).

Attribution

Parts of this code are based on the following repositories:

Citation

If you find this code useful, please cite our papers:

@misc{chirkova2020empirical,
      title={Empirical Study of Transformers for Source Code}, 
      author={Nadezhda Chirkova and Sergey Troshin},
      year={2020},
      eprint={2010.07987},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

@inproceedings{chirkova2020simple,
      title={A Simple Approach for Handling Out-of-Vocabulary Identifiers in Deep Learning for Source Code}, 
      author={Nadezhda Chirkova and Sergey Troshin},
      booktitle={North American Chapter of the Association for Computational Linguistics},
      year={2021}, 
}

Owner

Bayesian Methods Research Group
