[ICML 2021] Break-It-Fix-It: Learning to Repair Programs from Unlabeled Data

Overview

Break-It-Fix-It: Learning to Repair Programs from Unlabeled Data

This repo provides the source code & data of our paper: Break-It-Fix-It: Unsupervised Learning for Program Repair (ICML 2021).

@InProceedings{yasunaga2021break,
  author =  {Michihiro Yasunaga and Percy Liang},
  title =   {Break-It-Fix-It: Unsupervised Learning for Program Repair},
  year =    {2021},  
  booktitle = {International Conference on Machine Learning (ICML)},  
}

Problem: Repair Task

Our approach: BIFI

0. Dependencies

Specifically, run the following commands to create a conda environment (assuming CUDA10.1):

conda create -n BIFI python=3.7.7
conda activate BIFI
pip install tqdm
pip install torch==1.4.0 torchvision==0.5.0
cd utils/fairseq
pip install -e .
pip numpy==1.20.1 editdistance

1. Download Data

Download all the data from here (data.zip) and unzip it (note: 67GB when compressed, 400GB when decompressed). This includes the GitHub-Python dataset, and all the processed training data and trained models associated with BIFI. If you only want the original GitHub-Python dataset, you can download it from here (data_minimal.zip; 1GB). After unzipping the data.zip, the resulting file structure will look like:

.
├── README.md
└── data/
    ├── orig_bad_code/       (GitHub-Python dataset's bad code)
    ├── orig_good_code/      (GitHub-Python dataset's good code)
    └── round0/
        ├── data_paired      (paired data used to train fixer in round0)
        └── model-fixer      (fixer trained in round0)
    ├── round1-BIFI-part1/
        ├── data_paired      (paired data used to train breaker in BIFI round1)
        └── model-breaker    (breaker trained in BIFI round1)
    ├── round1-BIFI-part2/
        ├── data_paired      (paired data used to train fixer in BIFI round1)
        └── model-fixer      (fixer trained in BIFI round1)
    ├── ...

About the GitHub-Python dataset

We collected 3 million Python3 snippets from GitHub. Using the critic (Python AST parser), the code snippets are split into a set of bad code (with AST parse errors) and a set of good code (with no errors). The set of bad code is located at data/orig_bad_code/orig.bad.json and good code at data/orig_good_code/orig.good.json. Each entry of orig.bad.json or orig.good.json is a dictionary consisting of

  • "code_string": raw code in the string format
  • "code_toks_joined": the raw code is split into tokens by Python tokenizer, anonymized (string/number is replaced with special tokens <STRING>/<NUMBER>), and then joined by whitespace. The tokenization was done by utils/code_utils.py: tokenize_python_code()
  • "anonymize_dict": mapping betweens raw string/number and <STRING>/<NUMBER> so that "code_string" can be recovered from "code_toks_joined". This recovery can be done by utils/code_utils.py: code_toks_to_code_string()
  • "err_obj": type of the error caught by the critic (e.g. unbalanced parentheses, indentation error). This is only applicable to orig.bad.json.

The bad code snippets in orig.bad.json are split into 5 chunks (orig.0.bad to orig.4.bad in data/orig_bad_code/), where 3,4 is heldout as the test set and 0,1,2 is made available for BIFI training. This splitting was done by scripts/split_orig_bad_and_good.py

2. Training and Evaluation

First, train the initial fixer by running commands in src/run-round0.py one by one. We then consider three training algorithms on top of it: BIFI (our proposed method), FixerOnly (BIFI without breaker), and BackTranslation (BT; our baseline). For each algorithm,

  • BIFI: run commands in src/run-BIFI.py one by one
  • FixerOnly: run commands in src/run-FixerOnly.py one by one
  • BT: run commands in src/run-BT.py one by one

Below is an illustration for the case of BIFI.

run-round0.sh

export PYTHONPATH=.

#Train initial fixer on synthetic paired data
python src/c001__train_fixer.py --round_name round0 --gpu_id 0 --max_epoch 2

#Run the trained fixer on the bad code (chunk 0-4) and check the outputs by critic
python src/c003__run_fixer.py   --round_name round0 --gpu_ids '0,1,2,3,4'

#Evaluate the fixer outputs on the test set (chunk 3,4)
python src/c005__eval_fixer.py  --round_name round0

run-BIFI.sh (round 1)

#Use the fixer outputs on the bad code (chunk 0,1,2) to get new paired data (Equation 6 in the paper)
python src/c006__generate_paired_data_from_fixer.py --round_name round0 --out_round_name round1-BIFI-part1

#Train breaker on the new paired data (Equation 7 in the paper)
python src/c002__train_breaker.py --round_name round1-BIFI-part1 --gpu_id 0 --max_epoch 3

#Run the trained breaker on the good code and get new paired data (Equation 8 in the paper)
python src/c004__run_breaker.py   --round_name round1-BIFI-part1 --gpu_ids '0,1,2,3,4'
python src/c007__generate_paired_data_from_breaker.py --round_name round1-BIFI-part1 --out_round_name round1-BIFI-part2

#Train fixer on the new paired data (Equation 9 in the paper)
python src/c001__train_fixer.py --round_name round1-BIFI-part2 --gpu_id 0 --max_epoch 2 --continue_from 'data/round0/model-fixer/checkpoint.pt'

#Run the trained fixer on the bad code (chunk 0-4) and check the outputs by critic
python src/c003__run_fixer.py   --round_name round1-BIFI-part2 --gpu_ids '0,1,2,3,4'

#Evaluate the fixer outputs on the test set (chunk 3,4)
python src/c005__eval_fixer.py  --round_name round1-BIFI-part2

This is repeated similarly for round 2.

Owner
Michihiro Yasunaga
PhD Student in Computer Science
Michihiro Yasunaga
Dataset and Code for the paper "DepthTrack: Unveiling the Power of RGBD Tracking" (ICCV2021), and "Depth-only Object Tracking" (BMVC2021)

DeT and DOT Code and datasets for "DepthTrack: Unveiling the Power of RGBD Tracking" (ICCV2021) "Depth-only Object Tracking" (BMVC2021) @InProceedings

Yan Song 55 Dec 15, 2022
Angular & Electron desktop UI framework. Angular components for native looking and behaving macOS desktop UI (Electron/Web)

Angular Desktop UI This is a collection for native desktop like user interface components in Angular, especially useful for Electron apps. It starts w

Marc J. Schmidt 49 Dec 22, 2022
This is the official repository of Music Playlist Title Generation: A Machine-Translation Approach.

PlyTitle_Generation This is the official repository of Music Playlist Title Generation: A Machine-Translation Approach. The paper has been accepted by

SeungHeonDoh 6 Jan 03, 2022
CLADE - Efficient Semantic Image Synthesis via Class-Adaptive Normalization (TPAMI 2021)

Efficient Semantic Image Synthesis via Class-Adaptive Normalization (Accepted by TPAMI)

tzt 49 Nov 17, 2022
Framework for evaluating ANNS algorithms on billion scale datasets.

Billion-Scale ANN http://big-ann-benchmarks.com/ Install The only prerequisite is Python (tested with 3.6) and Docker. Works with newer versions of Py

Harsha Vardhan Simhadri 132 Dec 24, 2022
Nonuniform-to-Uniform Quantization: Towards Accurate Quantization via Generalized Straight-Through Estimation. In CVPR 2022.

Nonuniform-to-Uniform Quantization This repository contains the training code of N2UQ introduced in our CVPR 2022 paper: "Nonuniform-to-Uniform Quanti

Zechun Liu 60 Dec 28, 2022
The AugNet Python module contains functions for the fast computation of image similarity.

AugNet AugNet: End-to-End Unsupervised Visual Representation Learning with Image Augmentation arxiv link In our work, we propose AugNet, a new deep le

Ming 74 Dec 28, 2022
The official repository for Deep Image Matting with Flexible Guidance Input

FGI-Matting The official repository for Deep Image Matting with Flexible Guidance Input. Paper: https://arxiv.org/abs/2110.10898 Requirements easydict

Hang Cheng 51 Nov 10, 2022
Official Repsoitory for "Mish: A Self Regularized Non-Monotonic Neural Activation Function" [BMVC 2020]

Mish: Self Regularized Non-Monotonic Activation Function BMVC 2020 (Official Paper) Notes: (Click to expand) A considerably faster version based on CU

Xa9aX ツ 1.2k Dec 29, 2022
ALL Snow Removed: Single Image Desnowing Algorithm Using Hierarchical Dual-tree Complex Wavelet Representation and Contradict Channel Loss (HDCWNet)

ALL Snow Removed: Single Image Desnowing Algorithm Using Hierarchical Dual-tree Complex Wavelet Representation and Contradict Channel Loss (HDCWNet) (

Wei-Ting Chen 49 Dec 27, 2022
Predicting future trajectories of people in cameras of novel scenarios and views.

Pedestrian Trajectory Prediction Predicting future trajectories of pedestrians in cameras of novel scenarios and views. This repository contains the c

8 Sep 03, 2022
Machine Learning with JAX Tutorials

The purpose of this repo is to make it easy to get started with JAX. It contains my "Machine Learning with JAX" series of tutorials (YouTube videos and Jupyter Notebooks) as well as the content I fou

Aleksa Gordić 372 Dec 28, 2022
A simple and useful implementation of LPIPS.

lpips-pytorch Description Developing perceptual distance metrics is a major topic in recent image processing problems. LPIPS[1] is a state-of-the-art

So Uchida 121 Dec 24, 2022
particle tracking model, works with the ROMS output file(qck.nc, his.nc)

particle-tracking-model-for-ROMS particle tracking model, works with the ROMS output file(qck.nc, his.nc) description this is a 2-dimensional particle

xusheng 1 Jan 11, 2022
MASA-SR: Matching Acceleration and Spatial Adaptation for Reference-Based Image Super-Resolution (CVPR2021)

MASA-SR Official PyTorch implementation of our CVPR2021 paper MASA-SR: Matching Acceleration and Spatial Adaptation for Reference-Based Image Super-Re

DV Lab 126 Dec 20, 2022
Implementation of the 😇 Attention layer from the paper, Scaling Local Self-Attention For Parameter Efficient Visual Backbones

HaloNet - Pytorch Implementation of the Attention layer from the paper, Scaling Local Self-Attention For Parameter Efficient Visual Backbones. This re

Phil Wang 189 Nov 22, 2022
Resilient projection-based consensus actor-critic (RPBCAC) algorithm

Resilient projection-based consensus actor-critic (RPBCAC) algorithm We implement the RPBCAC algorithm with nonlinear approximation from [1] and focus

Martin Figura 5 Jul 12, 2022
Temporally Coherent GAN SIGGRAPH project.

TecoGAN This repository contains source code and materials for the TecoGAN project, i.e. code for a TEmporally COherent GAN for video super-resolution

Duc Linh Nguyen 2 Jan 18, 2022
N-RPG - Novel role playing game da turfu

N-RPG Ce README sera la page de garde du projet. Contenu Il contiendra la présen

4 Mar 15, 2022
Unofficial pytorch implementation of paper "One-Shot Free-View Neural Talking-Head Synthesis for Video Conferencing"

One-Shot Free-View Neural Talking Head Synthesis Unofficial pytorch implementation of paper "One-Shot Free-View Neural Talking-Head Synthesis for Vide

ZLH 406 Dec 23, 2022