Official code of our work, AVATAR: A Parallel Corpus for Java-Python Program Translation.

Related tags

Deep LearningAVATAR
Overview

AVATAR

  • Official code of our work, AVATAR: A Parallel Corpus for Java-Python Program Translation.
  • AVATAR stands for jAVA-pyThon progrAm tRanslation.
  • AVATAR is a corpus of 8,475 programming problems and their solutions written in Java and Python.
  • Supervised fine-tuning and evaluation in terms of Computational Accuracy, see details here.

Table of Contents

Dataset

We have collected the programming problems and their solutions from competitive programming sites, online platforms, and open source repositories. We list the sources below.

  • CodeForces
  • AtCoder
  • CodeJam
  • GeeksforGeeks
  • LeetCode
  • ProjectEuler

Data collected can be downloaded by following:

cd data
bash download.sh

To prepare the data, we perform the following steps.

  • Removing docstrings, comments, etc.
  • Use baseline models' tokenizer to perform tokenization.
  • Filter data based on length threshold (~512).
  • Perform de-duplication. (remove examples that are duplicates)

To perform the preparation, run:

cd data
bash prepare.sh

Models

We studied 8 models for program translation.

Models trained from scratch

Pre-trained models

Training & Evaluation

To train and evaluate a model, go to the corresponding model directory and execute the run.sh script.

# Seq2Seq+Attn.
cd seq2seq
bash rnn.sh GPU_ID LANG1 LANG2

# Transformer
cd seq2seq
bash transformer.sh GPU_ID LANG1 LANG2

# CodeGPT
cd codegpt
bash run.sh GPU_ID LANG1 LANG2 CodeGPT

# CodeGPT-adapted
cd codegpt
bash run.sh GPU_ID LANG1 LANG2

# CodeBERT
cd codebert
bash run.sh GPU_ID LANG1 LANG2

# GraphCoderBERT
cd graphcodebert
bash run.sh GPU_ID LANG1 LANG2

# PLBART
cd plbart
# fine-tuning either for Java->Python or Python-Java
bash run.sh GPU_ID LANG1 LANG2
# multilingual fine-tuning
bash multilingual.sh GPU_ID

# Naive Copy
cd naivecopy
bash run.sh
  • Here, LANG1 LANG2=Java Python or LANG1 LANG2=Python Java.
  • Download pre-trained PLBART, GraphCodeBERT, and Transcoder model files by running download.sh script.
  • We trained the models on GeForce RTX 2080 ti GPUs (11019MiB).

Benchmarks

We evaluate the models' performances on the test set in terms of Compilation Accuracy (CA), BLEU, Syntax Match (SM), Dataflow Match (DM), CodeBLEU (CB), Exact Match (EM). We report the model performances below.

Training Models Java to Python Python to Java
CA BLEU SM DM CB EM CA BLEU SM DM CB EM
None Naive Copy - 23.4 - - - 0.0 - 26.9 - - - 0.0
TransCoder 76.9 36.8 31.0 17.1 29.1 0.1 100 49.4 37.6 18.5 31.9 0.0
TC-DOBF 77.7 43.4 29.7 33.9 34.8 0.0 100 46.1 36.0 12.6 28.8 0.0
From Scratch Seq2Seq+Attn. 66.5 56.3 39.1 18.4 37.9 1.0 71.8 62.7 46.6 28.5 43.0 0.8
Transformer 61.5 38.9 34.2 16.5 29.1 0.0 67.4 45.6 45.7 26.4 37.4 0.1
Pre-trained CodeGPT 47.3 38.2 32.5 11.5 26.1 1.1 71.2 44.0 38.8 26.7 33.8 0.1
CodeGPT-adapted 48.1 38.2 32.5 12.1 26.2 1.2 68.6 42.4 37.2 27.2 33.1 0.5
CodeBERT 62.3 59.3 37.7 16.2 36.7 0.5 74.7 55.3 38.4 22.5 36.1 0.6
GraphCodeBERT 65.7 59.7 38.9 16.4 37.1 0.7 57.2 60.6 48.4 20.6 40.1 0.4
PLBARTmono 76.4 67.1 42.6 19.3 43.3 2.4 34.4 69.1 57.1 34.0 51.4 1.2
PLBARTmulti 70.4 67.1 42.0 17.6 42.4 2.4 30.8 69.4 56.6 34.5 51.8 1.0

License

This dataset is licensed under a Creative Commons Attribution-ShareAlike 4.0 International license, see the LICENSE file for details.

Citation

@article{ahmad-etal-2021-avatar,
  title={AVATAR: A Parallel Corpus for Java-Python Program Translation},
  author={Ahmad, Wasi Uddin and Tushar, Md Golam Rahman and Chakraborty, Saikat and Chang, Kai-Wei},
  journal={arXiv preprint arXiv:2108.11590},
  year={2021}
}
Owner
Wasi Ahmad
I am a Ph.D. student in CS at UCLA.
Wasi Ahmad
Simple renderer for use with MuJoCo (>=2.1.2) Python Bindings.

Viewer for MuJoCo in Python Interactive renderer to use with the official Python bindings for MuJoCo. Starting with version 2.1.2, MuJoCo comes with n

Rohan P. Singh 62 Dec 30, 2022
Code implementation from my Medium blog post: [Transformers from Scratch in PyTorch]

transformer-from-scratch Code for my Medium blog post: Transformers from Scratch in PyTorch Note: This Transformer code does not include masked attent

Frank Odom 27 Dec 21, 2022
A Low Complexity Speech Enhancement Framework for Full-Band Audio (48kHz) based on Deep Filtering.

DeepFilterNet A Low Complexity Speech Enhancement Framework for Full-Band Audio (48kHz) based on Deep Filtering. libDF contains Rust code used for dat

Hendrik Schröter 292 Dec 25, 2022
Repository for MeshTalk supplemental material and code once the (already approved) 16 GHS captures our lab will make publicly available are released.

meshtalk This repository contains code to run MeshTalk for face animation from audio. If you use MeshTalk, please cite @inproceedings{richard2021mesht

Meta Research 221 Jan 06, 2023
Code for the paper "JANUS: Parallel Tempered Genetic Algorithm Guided by Deep Neural Networks for Inverse Molecular Design"

JANUS: Parallel Tempered Genetic Algorithm Guided by Deep Neural Networks for Inverse Molecular Design This repository contains code for the paper: JA

Aspuru-Guzik group repo 55 Nov 29, 2022
Label Hallucination for Few-Shot Classification

Label Hallucination for Few-Shot Classification This repo covers the implementation of the following paper: Label Hallucination for Few-Shot Classific

Yiren Jian 13 Nov 13, 2022
Tensorflow 2 Object Detection API kurulumu, GPU desteği, custom model hazırlama

Tensorflow 2 Object Detection API Bu tutorial, TensorFlow 2.x'in kararlı sürümü olan TensorFlow 2.3'ye yöneliktir. Bu, görüntülerde / videoda nesne a

46 Nov 20, 2022
UniLM AI - Large-scale Self-supervised Pre-training across Tasks, Languages, and Modalities

Pre-trained (foundation) models across tasks (understanding, generation and translation), languages (100+ languages), and modalities (language, image, audio, vision + language, audio + language, etc.

Microsoft 7.6k Jan 01, 2023
Tracing Versus Freehand for Evaluating Computer-Generated Drawings (SIGGRAPH 2021)

Tracing Versus Freehand for Evaluating Computer-Generated Drawings (SIGGRAPH 2021) Zeyu Wang, Sherry Qiu, Nicole Feng, Holly Rushmeier, Leonard McMill

Zach Zeyu Wang 23 Dec 09, 2022
Model-based reinforcement learning in TensorFlow

Bellman Website | Twitter | Documentation (latest) What does Bellman do? Bellman is a package for model-based reinforcement learning (MBRL) in Python,

46 Nov 09, 2022
PICK: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks

Code for the paper "PICK: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks" (ICPR 2020)

Wenwen Yu 498 Dec 24, 2022
Visual odometry package based on hardware-accelerated NVIDIA Elbrus library with world class quality and performance.

Isaac ROS Visual Odometry This repository provides a ROS2 package that estimates stereo visual inertial odometry using the Isaac Elbrus GPU-accelerate

NVIDIA Isaac ROS 343 Jan 03, 2023
FNet Implementation with TensorFlow & PyTorch

FNet Implementation with TensorFlow & PyTorch. TensorFlow & PyTorch implementation of the paper "FNet: Mixing Tokens with Fourier Transforms". Overvie

Abdelghani Belgaid 1 Feb 12, 2022
The official repository for Deep Image Matting with Flexible Guidance Input

FGI-Matting The official repository for Deep Image Matting with Flexible Guidance Input. Paper: https://arxiv.org/abs/2110.10898 Requirements easydict

Hang Cheng 51 Nov 10, 2022
[CVPR 2021] Official PyTorch Implementation for "Iterative Filter Adaptive Network for Single Image Defocus Deblurring"

IFAN: Iterative Filter Adaptive Network for Single Image Defocus Deblurring Checkout for the demo (GUI/Google Colab)! The GUI version might occasional

Junyong Lee 173 Dec 30, 2022
Pytorch implementation for Semantic Segmentation/Scene Parsing on MIT ADE20K dataset

Semantic Segmentation on MIT ADE20K dataset in PyTorch This is a PyTorch implementation of semantic segmentation models on MIT ADE20K scene parsing da

MIT CSAIL Computer Vision 4.5k Jan 08, 2023
The fastai book, published as Jupyter Notebooks

English / Spanish / Korean / Chinese / Bengali / Indonesian The fastai book These notebooks cover an introduction to deep learning, fastai, and PyTorc

fast.ai 17k Jan 07, 2023
FG-transformer-TTS Fine-grained style control in transformer-based text-to-speech synthesis

LST-TTS Official implementation for the paper Fine-grained style control in transformer-based text-to-speech synthesis. Submitted to ICASSP 2022. Audi

Li-Wei Chen 64 Dec 30, 2022
GLNet for Memory-Efficient Segmentation of Ultra-High Resolution Images

GLNet for Memory-Efficient Segmentation of Ultra-High Resolution Images Collaborative Global-Local Networks for Memory-Efficient Segmentation of Ultra-

VITA 298 Dec 12, 2022
MultiTaskLearning - Multi Task Learning for 3D segmentation

Multi Task Learning for 3D segmentation Perception stack of an Autonomous Drivin

2 Sep 22, 2022