Reference implementation of code generation projects from Facebook AI Research. General toolkit to apply machine learning to code, from dataset creation to model training and evaluation. Comes with pretrained models.

Related tags

Deep LearningCodeGen
Overview

This repository is a toolkit to do machine learning for programming languages. It implements tokenization, dataset preprocessing, model training and model evaluation.

We provide reference implementations of the following papers:

We also provide pre-trained models for language modeling, translation and deobfuscation.

Dependencies

Run install_env.sh. We use black code formatter.

Data

Source code processors

This repository contains programming languages processors for C++, Java and Python. These processors include:

  • tokenization and detokenization
  • obfuscation
  • function extractions

These processors are based on TreeSitter parsers. As these parsers are available in more than 30 programming languages, one can easily create a new programming language processor.

Example of code tokenization:

from codegen_sources.preprocessing.lang_processors.java_processor import JavaProcessor

java_code = r"""class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello, World!"); 
    }
}"""
java_processor = JavaProcessor(root_folder="<YOUR_TREESITER_FOLDER>")
tokenized_java_code = java_processor.tokenize_code(java_code)
print(tokenized_java_code)

BPE

This repository provides wrappers for fast BPE and Roberta BPE at file level.

Dataset Preprocessing

This repository contains a pipeline to create programming languages datasets. Now it supports four datasets modes:

  • Monolingual (ex: Java source code)
  • Monolingual Functions (ex: Java functions)
  • Monolingual Obfuscated (ex: Obfuscated Java source code. [Details here])
  • Monolingual Obfuscated Functions (ex: Obfuscated Java functions)

First, download C++ / Java / Python source code from Google BigQuery. To run our preprocessing pipeline, you need to donwload the raw source code on your machine in a JSON format. A sample of it is given here.

The pipeline does the following:

  • Source code extraction from json (.json.gz) and tokenization (.tok)
  • Train BPE codes and vocab
  • Apply BPE (.bpe)
  • Binarization (.pth)
  • Symlink folder with appropriate file names for .pth (XLM-syml). To be given as data_path argument for training.

To run the pipeline :

python -m codegen_sources.preprocessing.preprocess \
<DATA_PATH> \                            # folder containing json.gz
--langs java cpp python  \               # languages to process
--mode monolingual_functions \           # dataset mode
--bpe_mode=fast_bpe \                    # BPE mode. by default it is fast_BPE. can be roberta_bpe
--local=True \                           # Run on your local machine if True. If False run on a cluster (requires submitit setup)
--train_splits=1                         # Number of trainings splits

If you give several languages, the BPE codes and vocab will be learned commonly on these languages , so that you will have a common vocabulary to train one model for several languages. If you do not want that, launch the pipeline on every language separatly. These tests test the pipeline on different modes. It will give you an overview of the possible options.

Also, we provide the BPE codes and vocabulary here. These are the codes and vocabulary used for TransCoder and DOBF. They were learned on concatenated C++, Java, and Python data. If you want to use them instead of learning new ones, give the corresponding paths as fastbpe_code_path and fastbpe_vocab_path arguments.

In TransCoder and DOBF readmes, we provide the commands to preprocess the respective datasets.

Model

Overview

In this repository, we provide code to train transformer-based models (code based on XLM repository). The available training tasks are the following:

  • Masked Language Model (MLM)
  • Causal Language Model (CLM)
  • Supervised Machine translation (MT)
  • Classification
  • Deobfuscation = DOBF
  • Unsupervised Machine translation = TransCoder (Denoising auto encoding AE + Back Translation BT)

We evaluate our models with metrics adapted to each task (e.g. computation accuracy and BLEU score for TransCoder, subtoken score for Deobfuscation).

Also, we provide wrappers to fine-tune and evaluate our models on CodeXGLUE benchmark.

Download models

You can donwload the following models :

Re train specific models

To have details on how to retrain specific models, please refer to the README specific to each model.

References

TransCoder model (NeurIPS 2020)

[1] B. Roziere*, M.A. Lachaux*, L. Chanussot, G. Lample Unsupervised Translation of Programming Languages.

@article{roziere2020unsupervised,
  title={Unsupervised translation of programming languages},
  author={Roziere, Baptiste and Lachaux, Marie-Anne and Chanussot, Lowik and Lample, Guillaume},
  journal={Advances in Neural Information Processing Systems},
  volume={33},
  year={2020}
}

DOBF

[2] B. Roziere*, M.A. Lachaux*, M. Szafraniec , G. Lample DOBF: A Deobfuscation Pre-Training Objective for Programming Languages.

@article{roziere2021dobf,
  title={DOBF: A Deobfuscation Pre-Training Objective for Programming Languages},
  author={Roziere, Baptiste and Lachaux, Marie-Anne and Szafraniec, Marc and Lample, Guillaume},
  journal={arXiv preprint arXiv:2102.07492},
  year={2021}
}

* Equal Contribution

License

CodeGen is under the license detailed in the Creative Commons Attribution-NonCommercial 4.0 International license. See LICENSE for more details.

Owner
Facebook Research
Facebook Research
Pytorch implementation for reproducing StackGAN_v2 results in the paper StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks

StackGAN-v2 StackGAN-v1: Tensorflow implementation StackGAN-v1: Pytorch implementation Inception score evaluation Pytorch implementation for reproduci

Han Zhang 809 Dec 16, 2022
Dataset VSD4K includes 6 popular categories: game, sport, dance, vlog, interview and city.

CaFM-pytorch ICCV ACCEPT Introduction of dataset VSD4K Our dataset VSD4K includes 6 popular categories: game, sport, dance, vlog, interview and city.

96 Jul 05, 2022
Pytorch implementation for "Density-aware Chamfer Distance as a Comprehensive Metric for Point Cloud Completion" (NeurIPS 2021)

Density-aware Chamfer Distance This repository contains the official PyTorch implementation of our paper: Density-aware Chamfer Distance as a Comprehe

Tong WU 93 Dec 15, 2022
iBOT: Image BERT Pre-Training with Online Tokenizer

Image BERT Pre-Training with iBOT Official PyTorch implementation and pretrained models for paper iBOT: Image BERT Pre-Training with Online Tokenizer.

Bytedance Inc. 435 Jan 06, 2023
Source codes of CenterTrack++ in 2021 ICME Workshop on Big Surveillance Data Processing and Analysis

MOT Tracked object bounding box association (CenterTrack++) New association method based on CenterTrack. Two new branches (Tracked Size and IOU) are a

36 Oct 04, 2022
The dataset of tweets pulling from Twitters with keyword: Hydroxychloroquine, location: US, Time: 2020

HCQ_Tweet_Dataset: FREE to Download. Keywords: HCQ, hydroxychloroquine, tweet, twitter, COVID-19 This dataset is associated with the paper "Understand

2 Mar 16, 2022
Official PyTorch Implementation of HELP: Hardware-adaptive Efficient Latency Prediction for NAS via Meta-Learning (NeurIPS 2021 Spotlight)

[NeurIPS 2021 Spotlight] HELP: Hardware-adaptive Efficient Latency Prediction for NAS via Meta-Learning [Paper] This is Official PyTorch implementatio

42 Nov 01, 2022
Video Matting via Consistency-Regularized Graph Neural Networks

Video Matting via Consistency-Regularized Graph Neural Networks Project Page | Real Data | Paper Installation Our code has been tested on Python 3.7,

41 Dec 26, 2022
Generalized hybrid model for mode-locked laser diodes with an extended passive cavity

GenHybridMLLmodel Generalized hybrid model for mode-locked laser diodes with an extended passive cavity This hybrid simulation strategy combines a tra

Stijn Cuyvers 3 Sep 21, 2022
Code for "Unsupervised Layered Image Decomposition into Object Prototypes" paper

DTI-Sprites Pytorch implementation of "Unsupervised Layered Image Decomposition into Object Prototypes" paper Check out our paper and webpage for deta

40 Dec 22, 2022
A spatial genome aligner for analyzing multiplexed DNA-FISH imaging data.

jie jie is a spatial genome aligner. This package parses true chromatin imaging signal from noise by aligning signals to a reference DNA polymer model

Bojing Jia 9 Sep 29, 2022
[CVPR 2022 Oral] Versatile Multi-Modal Pre-Training for Human-Centric Perception

Versatile Multi-Modal Pre-Training for Human-Centric Perception Fangzhou Hong1  Liang Pan1  Zhongang Cai1,2,3  Ziwei Liu1* 1S-Lab, Nanyang Technologic

Fangzhou Hong 96 Jan 03, 2023
Official implementation of CVPR2020 paper "Deep Generative Model for Robust Imbalance Classification"

Deep Generative Model for Robust Imbalance Classification Deep Generative Model for Robust Imbalance Classification Xinyue Wang, Yilin Lyu, Liping Jin

9 Nov 01, 2022
Relative Positional Encoding for Transformers with Linear Complexity

Stochastic Positional Encoding (SPE) This is the source code repository for the ICML 2021 paper Relative Positional Encoding for Transformers with Lin

Antoine Liutkus 48 Nov 16, 2022
PyTorch code for the paper "Curriculum Graph Co-Teaching for Multi-target Domain Adaptation" (CVPR2021)

PyTorch code for the paper "Curriculum Graph Co-Teaching for Multi-target Domain Adaptation" (CVPR2021) This repo presents PyTorch implementation of M

Evgeny 79 Dec 19, 2022
Contextual Attention Network: Transformer Meets U-Net

Contextual Attention Network: Transformer Meets U-Net Contexual attention network for medical image segmentation with state of the art results on skin

Reza Azad 67 Nov 28, 2022
Catalyst.Detection

Accelerated DL R&D PyTorch framework for Deep Learning research and development. It was developed with a focus on reproducibility, fast experimentatio

Catalyst-Team 12 Oct 25, 2021
Feed forward VQGAN-CLIP model, where the goal is to eliminate the need for optimizing the latent space of VQGAN for each input prompt

Feed forward VQGAN-CLIP model, where the goal is to eliminate the need for optimizing the latent space of VQGAN for each input prompt. This is done by

Mehdi Cherti 135 Dec 30, 2022
a pytorch implementation of auto-punctuation learned character by character

Learning Auto-Punctuation by Reading Engadget Articles Link to Other of my work 🌟 Deep Learning Notes: A collection of my notes going from basic mult

Ge Yang 137 Nov 09, 2022
Image-retrieval-baseline - MUGE Multimodal Retrieval Baseline

MUGE Multimodal Retrieval Baseline This repo is implemented based on the open_cl

47 Dec 16, 2022