Code for CodeT5: a new code-aware pre-trained encoder-decoder model.

Overview

CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation

This is the official PyTorch implementation for the following EMNLP 2021 paper from Salesforce Research:

Title: CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation

Authors: Yue Wang, Weishi Wang , Shafiq Joty, and Steven C.H. Hoi

CodeT5 demo

Updates

Oct 29, 2021

We release fine-tuned checkpoints for all the downstream tasks covered in the paper.

Oct 25, 2021

We release a CodeT5-base fine-tuned checkpoint (Salesforce/codet5-base-multi-sum) for multilingual code summarzation. Below is how to use this model:

from transformers import RobertaTokenizer, T5ForConditionalGeneration

if __name__ == '__main__':
    tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base')
    model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base-multi-sum')

    text = """def svg_to_image(string, size=None):
    if isinstance(string, unicode):
        string = string.encode('utf-8')
        renderer = QtSvg.QSvgRenderer(QtCore.QByteArray(string))
    if not renderer.isValid():
        raise ValueError('Invalid SVG data.')
    if size is None:
        size = renderer.defaultSize()
        image = QtGui.QImage(size, QtGui.QImage.Format_ARGB32)
        painter = QtGui.QPainter(image)
        renderer.render(painter)
    return image"""

    input_ids = tokenizer(text, return_tensors="pt").input_ids

    generated_ids = model.generate(input_ids, max_length=20)
    print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
    # this prints: "Convert a SVG string to a QImage."

Oct 18, 2021

We add a model card for CodeT5! Please reach out if you have any questions about it.

Sep 24, 2021

CodeT5 is now in hugginface!

You can simply load the model (CodeT5-small and CodeT5-base) and do the inference:

from transformers import RobertaTokenizer, T5ForConditionalGeneration

tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base')
model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base')

text = "def greet(user): print(f'hello <extra_id_0>!')"
input_ids = tokenizer(text, return_tensors="pt").input_ids

# simply generate one code span
generated_ids = model.generate(input_ids, max_length=8)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
# this prints "{user.username}"

Introduction

This repo provides the code for reproducing the experiments in CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation . CodeT5 is a new pre-trained encoder-decoder model for programming languages, which is pre-trained on 8.35M functions in 8 programming languages (Python, Java, JavaScript, PHP, Ruby, Go, C, and C#). In total, it achieves state-of-the-art results on 14 sub-tasks in a code intelligence benchmark - CodeXGLUE.

Paper link: https://arxiv.org/abs/2109.00859

Blog link: https://blog.einstein.ai/codet5/

The code currently includes two pre-trained checkpoints (CodeT5-small and CodeT5-base) and scripts to fine-tine them on 4 generation tasks ( code summarization, code generation, translation, and refinement) plus 2 understanding tasks (code defect detection and clone detection) in CodeXGLUE. We also provide their fine-tuned checkpoints to facilitate the easy replication of our paper.

In practice, CodeT5 can be deployed as an AI-powered coding assistant to boost the productivity of software developers. At Salesforce, we build an AI coding assistant demo using CodeT5 as a VS Code plugin to provide three capabilities for Apex developers:

  • Text-to-code generation: generate code based on the natural language description.
  • Code autocompletion: complete the whole function of code given the target function name.
  • Code summarization: generate the summary of a function in natural language description.

Table of Contents

  1. Citation
  2. License
  3. Dependency
  4. Download
  5. Fine-tuning
  6. Get Involved

Citation

If you find this code to be useful for your research, please consider citing.

@inproceedings{
    wang2021codet5,
    title={CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation}, 
    author={Yue Wang, Weishi Wang, Shafiq Joty, Steven C.H. Hoi},
    booktitle={Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021},
    year={2021},
}

License

The code is released under the BSD-3 License (see LICENSE.txt for details), but we also ask that users respect the following:

This software should not be used to promote or profit from:

violence, hate, and division,

environmental destruction,

abuse of human rights, or

the destruction of people's physical and mental health.

We encourage users of this software to tell us about the applications in which they are putting it to use by emailing [email protected], and to use appropriate documentation when developing high-stakes applications of this model.

Dependency

  • Pytorch 1.7.1
  • tensorboard 2.4.1
  • transformers 4.6.1
  • tree-sitter 0.2.2

Download

Instructions to download:

# pip install gsutil
cd your-cloned-codet5-path

gsutil -m cp -r "gs://sfr-codet5-data-research/pretrained_models" .
gsutil -m cp -r "gs://sfr-codet5-data-research/data" .
gsutil -m cp -r "gs://sfr-codet5-data-research/finetuned_models" .

Fine-tuning

Go to sh folder, set the WORKDIR in exp_with_args.sh to be your cloned CodeT5 repository path.

You can use run_exp.py to run a broad set of experiments by simply passing the model_tag, task, and sub_task arguments. In total, we support five models (i.e., ['roberta', 'codebert', 'bart_base', 'codet5_small', 'codet5_base']) and six tasks (i.e., ['summarize', 'concode', 'translate', 'refine', 'defect', 'clone']). For each task, we use the sub_task to specify which specific datasets to fine-tine on. Below is the full list:

--task --sub_task Description
summarize ruby/javascript/go/python/java/php code summarization task on CodeSearchNet data with six PLs
concode none text-to-code generation on Concode data
translate java-cs/cs-java code-to-code translation between Java and C#
refine small/medium code refinement on code repair data with small/medium functions
defect none code defect detection in C/C++ data
clone none code clone detection in Java data

For example, if you want to run CodeT5-base model on the code summarization task for Python, you can simply run:

python run_exp.py --model_tag codet5_base --task summarize --sub_task python

Besides, you can specify:

model_dir: where to save fine-tuning checkpoints
res_dir: where to save the performance results 
summary_dir: where to save the training curves
data_num: how many data instances to use, the default -1 is for using the full data
gpu: the index of the GPU to use in the cluster

You can also revise the suggested arguments here or directly customize the exp_with_args.sh bash file. Please refer to the argument flags in configs.py for the full available options. The saved training curves in summary_dir can be visualized using tensorboard. Note that we employ one A100 GPU for all fine-tuning experiments.

How to fine-tune on your own task and dataset?

If you want to fine-tune on your dataset, you can add your own task and sub_task in configs.py (here) and add your data path and the function to read in utils.py (here and here). The read function can be implemented in _utils.py similar to this one. If your task to add is a generation task, you can simply reuse or customize the run_gen.py. For understanding tasks, please refer to run_defect.py and run_clone.py.

Get Involved

Please create a GitHub issue if you have any questions, suggestions, requests or bug-reports. We welcome PRs!

Owner
Salesforce
A variety of vendor agnostic projects which power Salesforce
Salesforce
Must-read papers on improving efficiency for pre-trained language models.

Must-read papers on improving efficiency for pre-trained language models.

Tobias Lee 89 Jan 03, 2023
A very simple framework for state-of-the-art Natural Language Processing (NLP)

A very simple framework for state-of-the-art NLP. Developed by Humboldt University of Berlin and friends. Flair is: A powerful NLP library. Flair allo

flair 12.3k Jan 02, 2023
Switch spaces for knowledge graph embeddings

SwisE Switch spaces for knowledge graph embeddings. Requirements: python3 pytorch numpy tqdm Reproduce the results To reproduce the reported results,

Shuai Zhang 4 Dec 01, 2021
Submit issues and feature requests for our API here.

AIx GPT API Submit issues and feature requests for our API here. See https://apps.aixsolutionsgroup.com for more info. Python Quick Start pip install

AIx Solutions 7 Mar 27, 2022
🦅 Pretrained BigBird Model for Korean (up to 4096 tokens)

Pretrained BigBird Model for Korean What is BigBird • How to Use • Pretraining • Evaluation Result • Docs • Citation 한국어 | English What is BigBird? Bi

Jangwon Park 183 Dec 14, 2022
2021 2학기 데이터크롤링 기말프로젝트

공지 주제 웹 크롤링을 이용한 취업 공고 스케줄러 스케줄 주제 정하기 코딩하기 핵심 코드 설명 + 피피티 구조 구상 // 12/4 토 피피티 + 스크립트(대본) 제작 + 녹화 // ~ 12/10 ~ 12/11 금~토 영상 편집 // ~12/11 토 웹크롤러 사람인_평균

Choi Eun Jeong 2 Aug 16, 2022
DLO8012: Natural Language Processing & CSL804: Computational Lab - II

NATURAL-LANGUAGE-PROCESSING-AND-COMPUTATIONAL-LAB-II DLO8012: NLP & CSL804: CL-II [SEMESTER VIII] Syllabus NLP - Reference Books THE WALL MEGA SATISH

AMEY THAKUR 7 Apr 28, 2022
Unofficial Implementation of Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration

Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration This repo contains only model Implementation of Zero-Shot Text-to-Speech for Text

Rishikesh (ऋषिकेश) 33 Sep 22, 2022
Twitter-Sentiment-Analysis - Analysis of twitter posts' positive and negative score.

Twitter-Sentiment-Analysis The hands-on project is in Python 3 Programming class offered by University of Michigan via Coursera. The task is to build

Eszter Pai 1 Jan 03, 2022
Data loaders and abstractions for text and NLP

torchtext This repository consists of: torchtext.data: Generic data loaders, abstractions, and iterators for text (including vocabulary and word vecto

3.2k Dec 30, 2022
MicBot - MicBot uses Google Translate to speak everyone's chat messages

MicBot MicBot uses Google Translate to speak everyone's chat messages. It can al

2 Mar 09, 2022
Diaformer: Automatic Diagnosis via Symptoms Sequence Generation

Diaformer Diaformer: Automatic Diagnosis via Symptoms Sequence Generation (AAAI 2022) Diaformer is an efficient model for automatic diagnosis via symp

Junying Chen 20 Dec 13, 2022
Maha is a text processing library specially developed to deal with Arabic text.

An Arabic text processing library intended for use in NLP applications Maha is a text processing library specially developed to deal with Arabic text.

Mohammad Al-Fetyani 184 Nov 27, 2022
voice2json is a collection of command-line tools for offline speech/intent recognition on Linux

Command-line tools for speech and intent recognition on Linux

Michael Hansen 988 Jan 04, 2023
Search-Engine - 📖 AI based search engine

Search Engine AI based search engine that was trained on 25000 samples, feel free to train on up to 1.2M sample from kaggle dataset, link below StackS

Vladislav Kruglikov 2 Nov 29, 2022
Applied Natural Language Processing in the Enterprise - An O'Reilly Media Publication

Applied Natural Language Processing in the Enterprise This is the companion repo for Applied Natural Language Processing in the Enterprise, an O'Reill

Applied Natural Language Processing in the Enterprise 95 Jan 05, 2023
Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

Fairseq(-py) is a sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language mod

13.2k Jul 07, 2021
A collection of scripts to preprocess ASR datasets and finetune language-specific Wav2Vec2 XLSR models

wav2vec-toolkit A collection of scripts to preprocess ASR datasets and finetune language-specific Wav2Vec2 XLSR models This repository accompanies the

Anton Lozhkov 29 Oct 23, 2022
Model parallel transformers in JAX and Haiku

Table of contents Mesh Transformer JAX Updates Pretrained Models GPT-J-6B Links Acknowledgments License Model Details Zero-Shot Evaluations Architectu

Ben Wang 4.9k Jan 04, 2023