Contains the code and data for our #ICSE2022 paper, "CodeFill: Multi-token Code Completion by Jointly Learning from Structure and Naming Sequences".

Last update: Oct 31, 2022

Overview

CodeFill

This repository contains the code for our paper "CodeFill: Multi-token Code Completion by Jointly Learning from Structure and Naming Sequences", DOI: 10.1145/3510003.3510172. The paper is authored by Maliheh Izadi, Roberta Gismondi, and Georgios Gousios, and has been accepted for publication at #ICSE2022.

Abstract

Code completion is an essential feature of IDEs, yet current autocompleters are restricted to either grammar-based or NLP-based single-token completions. Both approaches have significant drawbacks: grammar-based autocompletion is restricted in dynamically-typed language environments, whereas NLP-based autocompleters struggle to understand the semantics of the programming language and the developer's code context.

In this work, we present CodeFill, a language model for autocompletion that combines learned structure and naming information. Using a parallel Transformer architecture and multi-task learning, CodeFill consumes sequences of source code token names and their equivalent AST token types. Uniquely, CodeFill is trained both for single-token and multi-token (statement) prediction, which enables it to learn long-range dependencies among grammatical and naming elements. We train CodeFill on two datasets, consisting of 29M and 425M lines of code, respectively. To make the evaluation more realistic, we develop a method to automatically infer points in the source code at which completion matters. We compare CodeFill against four baselines and two state-of-the-art models, GPT-C and TravTrans+. CodeFill surpasses all baselines in single-token prediction (MRR: 70.9% vs. 66.2% and 67.8%) and outperforms the state of the art for multi-token prediction (ROUGE-L: 63.7% vs. 52.4% and 59.2%, for n=4 tokens). We publicly release our source code and datasets.
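For readers unfamiliar with the MRR figure quoted above, the following is a minimal sketch of how mean reciprocal rank is commonly computed for single-token completion. It is not taken from this repository's evaluation code; the function name `mean_reciprocal_rank` and its input layout (one ranked candidate list per completion point) are assumptions made for illustration.

```python
from typing import Sequence


def mean_reciprocal_rank(
    ranked_predictions: Sequence[Sequence[str]],
    targets: Sequence[str],
) -> float:
    """Average of 1/rank of the correct token over all completion points.

    Hypothetical helper for illustration only: each entry of
    `ranked_predictions` is a model's candidate list, best guess first.
    A completion point whose target is missing from the list scores 0.
    """
    if len(ranked_predictions) != len(targets):
        raise ValueError("need exactly one candidate list per target")
    if not targets:
        return 0.0

    total = 0.0
    for candidates, target in zip(ranked_predictions, targets):
        for rank, candidate in enumerate(candidates, start=1):
            if candidate == target:
                total += 1.0 / rank  # reciprocal of the 1-based rank
                break
    return total / len(targets)


if __name__ == "__main__":
    predictions = [["self", "cls"], ["append", "extend"], ["len", "sum"]]
    expected = ["self", "extend", "max"]
    # Ranks are 1, 2, and "not found", so the score is (1 + 0.5 + 0) / 3.
    print(mean_reciprocal_rank(predictions, expected))
```

A higher value means the correct token tends to appear nearer the top of the suggestion list, which is why the metric suits ranked completion pop-ups.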

Data

Our datasets are available on the Hugging Face Hub.

Owner

Software Analytics Lab
