The secret to good writing is editing
Says InCoder: A Generative Model for Code Infilling and Synthesis
It seems like there’s a new model coming out every other day now. If you’ve ever watched a large language model answer a question, it does so left to right, autoregressively spitting out one token after another. But the process of writing something well isn’t necessarily just that: it’s also editing.
This is especially true in programming, where tasks like adding comments, fixing bugs, or renaming variables necessitate editing. Moreover, programs are often not written top-down but have complex dependencies. Can a language model ever learn to capture this?
Introduction and Motivation
Enter InCoder.
The Core Idea: During training, the model randomly replaces a span of code with a special mask token and moves that span to the end of the sequence. The model is then trained, left to right, to predict the code in this new order. When it’s time to edit a program, you replace the portion of code you want to change with the mask token and let the model generate new code to fill the void. This lets the model make changes to a program without having to start over from scratch.
This clever approach casts editing as a next-token generation problem. And preliminary results look quite good.
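To make this concrete, here is a minimal sketch of how such a training example might be constructed. The sentinel strings and the single-span simplification are my own assumptions for illustration; the paper masks one or more spans and defines its own special tokens.

```python
import random

# Hypothetical sentinel strings, for illustration only; the paper defines
# its own special tokens.
MASK = "<|mask:0|>"
END_OF_MASK = "<|endofmask|>"

def causal_mask_example(tokens):
    """Turn a token sequence into a causal-masking training example:
    a random span is cut out, replaced by a sentinel, and appended at
    the end, so the model can still be trained purely left to right."""
    # Pick a random contiguous span to hide.
    start = random.randrange(len(tokens))
    end = random.randrange(start + 1, len(tokens) + 1)
    span = tokens[start:end]

    # The visible context keeps a sentinel where the span used to be...
    visible = tokens[:start] + [MASK] + tokens[end:]
    # ...and the hidden span moves to the end, after the sentinel again,
    # so predicting it is just ordinary next-token generation.
    return visible + [MASK] + span + [END_OF_MASK]

# Example: part of a tiny function gets masked and moved to the end.
print(causal_mask_example("def add ( a , b ) : return a + b".split()))
```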
The model uses a “causal masking” objective, which lets it condition on context both before and after a masked span while still generating tokens left to right. The 6.7B-parameter InCoder model was trained on 248 V100 GPUs for 24 days, on a corpus of 159 GB of code.
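At inference time, infilling follows the same recipe: replace the region you want to rewrite with the sentinel, append the sentinel once more, and let the model generate until it emits the end-of-mask token. The sketch below uses the Hugging Face transformers API with the publicly released facebook/incoder-1B checkpoint; the checkpoint name, sentinel strings, and exact prompt format are assumptions on my part, so check the model card before relying on them.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint and sentinel strings; verify against the model card.
CHECKPOINT = "facebook/incoder-1B"
MASK = "<|mask:0|>"
END_OF_MASK = "<|endofmask|>"

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForCausalLM.from_pretrained(CHECKPOINT)

# Code with a hole: we want the model to write the function body.
prefix = "def count_lines(path):\n    "
suffix = "\n    return n\n"

# Infilling prompt: the context with a sentinel at the hole, then the
# sentinel again to signal "now generate the missing span".
prompt = prefix + MASK + suffix + MASK

inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=64, do_sample=False)

# Keep only the newly generated tokens, up to the end-of-mask marker.
new_tokens = output[0][inputs["input_ids"].shape[1]:]
infill = tokenizer.decode(new_tokens, skip_special_tokens=False)
infill = infill.split(END_OF_MASK)[0]

print(prefix + infill + suffix)
```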
Evaluation
The paper does some interesting ablation studies and demonstrates the following:
A causal masking objective often does better than a purely causal (left-to-right) objective or a purely masked one.
I must emphasize: often, but not always (see the picture below). The paper does not posit a theory for why left-to-right reranking sometimes does better, presumably leaving this to future work.
In the appendix, we see comparisons to genuinely comparable external models like Codex and CodeBERT. As noted in the image below, the comparison to Codex could be unfair, since Codex might contain CodeSearchNet (the test set) in its training data, causing data leakage.
Limitations and Future Work
First, the paper relegates its comparison to Codex to the appendix and comes up short there, noting that Codex could have the test set in its training data. I have two problems with this:
CodeBERT, which does not suffer from this data-leakage problem, still outperforms InCoder in certain languages, suggesting better generalization.
If CodeSearchNet was not usable as a fair test set, why was another test set not used?
The model is compared against itself quite a bit and demonstrates many capabilities: variable name generation, return type generation, docstring generation, code synthesis, and so on. I would have appreciated more comparisons with SOTA models in these areas, to properly evaluate these techniques.
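Each of these capabilities boils down to where you place the mask. Purely as an illustration (prompt construction only, using the same assumed sentinel string as in the sketches above):

```python
MASK = "<|mask:0|>"  # assumed sentinel string, as in the earlier sketches

# Docstring generation: mask the docstring slot and let the model fill it.
docstring_prompt = (
    'def median(xs):\n'
    '    """' + MASK + '"""\n'
    '    xs = sorted(xs)\n'
    '    return xs[len(xs) // 2]\n'
) + MASK

# Return type generation: mask only the return annotation.
return_type_prompt = (
    "def median(xs: list) -> " + MASK + ":\n"
    "    xs = sorted(xs)\n"
    "    return xs[len(xs) // 2]\n"
) + MASK
```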
An interesting idea suggested by the model’s data collection process: the model did much better when trained on data from StackOverflow. One could theorize that the explanations accompanying the code allowed it to draw better correlations between natural language and programming. I would love to see this explored further.
In Summary: moving negative evaluations of your model to the appendix feels so common that I should probably make it its own section in this newsletter (“Appendix Secrets”, perhaps?). I particularly dislike this trend in CS, because it teaches researchers to hide their flaws and, in my opinion, goes against the spirit of scientific inquiry.
However, I don’t mean to be cynical or to belittle the paper’s accomplishments. Infilling is a very important next step, and InCoder is a step in the right direction for thinking about how to evolve language models beyond scale.