The LLM productivity puzzle
Code generation is arguably one of the most interesting applications of LLMs, and one of the first with real commercial use (Copilot/Codex, Codegen, etc.). If you spend time on the internet these days you’ll see people claim productivity gains ranging from 0 to 100x, selection-biased to the high end (1). Whenever you see several orders of magnitude of disagreement, it’s worth trying to understand why.
While the extremes can almost certainly be explained as either deliberate hyperbole (promoters with no real experience writing code) or uninformed contrarianism (naysayers who have not made any serious attempt at using LLMs), there is a simpler and less cynical explanation for the divergence: It’s a reflection of the diversity of tasks involved in writing software.
Software development means a lot of different things, and it’s only natural to expect a new tool to be better suited to some types of engineering than others. If you build a standard component from scratch (e.g. a web dashboard with a simple UI), odds are the requirements can be specified in a reasonably sized prompt. If you’re building on top of decades of legacy code with lots of non-obvious design decisions baked in, then (a) communicating that context to the model is hard (i.e. it would require a long sequence of carefully crafted prompts), and (b) even if you manage to, the model might not be able to make sense of it. As far as we know, LLMs don’t understand the structure of code at any fundamental level, and it’s not clear that they can pick up on the non-local context required to speed up development on complex tasks by, say, 10x.
All this means that your mileage will vary depending on the kind of engineering you do. From what I’ve seen, LLM enthusiasts tend to work on things that have a high degree of isolation and require relatively little context, while the naysayers work on systems with lots of proprietary frameworks. To be sure, LLMs can be useful for either type of work, but it’s clear that you’ll find it easier to get good results on the former. The key lies in using your intuition as an engineer — and your understanding of how LLMs work — to pick the right tasks.
A simple example that highlights the divergence in perceived usefulness is code translation. People report everything from perfect results (the translated code compiles and works as intended) to useless fragments (the translated code doesn’t run and needs extensive fixes). I’ve experienced both ends of the spectrum, even within the same language pair. Translating small utility functions almost always works flawlessly. On the other hand, a recent attempt at translating a method from Node to Go using ChatGPT failed miserably: the function used protobuf-generated objects, and the model wasn’t able to figure out how attribute assignment differs between the Node and Go bindings.
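To make the mismatch concrete, here is a rough sketch of the kind of thing that trips the model up — not the actual function from that attempt; the message and field names below are made up. The official google-protobuf bindings for Node expose fields through generated setters, while classic protoc-gen-go output exposes plain struct fields plus nil-safe getters, so a line-by-line translation of the setter calls simply doesn’t compile:

```go
package main

import "fmt"

// User stands in for what protoc-gen-go would emit for a hypothetical
// user.proto message (real generated code also carries protobuf internals).
type User struct {
	Name  string
	Score int64
}

// GetName mirrors the nil-safe getters that protoc-gen-go generates.
func (u *User) GetName() string {
	if u == nil {
		return ""
	}
	return u.Name
}

func main() {
	// The Node original (official google-protobuf bindings) would read:
	//   const user = new pb.User();
	//   user.setName("ada");
	//   user.setScore(42);
	//
	// A literal translation such as user.SetName("ada") does not compile:
	// the Go bindings have no setters, so fields are assigned directly.
	user := &User{Name: "ada", Score: 42}
	fmt.Println(user.GetName(), user.Score)
}
```

The fix is obvious once you’ve looked at the generated .pb.go file — which is exactly the kind of non-local context the model never had.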
It’s early days for LLM code generation and I’m certain we’ll see a lot of improvement over time. How quickly this happens remains to be seen. The fact that LLMs perform well on program synthesis is considered to be the result of “emergence”: Training on large amounts of commented code gives the model a weak supervised signal for code generation (2). If you make the model large enough and the datasets big enough, the ability to generate code from prompts emerges. I remain skeptical that the kind of understanding of non-local context needed for complex engineering tasks can emerge simply by scaling to ever larger models and datasets (3).
If you write a lot of code and use LLMs to do it, reach out on Twitter or email me. I’m keen to collect more data and hear about other people’s experiences.
Footnotes
(1) The loudest people tend to either have had the most success using them or have a vested interest in attracting attention (e.g. promoting their coding bootcamp or YouTube channel).
(2) Nijkamp et al., “CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis”, https://arxiv.org/abs/2203.13474
(3) [Edit 2023-04-01] Steve Yegge at Sourcegraph highlights an intriguing approach to overcoming context size limitations of LLMs — using code search to optimally populate context for a given prompt.
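That write-up doesn’t prescribe an implementation, but the basic shape is easy to sketch. The following toy illustration is built on my own assumptions (keyword-overlap scoring standing in for a real code-search backend; all names are made up): rank repository snippets by relevance to the task, then pack the best ones into the prompt until the context budget runs out.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Snippet is an indexed chunk of repository code.
type Snippet struct {
	Path string
	Code string
}

// rankSnippets is a stand-in for a real code-search backend: it scores each
// snippet by how many words of the task description appear in its code.
func rankSnippets(index []Snippet, task string) []Snippet {
	words := strings.Fields(strings.ToLower(task))
	score := func(s Snippet) int {
		code, n := strings.ToLower(s.Code), 0
		for _, w := range words {
			if strings.Contains(code, w) {
				n++
			}
		}
		return n
	}
	ranked := append([]Snippet(nil), index...)
	sort.SliceStable(ranked, func(i, j int) bool { return score(ranked[i]) > score(ranked[j]) })
	return ranked
}

// buildPrompt packs the highest-ranked snippets into the prompt until a rough
// character budget (a proxy for the model's context window) is exhausted.
func buildPrompt(task string, ranked []Snippet, budget int) string {
	var b strings.Builder
	for _, s := range ranked {
		if b.Len()+len(s.Code) > budget {
			break
		}
		fmt.Fprintf(&b, "// %s\n%s\n\n", s.Path, s.Code)
	}
	b.WriteString("Task: " + task + "\n")
	return b.String()
}

func main() {
	index := []Snippet{
		{Path: "billing/invoice.go", Code: "func TotalInvoice(items []LineItem) Money { ... }"},
		{Path: "auth/session.go", Code: "func NewSession(userID string) (*Session, error) { ... }"},
	}
	task := "add a discount field to the invoice total calculation"
	prompt := buildPrompt(task, rankSnippets(index, task), 4000)
	fmt.Println(prompt) // the populated prompt is what would be sent to the LLM
}
```

A real system would use embeddings or a proper code-search index rather than keyword overlap, but the principle is the same: spend the limited context window on the code most relevant to the task instead of asking the user to paste it in by hand.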