Andreas Fragner

Agents are commoditizing the complement

Implementation — writing code — is a complement to specification — writing down what the code should do. Agents are commoditizing the former, pushing up demand for the latter.

I’ve been using LLMs for writing code since late 2022, starting with the Copilot VS Code extension. I saw improvements to my own output early on but steered clear of the hype train and never bought into outrageous claims of 100x productivity improvements, simply because they didn’t reflect my own experience. Those claims for early models were obvious hyperbole in hindsight, and GitHub Copilot ca. 2022 was in fact little more than smart code completion. But things have changed qualitatively over the last couple of months as agents finally started to work in 2025. The current generation of models — and, importantly, advances in context engineering and the tooling built around them — are starting to fundamentally change the nature of writing software, and therefore what it means to be an engineer.

Do I still wish we had 2x as many engineers on the team? Absolutely. But the engineers I want now are different from the ones I would’ve hired a couple of years ago. For a long time, you could make a good living doing pure implementation — knowing the ins and outs of your chosen language, every trick in the SQL book, etc., and using that to write optimized implementations based on unambiguous specs. While that continues to be relevant, I suspect the value you can capture from a pure implementation focus is going to drop fairly quickly from here. Agents are a multiplier on your software and system design skills, less so on your implementation skills. What matters now is writing clear specs and requirements (and writing clearly and concisely in general), having strong intuition for the right abstractions, and a deep understanding of what drives codebase complexity and how to keep it in check. Demand for those skills has always been high, but it’s going to explode in the coming years.

It’s a classic case of commoditizing the complement. Implementation — writing code — is an economic complement to specification — writing down what the code should do. They’re strong (but not perfect) complements, since each on its own isn’t worth nearly as much without the other. You can write code that does things, but without a clear spec it’s unlikely to deliver much value. Likewise, you can write the perfect spec, but if you can’t implement it — within time and resource constraints — it remains an academic exercise. Coding agents are commoditizing implementation, leading to an increase in demand for its complement, writing specs.

Making software is insanely costly today, and that cost is almost entirely driven by personnel. Historically, quality software has always been supply-constrained, not demand-constrained. There’s no obvious limit to how much good software we need in the world. Commoditizing one of its key inputs is a really exciting development.

Andreas Fragner

The LLM productivity puzzle

Code generation is arguably one of the most interesting applications of LLMs, and one of the first with real commercial use (Copilot/Codex, Codegen, etc.). If you spend time on the internet these days you’ll see people claim productivity gains ranging from 0 to 100x, selection-biased to the high end (1). Whenever you see several orders of magnitude of disagreement, it’s worth trying to understand why.

While the extremes can almost certainly be explained as either deliberate hyperbole (promoters with no real experience writing code) or uninformed contrarianism (naysayers who have not made any serious attempt at using LLMs), there is a simpler and less cynical explanation for the divergence: It’s a reflection of the diversity of tasks involved in writing software.

Software development means a lot of different things, and it’s only natural to expect a new tool to be more suited to some types of engineering than others. If you build a standard component from scratch (e.g. a web dashboard with a simple UI), odds are the requirements can be specified in a reasonably sized prompt. If you’re building on top of decades of legacy code with lots of non-obvious design decisions baked in, then (a) communicating that context to the model is hard (i.e. it would require a long sequence of carefully crafted prompts), and (b) even if you manage to, it might not be able to make sense of it. As far as we know, LLMs don’t understand the structure of code at any fundamental level, and it’s not clear that they can pick up on the non-local context required to speed up development on complex tasks by, say, 10x.

All this means that your mileage will vary depending on the kind of engineering you do. From what I’ve seen, LLM enthusiasts tend to work on things that have a high degree of isolation and require relatively little context, while the naysayers work on systems with lots of proprietary frameworks. To be sure, LLMs can be useful for either type of work but it’s clear that you’ll find it easier to get good results on the former. The key lies in using your intuition as an engineer — and your understanding of how LLMs work — to pick the right tasks.

A simple example that highlights the divergence in perceived usefulness is code translation. People report everything from perfect results (translated code compiles and works as intended) to useless fragments (translated code doesn’t run and needs a lot of fixes). I’ve experienced both ends of the spectrum, even within the same language pair. Translating utility functions almost always works flawlessly. On the other hand, a recent attempt at translating a method from Node to Go using ChatGPT failed miserably because the function used protobuf-generated objects and the model wasn’t able to figure out how attribute assignment differs between the Node and Go bindings.
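To make that failure mode concrete, here is a minimal Go sketch of the kind of mismatch involved. The User struct and its fields are hypothetical stand-ins for what protoc-gen-go would generate from a .proto file; the actual message from that attempt isn’t shown.

```go
package main

import "fmt"

// Hypothetical stand-in for a protoc-gen-go generated message:
// fields are plain exported struct members that you assign directly.
type User struct {
	Name  string
	Email string
}

func main() {
	// The Node bindings (google-protobuf) generate setter/getter methods instead:
	//   const u = new User();
	//   u.setName("Ada");
	//   u.setEmail("ada@example.com");
	// A literal translation that keeps the setter style (u.SetName("Ada"))
	// doesn't compile, because the generated Go struct has no such methods.
	u := &User{}
	u.Name = "Ada"
	u.Email = "ada@example.com"
	fmt.Println(u.Name, u.Email)
}
```

In other words, the two code generators expose the same message through different idioms, and carrying one idiom across the language boundary is exactly the kind of non-local detail the model missed.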

It’s early days for LLM code generation and I’m certain we’ll see a lot of improvement over time. How quickly this happens remains to be seen. The fact that LLMs perform well on program synthesis is considered to be the result of “emergence”: Training on large amounts of commented code gives the model a weak supervised signal for code generation (2). If you make the model large enough and the datasets big enough, the ability to generate code from prompts emerges. I remain skeptical that the kind of understanding of non-local context needed for complex engineering tasks can emerge simply by scaling to ever larger models and datasets (3).


If you write a lot of code and use LLMs to do it, reach out on Twitter or email me. I’m keen to collect more data and hear about other people’s experiences.

Footnotes

(1) The loudest people tend to either have had the most success using them or have a vested interest in raising attention (e.g. promoting their coding bootcamp or YouTube channel).

(2) https://arxiv.org/abs/2203.13474

(3) [Edit 2023-04-01] Steve Yegge at Sourcegraph highlights an intriguing approach to overcoming context size limitations of LLMs — using code search to optimally populate context for a given prompt.
