Using fewer parts
Fewer parts make for better software and better products.
The best-performing firms make a narrow range of products very well. The best firms’ products also use up to 50 percent fewer parts than those made by their less successful rivals. Fewer parts means a faster, simpler (and usually cheaper) manufacturing process. Fewer parts means less to go wrong; quality comes built in. And although the best companies need fewer workers to look after quality control, they also have fewer defects and generate less waste.
— Yvon Chouinard, Let my people go surfing
Chouinard’s observation applies to software products almost verbatim. Using fewer parts makes for better software: Easier to maintain, easier to extend, better margins. But what does “fewer parts” mean? And how do you know which ones to remove?
Fewer parts means making parts reusable. A good design minimizes the number of components at constant functionality. That means avoiding duplication and making things reusable. If you can reimplement a system with a smaller number of components (functions, classes, services, etc.), that’s a sign that the original solution was either over- or under-engineered. Over-engineered because it introduced abstractions that weren’t necessary; under-engineered because it failed to identify reusable parts. It can be tempting to make fewer but larger components, but those almost always end up being less reusable. You might have fewer functions in such a design, but you don’t have fewer parts.
Fewer parts means fewer representations of the data. All else equal, the amount of logic required to support n representations of the same data scales like n², since in the worst case each representation has to be converted to and from every other one. It’s not uncommon for teams to maintain protobuf models, SQL schemas, OpenAPI specs, GraphQL schemas, etc. all to support a single product. They might have a source of truth that defines the “core” data models (e.g. in protobuf), but still end up spending a ton of bandwidth on maintaining model converters and crafting migrations. Most people intuitively prefer to have fewer data representations, but the challenge is that different applications typically need different views or different derived properties of the data. That can lead to a proliferation of derived models which may not have strict one-to-one relationships with the original models.
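As a back-of-the-envelope sketch of that scaling, the snippet below counts the directed converters needed for n representations. It assumes the worst case where every representation must be convertible to every other one, and compares it with routing everything through a single canonical model. Both function names are made up for illustration, and the hub-and-spoke comparison is an added assumption rather than something the argument above depends on.

```python
def pairwise_converters(n: int) -> int:
    """Directed converters needed if every representation maps to every other."""
    return n * (n - 1)


def hub_converters(n: int) -> int:
    """Directed converters needed if all conversions go through one canonical model."""
    return 2 * (n - 1) if n > 0 else 0


if __name__ == "__main__":
    for n in (2, 4, 6):
        print(n, pairwise_converters(n), hub_converters(n))
```

Even with a canonical model, every extra representation still adds two converters plus the migrations that come with them, so the cheapest representation is the one you never introduce.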
Fewer parts means fewer languages and fewer tools. There is almost never a good enough reason to add another language to your stack. The increase in complexity and maintenance burden is consistently underestimated relative to the benefits. The same goes for databases. Performance reasons are often not strong enough to justify adding a new type of DB to cater to your latest special use case.
Fewer parts means smaller teams. Smaller teams spend less time coordinating and more time building and owning things. In most start-ups, a small number of engineers (3-4) build the first iteration of the product, which ends up generating 80% of the lifetime value of the product. It’s clearly possible to build complex things with a small, focused team. But as more money is raised, engineering teams balloon because they lose focus and add components that are not directly aligned with creating customer value. It’s Parkinson’s law at work. Companies perceive things to be mission-critical for the product, then craft a budget based on that, which must then be used once allocated, so more people are hired who then produce yet more parts, and so on.
Fewer parts means fewer counterparties. Most things break at the boundaries (especially if they’re external). The greater the surface area, the riskier and the harder to maintain a system becomes. Prefer to deal with a small number of high-quality vendors, and be prepared to pay a premium. The obvious interjection here is concentration risk: If a key vendor goes into administration or decides to drop the product you rely on, that might pose an existential risk to you. Such counterparty risk can indeed matter greatly and needs to be considered, but I’ve found in practice it’s often more manageable than people think. There are SLAs and contractual notice periods, and the majority of counterparties will honor them, giving you time to adjust. If you do need to replace a vendor, you start out with a much clearer picture of the requirements and the scope of the integration, which cuts down on time-to-market.
If using fewer parts is a good idea, how come modern software production appears to be so bloated? Dozens of vendors, a stack that’s 7 layers deep and includes 4 languages, teams of 60+ developers, etc. feel like the norm. Clearly, companies believe they need this many parts to deliver value to customers. Few people are deliberately trying to waste resources, after all. But the problem is that people lose sight of which activities actually create value. As a company grows, a disconnect starts to develop between the activities performed by its employees and the value that is delivered to customers. In a 10-person firm, everyone speaks to customers, everyone knows the value chain and everyone uses the product. In a 1000-person firm, almost inevitably most employees have never spoken to customers and may work on parts of the system that are increasingly far removed from what the customer sees. This is one instance where great management can make a huge difference. In well-managed firms, management goes to great lengths to communicate the link between firm activities and value creation. The focus is on customers and the problems they face, rather than process and efficiency gains. If you focus on serving your customers better, efficiency will take care of itself.
A few principles I follow to keep the number of parts small:
Hire fewer but better people and pay them more.
Work with fewer but better vendors and be willing to pay a premium. Be systematic about selecting them and understand the risks.
Each project you decide to allocate resources to must have a 3-4 sentence description of how it creates value for customers. People often struggle with this if the work is abstract or far removed from what the customer sees (say, work on infrastructure) but I’ve found it’s always possible if the work is worth pursuing.
Early-stage engineering
Early on you need to be fast. And to do that you have to have the confidence to break with best practices.
Early on you need to be fast. Your team, your stack, your infrastructure — they all need to be set up for that. To do that, you have to have the confidence to break with best practices. That confidence comes from knowing what risks actually matter in your context. The risks you care about when you’re building version 1.0 are very different from the risks a large organization cares about. The way you approach engineering has to reflect that.
When you start out building something new, everything is in flux. Your requirements aren’t understood yet, your data model will evolve, your API boundaries will shift, your interfaces haven’t firmed up yet, etc. That’s natural, and it’s key to embrace that uncertainty when working on something new. At that point it’s all about optimizing feedback loops. Make them as tight and fast as possible. What does that look like in practice:
Instrument everything — If you move fast, things break more often. You need to be able to figure out quickly and easily what’s wrong. People often interject here that adding instrumentation is extra work you can’t afford at this stage. The trick is to make it a total no-op. It should take minimal developer effort to get things like tracing, metrics and log aggregation in place. Like, zero is the goal here. Tracing in particular is such an easy win because it doesn’t require any thinking. Adding a span to a method takes at most one or two lines of (templated) code. Over time you’ll want to capture more information on a span and that takes more thinking, but just knowing the code path something took typically solves like 80% of the puzzle.
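To make the "one or two lines per method" idea concrete, here is a minimal sketch of a tracing decorator built only on the Python standard library. It is not a real tracing SDK: the `traced` decorator, the log format and the `load_user` handler are all hypothetical, and a production setup would more likely export spans to a tracing backend than write log lines.

```python
import functools
import logging
import time

logger = logging.getLogger("trace")


def traced(func):
    """Wrap a function in a span that logs its name, outcome and duration."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        status = "ok"
        try:
            return func(*args, **kwargs)
        except Exception:
            status = "error"
            raise
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            logger.info(
                "span=%s status=%s duration_ms=%.2f",
                func.__qualname__, status, elapsed_ms,
            )
    return wrapper


@traced  # the single line a developer adds per method
def load_user(user_id: int) -> dict:
    # Hypothetical handler, used only for illustration.
    return {"id": user_id}


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    load_user(42)
```

The point of the design is that the decorator needs no thought at the call site: the code path shows up in the logs without anyone deciding what to record.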
Make deployments automated and continuous — This one should be uncontroversial at this point. Every merge into main should trigger an image build, which gets deployed automatically. No action required. No release cycles. At Kappa we do on the order of 100s of “deployments” a day. A change is live in the dev cluster within 2 minutes of being merged. You get (near) immediate feedback.
Make running things locally easy and cheap — Even faster than deploying things is to just run them locally. Make it as easy as possible to run services locally and to connect to other services running remotely. Running a whole cluster locally can sometimes be hard given hardware constraints, but that’s almost never necessary (Corollary: Buy good machines for everyone, see below). One interesting development here is services like Modal, which try to abstract away the gap between local and cloud infra completely.
Make writing tests easy and cheap — The reason people don’t write more tests is that it’s hard and takes time. So it makes sense to invest in bringing that cost down, e.g. by auto-generating mocks for your services, or writing sample data generators to give you representative data for your domain. It’s pretty clear at this point that LLMs have changed the game for unit tests. It takes all of two clicks / two copy-pastes now to generate a reasonable test suite. Most of the time the model makes some mistakes, but they’re typically easy to fix. Net-net it can still be a big time saver.
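A sample data generator can be very small. The sketch below assumes a hypothetical `Order` model and made-up customer names. The one design choice worth copying is the seeded random generator, which keeps test data varied but reproducible.

```python
import random
from dataclasses import dataclass


@dataclass
class Order:
    # Hypothetical domain model, for illustration only.
    order_id: int
    customer: str
    amount_cents: int
    currency: str


def sample_orders(count: int, seed: int = 0) -> list[Order]:
    """Generate deterministic, plausible-looking orders for tests."""
    rng = random.Random(seed)  # seeded so a failing test can be replayed
    customers = ["acme", "globex", "initech"]
    return [
        Order(
            order_id=i,
            customer=rng.choice(customers),
            amount_cents=rng.randint(100, 500_000),
            currency=rng.choice(["USD", "EUR"]),
        )
        for i in range(1, count + 1)
    ]
```

A test then starts from `sample_orders(20)` instead of hand-written fixtures, and a failure can be replayed by reusing the same seed.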
Integration tests over unit tests — Integration tests that run on every build or multiple times a day give you fast, meaningful feedback. Modern systems are distributed and it’s the boundaries where most of the bugs sit. Unit tests are fine, but if you have to choose what to spend your time on, write integration tests. The components of your software obviously need to work in isolation, but it’s really the interactions where things go wrong, especially if those interactions are asynchronous.
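As a scaled-down illustration of testing an interaction rather than a component, the sketch below wires a hypothetical producer and consumer together over an `asyncio.Queue` and asserts on the end-to-end result. A real integration test would run against actual services. Everything here, including the sentinel-based shutdown protocol, is an assumption made for the example.

```python
import asyncio


async def producer(queue: asyncio.Queue, items: list[int]) -> None:
    """Publish each item, then a None sentinel to signal completion."""
    for item in items:
        await queue.put(item)
    await queue.put(None)


async def consumer(queue: asyncio.Queue) -> int:
    """Sum items until the sentinel arrives."""
    total = 0
    while True:
        item = await queue.get()
        if item is None:
            return total
        total += item


async def run_pipeline(items: list[int]) -> int:
    # A buffer of one forces the two coroutines to interleave.
    queue: asyncio.Queue = asyncio.Queue(maxsize=1)
    _, total = await asyncio.gather(producer(queue, items), consumer(queue))
    return total


def test_pipeline_end_to_end() -> None:
    # Exercises both components and the handoff protocol between them.
    assert asyncio.run(run_pipeline([1, 2, 3])) == 6
    assert asyncio.run(run_pipeline([])) == 0
```

Unit tests of `producer` and `consumer` alone would pass even if one side forgot the sentinel. Only the combined test catches that kind of bug, typically as a hang.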
Minimize wait times — Waiting for CI to finish, waiting for a code review, waiting for something to build, etc. — these things are especially detrimental to productivity because they keep you from getting closure on one piece of work and discourage you from moving on to the next task. Even if the work itself is done, it still lingers until it’s deployed. This is one strong argument for choosing a language that compiles and builds quickly.
No branch protections — One way to eliminate PR approval wait times is to not require approvals. Sounds crazy, but you wouldn’t believe how much time is wasted waiting for a review on a trivial change (the true cost is even higher than wall time because waiting and checking breaks your flow state). So trust your engineers. We’re all adults here. If your team is 5 experienced people, you can often coordinate your work well enough without PRs, just over Slack. You do end up with merge issues at times, but they’re typically infrequent and easy to resolve because early on people tend to work on fairly orthogonal things. The time spent on resolving those is easily made up for by the increase in velocity.
Minimize task overhead — This one is almost tautological at this point. Maximize uninterrupted blocks of time for people to focus. Minimize meetings and process.
Automate stack upgrades — A lot of time can be wasted when you don’t update dependencies until you’re forced to for compatibility reasons. That’s when you have to deal with a potentially large number of issues all at once, usually at the worst possible time. This is easy to fix: Just set up Dependabot.
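For reference, a minimal Dependabot setup is a single file at `.github/dependabot.yml`. The fragment below is a sketch that assumes a Python project with its manifest in the repository root. The ecosystem, directory and schedule are placeholders to adapt.

```yaml
# .github/dependabot.yml
version: 2
updates:
  - package-ecosystem: "pip"   # assumption: swap for npm, gomod, cargo, docker, ...
    directory: "/"             # assumption: manifest lives in the repo root
    schedule:
      interval: "weekly"
```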
Buy good machines for everyone — The added cost of getting high-spec machines for everyone amortizes literally in a day. Remove the constraint of local hardware as much as you can. The added cost for a team of 5 is totally negligible compared to what you pay on cloud compute.
Hire owners and generalists — The person making a change is also responsible for ensuring that it actually works once deployed. Integration tests go a long way here, but sometimes you actually have to make an API call or open the app and check UX impact. If you wait for QA to catch issues, you’ve wasted 3 days to find out you had a bug somewhere. And because you’re often out of context at that point, it becomes harder to fix.
Understand your team’s strengths — While everyone agrees that hiring great ICs is important, far too little thought goes into team composition. In fact, it’s often completely absent from recruitment plans. This is strange, since in areas outside software engineering, like professional sports, it gets at least as much attention. Building a technical product from scratch is a high performance team sport. You need great individual performers, but you also need them to complement each other, technically and personality-wise.
Some of the points above may sound crazy to someone in a mature engineering org. And for good reason! Your approach has to evolve as your product matures. The key is to understand what risks you need to care about at the point you’re at. Zero-risk deployments, well-managed sprints, carefully groomed tickets, etc. — these things all sound great in isolation, but the risk-adjusted return of doing them is just too low in the beginning.
The only risks you should care about early on are existential ones: (1) running out of cash before you launch, (2) launching too late to get enough proof points, (3) shipping too late to iterate meaningfully, (4) being too slow to incorporate feedback. The risks that are considered existential in a larger org are just fundamentally different. Reputation, competitive threats, losing customers, losing market share, product stability, service uptime: those things matter when you have an existing product with good traction. But early on, you don’t have many customers yet, and those you do have are (hopefully) more forgiving. There’s typically also little to no competition to worry about. If that isn’t the case, you may want to reconsider what you’re working on.
I believe a significant number of startups die because they cling to all the best practices of later-stage engineering, that is, doing what big companies do. What these companies do is solve for their problems, not yours. Blindly following their advice means you end up over-indexing on the wrong risks. The material in books and on blogs is heavily biased towards late-stage engineering. People simply have more time to write when there’s an existing product with stable cash flow and growth. And that’s why it’s so important to think for yourself and understand your idiosyncratic risks.