Startup Infrastructure: No Matter the Choice, You’re Still Wrong
Unpredictable changes in a startup’s first few years make the best infrastructure choices right today and wrong tomorrow.
I recently had someone respond to an article I wrote and say, among other things, that if your team is in a situation where you realize that the cloud provider and service you chose isn’t the best fit and you need to switch… it’s your own fault.
“…the ‘oh I’m not getting the performance I need from this provider’ is a failure of your engineers to vet the services they are using for your use cases.”
To be fair, I like to hold engineers to a high bar, so I don’t completely disagree. However, it’s not reasonable, especially at an early-stage startup, to expect engineers to predict what will stay the same, what will change, and what those changes will be over the next two years.
I wrote an article about senior engineers and what it means to be one. The point I make is that the most senior engineers understand one thing: it’s not possible to be absolutely right. Even if you look at all the factors, collect all the data, and understand every facet of the problem, things change. You can’t predict change. It’s all the more true with infrastructure.
Let’s assume for a moment that I’m wrong. Here are the factors, based on my personal experience, that have led to infrastructure changes in my career, and that you would have to predict to be right about your infrastructure choices for more than one to two years in an early-stage startup environment:
Sometimes, as a startup, you have to react to the market. It’s possible that interest rates go up, conflict breaks out in the Middle East, or a regime collapses in South America, and — somehow — you have to modify your product to support a new feature. It sounds crazy, but startups are crazy.
A competitor might modify their product to do what you’re doing with twice the performance and half the wait time for end-users. Your customers begin leaving your product for the competitor’s because the time saved is too valuable to ignore.
All of this can lead to infrastructure changes. You might need to use a different cloud service, possibly on another cloud provider, that has lower concurrency restrictions, higher memory limits, lower cost, or better performance for your specific new use case.
In the search for that holy grail we call product-market fit, it’s not unusual for your product to end up completely different from what it started as. If it did change, that’s a good thing. If you have the humility to listen to the market and change your product to fit the needs that exist, you’re ahead of the game.
You might start a dating site and end up with a video blogging platform, start an MMORPG and end up with an online instant messaging platform, or start a social-good network and end up with a daily deals app. All this can happen in a year or two.
Sometimes, at an early-stage startup, you have one person, out of the massive three-person company, who is the sole infrastructure expert. Many teams refer to this as the bus factor, and good teams will seek to spread that expertise around when the bus factor is less than three. But at a startup, there’s little you can do. Now, what happens when the bus comes and takes your one infrastructure expert off to a seven-figure job at Google?
Hiring isn’t easy or cheap. If you can’t find a replacement with the same expertise, you might need to change your infrastructure approach. Your new expert might insist on changes in order to maintain or improve performance.
Users: those beautiful, sweet, wonderful, annoying, painful, awful little things that keep us going; they can surprise you, and their behaviors can change. At an early-stage startup, when customers begin complaining about platform stability, you will do whatever you need to, no matter how drastic, to satisfy them.
For instance, at Teamflow, a Slack-meets-Zoom company, we initially used GCP App Engine because of its simplicity and cost-effectiveness. But as the user base grew and exhibited spiky behavior (e.g., everyone logging in around 8 AM), App Engine couldn’t keep up. We had to switch to Cloud Run, a change we couldn’t foresee but had to make.
When the runway starts to run out, people start to ask deep questions about the infrastructure. It’s typically the largest cost after salaries, so it gets the side eye first.
Sometimes, in the beginning, we make sub-par choices because of budget constraints. When the budget increases because your startup is succeeding and growing, you’ll want to take that corner-cutting infrastructure choice you made and sharpen it up. Depending on how it was built, you might be in a tough spot.
If you still think you can make infrastructure choices that are right in the present and will be right in the future… I can’t help you. Good luck. If you see what I see, you’re realizing there’s no hope; you can’t predict varying factors moving unpredictably through time. It’s time to consider your ability and capacity to make infrastructure changes.
Personally, I write my code in an infrastructure-agnostic way. I call it cloud-nimble. I don’t depend directly on cloud APIs (like AWS S3), and within my build pipeline, I reduce my services to their smallest individual unit: functions. When I need to make changes, I can move a function, or a collection of functions, to a serverful service, a serverless service, or do a lot of other dynamic things with them.
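To make "not depending directly on cloud APIs" concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: `BlobStore`, `InMemoryBlobStore`, `LocalDirBlobStore`, and `save_avatar` are hypothetical names, not part of any cloud SDK and not the actual setup described above. The idea is only that business logic lives in plain functions that talk to a small interface, so the provider behind that interface can be swapped.

```python
from pathlib import Path
from typing import Protocol


class BlobStore(Protocol):
    """Hypothetical storage interface: the only storage API the app code sees."""

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class InMemoryBlobStore:
    """Adapter that keeps blobs in a dict; handy for tests and local runs."""

    def __init__(self) -> None:
        self._items: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._items[key] = data

    def get(self, key: str) -> bytes:
        return self._items[key]


class LocalDirBlobStore:
    """Adapter that writes blobs under a directory on disk.

    An adapter for S3, GCS, or any other provider would expose the
    same two methods and wrap that provider's SDK internally.
    """

    def __init__(self, root: Path) -> None:
        self._root = root

    def put(self, key: str, data: bytes) -> None:
        path = self._root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get(self, key: str) -> bytes:
        return (self._root / key).read_bytes()


def save_avatar(store: BlobStore, user_id: str, image: bytes) -> str:
    """Business logic as a plain function: no provider names, no SDK imports."""
    key = f"avatars/{user_id}.png"
    store.put(key, image)
    return key
```

Because `save_avatar` only knows about `BlobStore`, moving it from one provider to another, or from a server to a function runtime, means writing a new adapter rather than rewriting the function.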