Someone asked me on a mailing list if I worry about reaching my team’s limit in our ability to manage Redis. I wrote a bit about why I think our architecture helps mitigate that risk. I’m not sure I really expressed that I think the goal is flexibility through simplicity rather than through adapters and abstractions, but I like what I wrote. I think it might be helpful if you’re asking yourself, “Should I buy or should I build?” when it comes to lower-level services.
As a rule on this mailing list, I share stories instead of advice. I can’t know everything you’re doing, and applying what worked for me won’t work for you. But learning how people reacted to situations and the outcomes from those actions, that’s helpful.
Full text below.
Of course I worry about what I don’t know, but I don’t believe Amazon support will be able to answer the questions faster than a consultant. I’ve worked with many vendors across a lot of product areas and, frankly, by the time we get answers we’ve rebuilt the clusters, redeployed elsewhere, done it differently, etc.
Every outage of our own making has been due to a change in our system or to hitting capacity limits in ways we didn’t see coming. And we’ve been able to roll back and scale up while we figure out what happened.
Our worst case scenario (the cluster losing quorum during a large data import) happened last year. We had to be read only for a few hours, and it took a week to properly recover. Our ops consultant wasn’t able to reproduce the failure, but we did a lot of testing and vastly improved our cluster’s and application’s ability to mitigate that kind of failure. It was entirely a data issue after that, and support wouldn’t have helped speed recovery.
Then there was the latest S3 outage: Lambda requires S3, there was nothing we could do, and support couldn’t help. It recovered before we could work around it. However, our ops overhead here is tiny compared to running a k8s cluster or having app servers, so it’s a trade-off we don’t mind making right now.
What the stories don’t share is that we invested heavily in simplicity. A simple architecture that is well understood by everyone. It took over a year of effort but pretty much every engineer can now draw the full life cycle of a request. And our platform team can tell you the direction of data flow for all our services.
And really there’s no longer much to it and it’s uniform across services. (Primary flow is CDN => Render => API => data and back.)
I’m pretty sure we could change cloud providers for our primary service in under a week (with a huge long tail for ops; a move, mind you, that we’ve decided not to invest in testing at this time, risk < cost). And while I have no plans to leave Redis, I think we could change data layers in a few months if our requirements changed. (Which they are!)
My app isn’t your app: I have a light write load and a heavy read load, and our customers are primarily read only. We invested heavily in product and decided to keep our ops light. Hardly a fit for everyone, but it’s why I’m not worried about the unknown unknowns. There isn’t room for too many of them anymore.