I'm curious how people approach big red buttons and intentional graceful degradation in practice, and especially how to ensure that these work when the system is experiencing problems.
E.g. do you use db-based "feature flags"? What do you do then if the DB itself is overloaded, or the API through which you access the DB?
Or do you use static startup flags (e.g. env variables)? How do you ensure these get rolled out quickly enough?
Something else entirely?
When you're a small company, simpler is actually better... It's best to keep things simple so that recovery is easy, rather than building out a more complicated solution that is reliable in the average case but fragile at the limits. Even if that means there are some places on the critical path where you don't use double redundancy, the system stays simple enough to fit in the heads of all the maintainers and can be rebooted or reverted easily.
... But once your firm starts making guarantees like "five nines uptime," some complexity will be necessary to devise a system that can continue to be developed and improved while maintaining those guarantees.
At Google we also had to routinely do "backend drains" of particular clusters when we deemed them unhealthy, and there was a system to do that quickly at the API/LB layer. At other places I've seen that done with application-level flags, so you'd do kubectl edit, which is obviously less than ideal but worked.
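To make the application-level-flag version concrete, here's a minimal sketch (the DRAIN variable and handler are invented, not any particular company's tooling): you flip an env var on the deployment with kubectl edit, the pods roll with the new value, and the app sheds traffic itself.

```python
import os

# Hypothetical application-level drain flag: flipping DRAIN=true on the
# deployment (kubectl edit, or kubectl set env) rolls the pods with the
# new value, and while drained the app sheds traffic itself rather than
# waiting for the LB layer to stop sending it.
DRAINED = os.environ.get("DRAIN", "false").lower() == "true"

def handle(request_body: str) -> tuple[int, str]:
    if DRAINED:
        # Shed load immediately; clients retry against healthy backends.
        return 503, "backend drained"
    return 200, request_body.upper()  # stand-in for the real work
```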
Implementation details will depend on your stack, but there are 3 main things I'd keep in mind:
1. Keep it simple. No elaborate logic. No complex data stores. Just a simple check of the flag (a minimal sketch follows this list).
2. Do it as close to the source as possible, but have limited trust in your clients - you may have old versions, things not propagating, bugs, etc. So it's best to have the option to degrade both in the client and on the server. If you can do only one, do the server side.
3. Real-world test! And test often. Don't trust the test environment. Test on real-world traffic. Do periodic tests at small scale (like 0.1% of traffic), but also do fuller-scale tests on a schedule. If you didn't test it, it won't work when you need it. If it worked a year ago, it will likely not work now. If it's not tested, it'll likely cause more damage than it solves.
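The sketch for point 1, with made-up names: the server reads a tiny local flag file on each check, and any failure falls toward the degraded-but-safe default, so the kill switch itself has nothing that can fall over.

```python
from pathlib import Path

# Hypothetical kill-switch file, distributed to each host by whatever
# config rollout you already have. The big red button is writing "off".
FLAG_FILE = Path("/etc/myservice/new_feature.flag")

def feature_enabled() -> bool:
    # No elaborate logic, no complex data store: read a tiny local file.
    # Any failure (missing file, permissions, garbage contents) degrades
    # to the safe default, so the kill switch has no dependencies that
    # can themselves fall over.
    try:
        return FLAG_FILE.read_text().strip() == "on"
    except OSError:
        return False  # fail toward the degraded-but-safe mode
```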
To make up an example that doesn't depend on any of those things: imagine that I've added a new feature to Hacker News that allows users to display profile pictures next to their comments. Of course, we have built everything around microservices, so this is implemented by the frontend page generator making a call to the profile service, which does a lookup and responds with the image location. As part of the launch plan, I document the "big red button" procedure to follow if my new component is overloading the profile service or image repository: run this command to rate-limit my service's outgoing requests at the network layer (probably to 0 in an emergency). It will fail its lookups and the page generator is designed to gracefully degrade by continuing to render the comment text, sans profile photo.
(Before anyone hits send on that "what a stupid way to do X" reply, please note that this is not an actual design doc, I'm not giving advice on how to build anything, it's just a crayon drawing to illustrate a point)
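In the same crayon-drawing spirit, a sketch of the degradation path (service name and timeout are made up): the page generator treats the profile service as strictly optional, so a rate limit, timeout, or error just means no photo.

```python
import urllib.request

PROFILE_SERVICE = "http://profile-service.internal/avatar"  # made up

def avatar_url(user: str) -> str | None:
    # The profile lookup is strictly optional: if the service is
    # rate-limited to zero, overloaded, or slow, time out fast and
    # render the comment without a photo instead of stalling the page.
    try:
        with urllib.request.urlopen(
            f"{PROFILE_SERVICE}?user={user}", timeout=0.2
        ) as resp:
            return resp.read().decode().strip()
    except OSError:
        return None  # graceful degradation: comment text still renders

def render_comment(user: str, text: str) -> str:
    url = avatar_url(user)
    img = f'<img src="{url}"> ' if url else ""
    return f"{img}<b>{user}</b>: {text}"
```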
I've worked at enough companies to see that many of them have some notion of "centralized config that's rolled out to the edge and can be updated at runtime".
I've done this with djb's CDB (constant database). But I've also seen people poll an API for JSON config files, or use dbm/gdbm/Berkeley DB/LevelDB.
This can extend to other big red buttons. It's not that elegant, but I've had numerous services that checked for the presence of a file to serve health checks, so pulling a node out of load balancer rotation was as easy as creating a file.
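A hedged sketch of that trick (sentinel path and port are invented): the health endpoint fails whenever the file exists, and the load balancer does the rest.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical sentinel path: `touch /var/run/myservice/drain` makes the
# health check fail, and the load balancer pulls this node from rotation.
DRAIN_FILE = "/var/run/myservice/drain"

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Healthy only while the drain file is absent.
        self.send_response(503 if os.path.exists(DRAIN_FILE) else 200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 9000), HealthHandler).serve_forever()
```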
The idea is that when there's a database outage, the system defaults to serving the last known good config.
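A minimal sketch of that pattern (endpoint and paths invented): poll for config, atomically persist each good copy, and keep serving the last one you wrote whenever the fetch fails.

```python
import json
import os
import urllib.request

CONFIG_URL = "http://config.internal/flags.json"   # invented endpoint
CACHE_PATH = "/var/cache/myservice/flags.json"     # last known good copy

def load_flags() -> dict:
    try:
        with urllib.request.urlopen(CONFIG_URL, timeout=1) as resp:
            flags = json.load(resp)  # raises on truncated/garbage JSON
        tmp = CACHE_PATH + ".tmp"
        with open(tmp, "w") as f:
            json.dump(flags, f)
        os.replace(tmp, CACHE_PATH)  # atomic rename: never half-written
        return flags
    except (OSError, ValueError):
        # Config service (or the DB behind it) is down or serving junk:
        # fall back to the last config that parsed cleanly.
        with open(CACHE_PATH) as f:
            return json.load(f)
```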
Versioning, feature toggles, and rollback: automated, and implemented at different levels depending on the system. That could mean an env configuration, a DB field, down-migration scripts, or redeploying the last working version.
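As a hedged illustration of checking a toggle at more than one of those levels (all names invented): an env var wins if set, then a DB field, then a hard-coded safe default, so the env override stays usable even when the DB is the thing that's on fire.

```python
import os
import sqlite3

def toggle(name: str, db: sqlite3.Connection, default: bool = False) -> bool:
    # Level 1: env override, set at deploy/restart time. Still works
    # when the database is the thing that's on fire.
    env = os.environ.get(f"TOGGLE_{name.upper()}")
    if env is not None:
        return env == "1"
    # Level 2: DB field, flippable at runtime without a deploy.
    try:
        row = db.execute(
            "SELECT enabled FROM toggles WHERE name = ?", (name,)
        ).fetchone()
        if row is not None:
            return bool(row[0])
    except sqlite3.Error:
        pass  # an overloaded DB must not take the kill switch with it
    # Level 3: hard-coded safe default.
    return default
```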