
In a past life I had a Wall of Shame of headlines on firmware update fails.

The lesson was that you built firmware updates up front, right into your development process, so they became a non-event. You put in lots of tests, including automatic verification and rollback recovery. You made it so everyone was 100% comfortable pushing out updates, like every hour. It wasn't this big, scary release thing.

You did binary deltas so each update was small, and trickled the download during down-time. You did A/B partitions, or if you had the flash space, A/B/C (current firmware, new update, last known good). Bricked devices and recalls are expensive and cause reputational damage. Adding OTA requires WiFi, BLE, or cell, which increases BOM cost and backend support, but the trade-off is manual updates requiring dealership visits or on-site tech-support calls with USB keys, and that doesn't scale. For consumer devices it leads to lots of unpatched, out-of-date devices, increasing support costs and legal risk. OTA also lets you push out in stages and do blue-green deployment testing.
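Roughly, the boot-slot bookkeeping for an A/B/C layout looks something like the sketch below. The metadata layout and names are illustrative, not from any particular bootloader; the point is that a pending update only gets a bounded number of boot attempts before you fall back to a known-good slot.

    /* Sketch: boot-slot selection for an A/B/C layout (current / new / last-known-good).
     * Metadata layout, field names, and MAX_BOOT_ATTEMPTS are illustrative. */
    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define MAX_BOOT_ATTEMPTS 3

    typedef enum { SLOT_A = 0, SLOT_B = 1, SLOT_C = 2, SLOT_COUNT = 3 } slot_id_t;

    typedef struct {
        bool     pending;        /* freshly written update, not yet confirmed   */
        bool     confirmed;      /* app marked itself healthy after booting     */
        uint32_t boot_attempts;  /* bumped by the bootloader on every boot try  */
        uint32_t version;        /* monotonically increasing firmware version   */
    } slot_meta_t;

    /* Prefer a pending update while it still has attempts left; otherwise
     * fall back to the newest confirmed (last known good) slot. */
    static slot_id_t pick_boot_slot(slot_meta_t meta[SLOT_COUNT])
    {
        for (int i = 0; i < SLOT_COUNT; i++) {
            if (meta[i].pending && meta[i].boot_attempts < MAX_BOOT_ATTEMPTS) {
                meta[i].boot_attempts++;   /* persisted to flash in a real system */
                return (slot_id_t)i;
            }
        }
        slot_id_t best = SLOT_A;
        for (int i = 1; i < SLOT_COUNT; i++) {
            if (meta[i].confirmed && meta[i].version > meta[best].version)
                best = (slot_id_t)i;
        }
        return best;
    }

    int main(void)
    {
        slot_meta_t meta[SLOT_COUNT] = {
            { .confirmed = true,  .version = 41 },                      /* A: current    */
            { .pending   = true,  .version = 42, .boot_attempts = 3 },  /* B: bad update */
            { .confirmed = true,  .version = 40 },                      /* C: known good */
        };
        printf("booting slot %d\n", pick_boot_slot(meta));  /* falls back to slot A */
        return 0;
    }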

For security, you had asymmetric keys and signed each update (the device only needs the public verification key; the signing key stays on the build server), then rolled the keys so if someone reverse-engineered one unit's firmware, it wouldn't be a total loss. Ideally add a TPM to the BOM with multiple key slots and a hardware encryption engine. Anyone thinking about shipping unencrypted firmware, or baking symmetric encryption keys into firmware, should be publicly flogged.
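The verification step itself is small. Here's a minimal sketch using libsodium's detached Ed25519 signatures; the key-slot indexing and image layout are made up for illustration, and in practice the public keys would sit in a TPM, secure element, or OTP rather than a plain array.

    /* Sketch: verify a detached Ed25519 signature on a firmware image before
     * accepting it (libsodium). Key-slot handling here is illustrative only. */
    #include <sodium.h>
    #include <stdint.h>
    #include <stdio.h>

    #define NUM_KEY_SLOTS 4

    /* Public verification keys provisioned at manufacturing time; key rolling
     * means pointing newer updates at the next slot. */
    static const unsigned char pubkeys[NUM_KEY_SLOTS][crypto_sign_PUBLICKEYBYTES];

    static int firmware_is_valid(const uint8_t *image, size_t image_len,
                                 const uint8_t sig[crypto_sign_BYTES],
                                 unsigned key_slot)
    {
        if (key_slot >= NUM_KEY_SLOTS)
            return 0;
        /* Only the public key lives on the device; the private signing key
         * never leaves the build server / HSM. */
        return crypto_sign_verify_detached(sig, image, image_len,
                                           pubkeys[key_slot]) == 0;
    }

    int main(void)
    {
        if (sodium_init() < 0)
            return 1;
        uint8_t image[16] = {0};
        uint8_t sig[crypto_sign_BYTES] = {0};
        printf("valid: %d\n", firmware_is_valid(image, sizeof image, sig, 0));
        return 0;
    }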

You also needed a data migration system so user customizations weren't wiped out. My newish car, to this day, resets most user settings when it gets an OTA. No wonder people turn off automatic updates.
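The usual trick is to version the settings blob and migrate it one step at a time, defaulting only the new fields. A tiny sketch, with a completely made-up settings layout:

    /* Sketch: versioned user-settings migration so an OTA doesn't reset
     * everything to defaults. Struct layout and version numbers are made up. */
    #include <stdint.h>
    #include <stdio.h>

    typedef struct {          /* settings layout as of firmware v1 */
        uint8_t brightness;
        uint8_t volume;
    } settings_v1_t;

    typedef struct {          /* v2 added a units field */
        uint8_t brightness;
        uint8_t volume;
        uint8_t units_metric; /* new in v2, needs a sane default */
    } settings_v2_t;

    typedef struct {
        uint32_t version;
        union { settings_v1_t v1; settings_v2_t v2; } u;
    } settings_blob_t;

    /* Upgrade the stored blob one version at a time until it matches the
     * running firmware; an unknown newer version would trigger a reset. */
    static void migrate_settings(settings_blob_t *s)
    {
        if (s->version == 1) {
            settings_v2_t next = {
                .brightness   = s->u.v1.brightness,  /* carry user choices over */
                .volume       = s->u.v1.volume,
                .units_metric = 1,                   /* default only the new field */
            };
            s->u.v2 = next;
            s->version = 2;
        }
    }

    int main(void)
    {
        settings_blob_t stored = { .version = 1, .u.v1 = { .brightness = 80, .volume = 30 } };
        migrate_settings(&stored);
        printf("v%u brightness=%u volume=%u metric=%u\n",
               (unsigned)stored.version, (unsigned)stored.u.v2.brightness,
               (unsigned)stored.u.v2.volume, (unsigned)stored.u.v2.units_metric);
        return 0;
    }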

The really good systems also used realistic device simulators to measure impact before even pushing things out. And you definitely tested for communication failures and interruptions. Like, yoink the power cable mid-update and then watch what happens after power is back on. Yes, it's costly and time-consuming, but consider the alternatives.
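You can get a long way with a host-side version of that test before touching hardware: cut the simulated write at every possible offset and check that the updater would never accept a partial image. A rough sketch, with an illustrative CRC check standing in for whatever integrity check the real updater uses:

    /* Sketch of a host-side power-loss test: simulate losing power after every
     * possible byte of the staged write, then check that no partial image
     * passes the integrity check. CRC and layout are illustrative. */
    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    #define IMG_LEN 64

    static uint32_t crc32_simple(const uint8_t *p, size_t n)
    {
        uint32_t crc = 0xFFFFFFFFu;
        for (size_t i = 0; i < n; i++) {
            crc ^= p[i];
            for (int b = 0; b < 8; b++)
                crc = (crc & 1) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
        }
        return ~crc;
    }

    int main(void)
    {
        uint8_t new_image[IMG_LEN];
        for (int i = 0; i < IMG_LEN; i++) new_image[i] = (uint8_t)i;
        uint32_t good_crc = crc32_simple(new_image, IMG_LEN);

        int accepted_partial = 0;
        for (size_t cut = 0; cut < IMG_LEN; cut++) {   /* power fails after `cut` bytes */
            uint8_t staging[IMG_LEN];
            memset(staging, 0xFF, sizeof staging);     /* erased flash */
            memcpy(staging, new_image, cut);           /* interrupted write */
            /* On the next boot, the updater must reject anything whose
             * integrity check fails before switching slots. */
            if (crc32_simple(staging, IMG_LEN) == good_crc)
                accepted_partial++;
        }
        printf("partial images wrongly accepted: %d\n", accepted_partial);
        return 0;
    }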

The ones that failed the most were the teams that spent months or years developing the basic system, then tacked the update mechanism on at the end as part of deployment. Since firmware update wasn't as sexy as developing cool new tech, it got doled out to lower-tier devs who didn't know what they were doing. Also, doing it at the end of the project meant it was often the least-tested feature.

The other sin was waiting months before rolling out updates, so there were lots of changes packed into one update, which made a small failure have a huge blast radius.

These were all technical management failures. A robust update system should be right up front in the project plan, built by your best engineers, and wired into the CI/CD pipeline from the start.

Just for context, the worst headline I had was for update failure in a line of hospital infant incubators.



Great insights into what goes on when developing firmware and updating it! I've never worked on that side of things, so I was always curious what the development and testing process looked like.



