Two types of engineering for resiliency

Feb 9, 2018 Tags: #risk #engineering

I spend quite a lot of time thinking about the way we think.

In particular, I think a lot about the way we make engineering choices that mitigate risks. So, this is another brief note on the subject, following the beautiful metaphor I overheard:

There are two metaphors for resiliency and risk management in traditional engineering — the NASA way and the US Navy way.

The NASA way is about provably correct systems, consistent engineering, repeatable practices, and either forbidding system behaviors that impose great risk or working around them with systematic redundancy. It’s expensive and tedious because any mistake can be fatal, but the premise is that unpredictable events can be modeled and designed against.
The US Navy way is all about keeping personnel, processes, and tools in a shape that helps manage SNAFU in the most efficient way, with the shortest time to recovery, on the premise that unpredictable events are just that: unpredictable.

I’m not entirely sure how accurately they reflect reality, given the story of the Apollo Guidance Computer’s design and the fact that the US Navy runs on pretty sophisticated tech as well, but they work well to make the point.

In modern computer engineering and security, we’re rarely thorough enough to go full-on NASA on many things, and we jump too quickly to the US Navy way for things we’ve barely learned to cope with. This is a natural reaction (wanting to cope with risks fast), but I suspect that many risks can be engineered against much more efficiently using the NASA way.

This paradox has intrigued me for years, and now I have an illustrative metaphor to put it into words.

The NASA way makes design mistakes fatal, whereas the US Navy way converts design mistakes into a performance or expenditure penalty of some kind. The reason why computer (and specifically, security) engineering intuitively shifted to the latter isn’t that obvious. The distinction between the NASA way and the Navy way lies in the nature of the risks.

Engineering against laws of nature is a deterministic, analytical process: it’s building against risks that are quantifiable and that can be formalized in a succinct scientific statement. In practice, that’s an iterative process, with its own trial and error and hypothesis-test-correction loops, but each new thing you learn and systematize gives you predictive power against some risks.

If you can model most of the risks, and each failure teaches you to model better, sooner or later you will reach the surface of the moon. But can you model a civil war unfolding in a country populated by people you don’t understand? Not really.

Engineering against unknown, non-deterministic processes, where results are hardly quantifiable until it’s too late and most tests give you only a distant reflection of reality, is much harder. That’s why it seems intuitively reasonable to go the Navy way when you deal with things like building security systems.

But this has a few trade-offs.

Trade-off 1. Most of the uncertainty of modern engineering comes from shitty engineering, not from some magical source of randomness. The more shitty engineering we flood the market with, the less deterministic the end result is, and the less systematic our infrastructures and defenses are:

Cost optimization leads us to more JS programmers, but when we strive for reliability, we end up spending more on massive online, on-call, holistic engineering and ops.
Cost optimization leads us to code written with fewer security considerations up front, and to massive sets of reactive and detective security controls.
Cost optimization leads us to more pentests and fewer SSDLC processes, because you don’t really know what you’re shipping in 3 sprints and you can’t afford to have a security team nearby each sprint.

Trade-off 2. We often forget that the US Navy still relies on plenty of advanced and reliable technology constructed the NASA way. The baseline for situational awareness is being able to get good data and to apply reaction tactics using predictable tools with predictable results. Relying fully on reactive tooling that is built ad hoc around known risks is only as efficient as the risks are repetitive. In reality, they’re not repetitive, so you play a catch-up game all the time.

In practice, designs are rarely 100% Navy or 100% NASA. But when design choices are being made, it makes sense to look at the risks before choosing the design style: what exactly are you engineering against? Can it be modelled precisely via a simple equation, or does managing its complexity require real-time resource allocation rather than pre-designing things?
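To make the "simple equation" side of that question concrete, here is a minimal sketch of the two textbook quantities the styles tend to optimize: the chance that every redundant copy fails at once (NASA-ish), and steady-state availability as a function of time to recovery (Navy-ish). The function names and all the numbers are my own assumptions for illustration, and the redundancy formula only holds if failures are independent, which is exactly the kind of assumption that stops holding once the risks are not ones you can model.

```python
# Illustrative sketch only: names and numbers are assumptions, not measurements.


def redundant_failure_probability(p_single: float, copies: int) -> float:
    """Probability that all redundant copies fail together.

    Assumes failures are independent, so the probabilities multiply.
    """
    return p_single ** copies


def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a repairable system is up: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # NASA-style: triple redundancy on a part that fails 1% of the time.
    print(redundant_failure_probability(0.01, 3))

    # Navy-style: same failure rate, but recovery drops from 4 hours to 15 minutes.
    print(steady_state_availability(500.0, 4.0))
    print(steady_state_availability(500.0, 0.25))
```

If you can defend the independence assumption and put a number on the single-copy failure rate, the first function is the one worth investing in. If you can't, the only lever left is the second one: shrinking time to recovery.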

ivychapel.ink
