Military standard on software control levels

Posted by ibobev 12 hours ago


Comments

Comment by AlotOfReading 11 hours ago

A lot of people look at safety-critical development standards to try to copy process bits for quality. In reality, 90% of the quality benefits come from sitting down to think about the software and its role in the overall system. You don't need all the fancy methodologies and expensive tools. That thinking is also the main benefit you get from formal methods.

I've found that a quality process that starts with "you need to comprehensively understand what you're engineering" is almost universally a non-starter for anyone not already using these things. Putting together an exhaustive list of all the ways code interacts with the outside world is hard. If a few engineers actually manage it, they're rarely empowered to make meaningful decisions on whether the consequences of failures are acceptable or fix things if they're not.

Comment by kqr 11 hours ago

It doesn't help that many of the popular methodologies focus entirely on failures. They ask a bunch of questions in the style of "how likely is it that this part fails?" "what happens if it fails?" "how can we reduce the risk of it failing?" etc. But software never fails[1] so that's the wrong approach to start from!

Much better to do as you say and think about the software and its role in the system. There are more and less formal ways to do this, but it's definitely better than taking a component view.

Comment by aidenn0 2 hours ago

Systems containing software fail, and the cause of that failure may originate in software.

And the article you intended to link is just wrong. E.g. the Therac-25 was not designed to output high power when an operator typed quickly; it was merely built in such a way that it did. That's analogous to describing an airplane failure caused by bolts that were too weak as: "the bolt didn't fail; it broke under exactly the forces you would expect it to break under, given its size; if they wanted it not to break, they should have used a larger bolt!" Just as in the Therac example, the failure would be consistently reproducible.

Comment by ryandrake 11 hours ago

FYI you added a [1] but didn't add the link to whatever you were going to reference!

Comment by teddyh 11 hours ago

It could have been this:

“The reason is that, in other fields [than software], people have to deal with the perversity of matter. [When] you are designing circuits or cars or chemicals, you have to face the fact that these physical substances will do what they do, not what they are supposed to do. We in software don't have that problem, and that makes it tremendously easier. We are designing a collection of idealized mathematical parts which have definitions. They do exactly what they are defined to do.

And so there are many problems we [programmers] don't have. For instance, if we put an ‘if’ statement inside of a ‘while’ statement, we don't have to worry about whether the ‘if’ statement can get enough power to run at the speed it's going to run. We don't have to worry about whether it will run at a speed that generates radio frequency interference and induces wrong values in some other parts of the data. We don't have to worry about whether it will loop at a speed that causes a resonance and eventually the ‘if’ statement will vibrate against the ‘while’ statement and one of them will crack. We don't have to worry that chemicals in the environment will get into the boundary between the if statement and the while statement and corrode them, and cause a bad connection. We don't have to worry that other chemicals will get on them and cause a short-circuit. We don't have to worry about whether the heat can be dissipated from this ‘if’ statement through the surrounding ‘while’ statement. We don't have to worry about whether the ‘while’ statement would cause so much voltage drop that the ‘if’ statement won't function correctly. When you look at the value of a variable you don't have to worry about whether you've referenced that variable so many times that you exceed the fan-out limit. You don't have to worry about how much capacitance there is in a certain variable and how much time it will take to store the value in it.

All these things are defined a way, the system is defined to function in a certain way, and it always does. The physical computer might malfunction, but that's not the program's fault. So, because of all these problems we don't have to deal with, our field is tremendously easier.”

— Richard Stallman, 2001: <https://www.gnu.org/philosophy/stallman-mec-india.html#conf9>

Comment by lukan 9 hours ago

Rowhammer, cosmic bit flips, hardware faults, or just plain compiler bugs come to mind.

Comment by kqr 40 minutes ago

The first three are hardware failures, not software failures. The last would be a design error, not a failure.

The software may need to handle hardware failures, but software that doesn't do that also doesn't fail -- it's inadequately designed.

Comment by teddyh 9 hours ago

None of those are something that you as a programmer should ever worry about.

Comment by BoppreH 8 hours ago

Counterpoint, I have definitely taken them into consideration when designing my backup script. It's the reason why I hash my files before transferring, after transferring, and at periodic intervals.

And if you're designing a Hardware Security Module, as another example, I hope that you've taken at least rowhammer into consideration.
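Roughly the pattern for the hashing part (a toy sketch, not my actual script; the chunk size and SHA-256 are just the choices I'd reach for):

```python
import hashlib
import shutil
from pathlib import Path

def sha256(path: Path) -> str:
    """Hash a file in chunks so large files don't exhaust memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def copy_verified(src: Path, dst: Path) -> str:
    """Copy src to dst, hashing before and after the transfer."""
    before = sha256(src)
    shutil.copy2(src, dst)
    if sha256(dst) != before:
        raise IOError(f"hash mismatch after copy: {src} -> {dst}")
    return before  # keep this for the periodic re-checks

def periodic_check(path: Path, expected: str) -> bool:
    """Re-hash later to catch silent corruption (bit flips, bad disks)."""
    return sha256(path) == expected
```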

Comment by AlotOfReading 6 hours ago

I consider these all the time as a programmer. Particularly compiler/toolchain bugs, which are relatively common once you start looking for them.

Comment by lo_zamoyski 9 hours ago

He makes a valid distinction, in a very specific sense. As long as we understand a program correctly, then we understand its behavior completely [0]. The same cannot be said of spherical cows (which, btw, can be modeled by computers, which means programs inherit the problems of the model, in some sense, and all programs model something).

However, that "as long as" is doing quite a bit of work. In practice, we rarely have a perfect grasp of a real world program. In practice, there is divergence between what we think a program does and what it actually does, gaps in our knowledge, and so on. Naturally, this problem also afflicts mathematical approximations of physical systems.

[0] And even this is not entirely true. Think of a concurrent program. Race conditions can produce all sorts of weird results that are unpredictable. Perfect knowledge of the program will not tell you what the result will be.
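A minimal sketch of what I mean (the explicit yield just widens a race window that is already there):

```python
import threading
import time

counter = 0

def bump(n: int) -> None:
    global counter
    for _ in range(n):
        value = counter      # read
        time.sleep(0)        # yield; another thread may run here
        counter = value + 1  # write back: updates are lost if one did

threads = [threading.Thread(target=bump, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Perfect knowledge of the program still won't predict this number;
# it depends on whichever interleaving the scheduler happened to pick.
print(counter)  # usually well below 40000, and different on each run
```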

Comment by gmueckl 9 hours ago

While it is conceivably possible to write perfect software that will run flawlessly on a perfect computer forever, the reality is that the computer it runs on and the devices it controls will eventually fail - it's just a question of when and how, never if. A device that hasn't failed during its lifespan was simply not used long enough to fail.

In light of this, even software development has to focus on failures when you apply this standard. And that does include considerations like failures occurring within the computer itself (faulty RAM or a faulty CPU core).

Comment by lo_zamoyski 10 hours ago

Well, the failure in question is not the part failing to do what it is objectively defined to do, it is a failure to perform as we expect it to. Meaning, the failure is ours. Inductively, for `x` to FAIL means that either we failed to define `x` properly, or the `y` that simulates `x` (compiler, whatever...) has FAILed.

Of course, the notion of "failure" itself presupposes a purpose. It is a normative notion, and there is no normativity without an aim or a goal.

So, sure, where human artifacts are concerned, we cannot talk about a part failing per se, because unlike natural kinds (like us, where the norm is intrinsic to us, hence why heart failure is an objective failure), the "should" or "ought" of an artifact is a matter of external human intention and expectation.

And as it turns out, a "role in a system" is precisely a teleological view. The system has an overall purpose (one we assign to it), and the role or function of any part is defined in terms of - and in service to - the overall goal. If the system goes from `a->d`, and one part goes from `a->b`, another `b->c`, and still another `c->d`, then the composition of these gives us the system. The meaning of the part comes from the meaning of the whole.
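Concretely, as a toy sketch (the stages and values are arbitrary):

```python
from functools import reduce

# Each part's "role" is a function; the system is their composition.
def stage1(a): return a + 1   # a -> b
def stage2(b): return b * 2   # b -> c
def stage3(c): return c - 3   # c -> d

def compose(*fns):
    return lambda x: reduce(lambda acc, f: f(acc), fns, x)

system = compose(stage1, stage2, stage3)  # a -> d

# A stage only "fails" relative to the role the whole assigns it:
# stage2 doubling its input is wrong only if the system needed tripling.
print(system(5))  # 9
```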

Comment by MobiusHorizons 11 hours ago

I also generally find that people looking for “best practices” to follow are trying to avoid that “sitting down to think about the software and its role in the overall system” piece.

Comment by mubbicles 11 hours ago

Another good document for military standards for software safety is AOP-52.

Has some fun anecdotes in it. My favorite is the nuclear-certified supersonic aircraft with a latent defect discovered during integration of a new subsystem. It turns out all of the onboard flight computers crashed at the transition from subsonic to supersonic; thankfully the aircraft had enough inertia to "ride through" all of its flight computers simultaneously crashing at the transonic boundary.

Moral of that story is your software people need to have the vocabulary to understand the physical properties of the system they're working on.

Comment by inamberclad 10 hours ago

Absolutely. If you look at an extensively used standard like DO-178C for avionics, it really says very little about how to program. Instead, the emphasis is on making sure that the software has implemented system level requirements correctly.

Comment by trklausss 8 hours ago

I see the fancy methodologies and processes as the way of streamlining what you have to do in order to "sit down to think about the software", particularly in teams of more than one developer.

Most of it happens, as always, at the interface. So these methodologies help you manage these interfaces between people, machine and product.

Comment by jcims 10 hours ago

>Putting together an exhaustive list of all the ways code interacts with the outside world is hard.

Maintaining it over time is even harder.

Comment by tehjoker 11 hours ago

I think the main benefit of these standards is that when someone proposes a project, the level gets evaluated and, in an ideal world, either enough (and appropriate) resources are allocated or the project is killed.

Comment by AlotOfReading 11 hours ago

You'd hope. That's not always my experience. What I often see is cutting random bits off the development plan until the resource constraints are nominally satisfied, without much regard for whether the resulting plan is sensible. That's if there's a plan. Sometimes these systems get randomly assigned a level based on vibes, with the expectation that someone will later go back and fix the level if it's incorrect. This works about as well as commented TODOs.

Comment by exe34 10 hours ago

it's cargo culting. we see the same thing with "agile", which is often used as an excuse to just do what they were going to do anyway.

they want the benefits, and are willing to do everything except the things that actually help.

Comment by ldx1024 1 hour ago

"Although the standard is a little more complicated..."

If you have ever read the software control category definitions in MIL-STD-882E, you know that the definitions this blog author gives are very much his interpretation. The actual definitions in 882E are a god-awful mess: multiple contradictory definitions for the same category, plus parenthetical statements that are intended to clarify but just muddy the picture further. Yikes...

Comment by svilen_dobrev 8 hours ago

i prefer the "criticality" categorization of Alistair Cockburn in his Crystal Clear methodology [1] (funny, none of the hundreds of copycats includes that; it's only findable in the book itself, pp. ~240):

""" A second important dimension is criticality, the potential damage caused by an undetected defect: loss of comfort (C), loss of discretionary moneys (D), loss of essential moneys (E), and loss of life (L). """

(my rephrasing): he points out that the further one moves down that list, the more hardened/disciplined the way of working should be, from "anything goes" at the beginning to "no exceptions whatsoever" at the end.
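Roughly, as a toy encoding (the enum and the discipline labels are my paraphrase, not Cockburn's exact words):

```python
from enum import Enum

class Criticality(Enum):
    COMFORT = "loss of comfort"
    DISCRETIONARY_MONEY = "loss of discretionary moneys"
    ESSENTIAL_MONEY = "loss of essential moneys"
    LIFE = "loss of life"

# The further down the ladder, the more hardened the process must be.
REQUIRED_DISCIPLINE = {
    Criticality.COMFORT: "anything goes",
    Criticality.DISCRETIONARY_MONEY: "lightweight review and tests",
    Criticality.ESSENTIAL_MONEY: "mandatory review, audit trail, sign-off",
    Criticality.LIFE: "no exceptions whatsoever",
}
```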

[1] https://www.researchgate.net/publication/234820806_Crystal_c...

Comment by renewiltord 10 hours ago

To be honest, I’m not going to take advice from the guys who have to reboot their machines every 30 days or they won’t work.

Comment by superxpro12 9 hours ago

well, if the project manager had written that requirement, maybe they would have gotten longer uptimes!
