Bit flips: How cosmic rays grounded a fleet of aircraft
Posted by signa11 7 days ago
Comments
Comment by chris_va 3 days ago
Part of my work touches high power switches. I am going to do a bad job relating this story, but one of the power engineers was talking about how electric train switches in the EU (Switzerland?) were having triggering issues. These were big MW-scale IGBTs, not something you want to false trigger. Anyway, they eventually traced the problem to cosmic rays, and just turned the entire package vertical so the die was end-on to space (the mountains around were shielding the horizontal direction), and the problem went away.
Comment by Neywiny 2 days ago
Comment by bloomingeek 2 days ago
Comment by SkiFire13 2 days ago
That's a pretty cool solution! For some reason I was expecting something a lot more elaborate.
Comment by actionfromafar 3 days ago
Comment by CamperBob2 2 days ago
Comment by RankingMember 3 days ago
Comment by RealityVoid 3 days ago
Comment by RachelF 2 days ago
Most modern servers at least implement ECC on their RAM. I would expect flight electronics to be designed to a higher standard.
Comment by 15155 2 days ago
Comment by skylurk 2 days ago
Comment by ahartmetz 2 days ago
Not going to happen. The potentially huge cost to their reputation alone makes it not worth it, the modification would cost money and make logistics more difficult, and the plane couldn't be used (or sold) worldwide anymore.
Comment by skylurk 2 days ago
https://www.nytimes.com/2019/05/05/business/boeing-737-max-w...
Comment by serial_dev 2 days ago
Two different companies.
Boeing has had tons of failures recently; flight search services started adding filters for the aircraft type because people didn’t want to fly on them.
Airbus is doing better for now, hopefully it will stay that way.
Comment by skylurk 2 days ago
Comment by RealityVoid 2 days ago
I wonder why you disagree about this? The only reasons I can think of are:
- The same software on the same hardware with the same lifecycle would probably have the same issue (vendor diversity would fix this).
- The consensus-building unit is still a possible single point of failure.
Any other reasons you might doubt it as a methodology? It seems to have worked pretty well for Airbus and the failure rate is pretty low, so it obviously is functional.
Modern units, I'm sure, have ECC and redundancy as well.
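To make the redundancy-plus-consensus idea concrete, here is a minimal sketch of a majority voter over redundant channels. It is only an illustration: the function name `majority_vote` and the sample readings are made up for this example, and nothing here reflects how Airbus or any real flight computer implements voting.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(readings: Sequence[int]) -> Optional[int]:
    """Return the value reported by a strict majority of channels, else None."""
    if not readings:
        return None
    value, count = Counter(readings).most_common(1)[0]
    # Require more than half of the channels to agree; otherwise refuse to pick.
    return value if count * 2 > len(readings) else None


if __name__ == "__main__":
    good = 0x1F40                 # hypothetical sensor word
    flipped = good ^ (1 << 9)     # the same word with one bit upset
    print(majority_vote([good, flipped, good]))      # two of three agree
    print(majority_vote([good, flipped, good ^ 1]))  # no majority -> None
```

Note that the voter itself runs on some piece of hardware, which is exactly the single-point-of-failure concern raised above.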
Comment by skylurk 1 day ago
Comment by RealityVoid 1 day ago
Regardless, there are multiple fronts you need to tackle to have high reliability so you should use all techniques at your disposal.
Comment by p_l 1 day ago
Good part selection and a different EMI environment meant the calculated risks from not having ECC were considered too low to care about, and the idea that they might have to deal with radiation other than from flying near a nuclear explosion arrived only after the specific devices had been designed.
Comment by thegrim33 2 days ago
Comment by RealityVoid 2 days ago
But for the 2008 incident, I read and linked the report, and that was what happened. The ADIRU unit probably did get an SEU event, and that should have been mitigated by the design of the ELAC unit. The ELAC unit failed to mitigate it, so that's the part that they probably fixed.
Comment by N19PEDL2 3 days ago
Comment by RealityVoid 3 days ago
https://www.atsb.gov.au/sites/default/files/media/3532398/ao...
My reaction was initially that it was a cop out, but looking a bit in the report and thinking things through, I think that, yes, it's most likely a bit flip.
Comment by meatmanek 2 days ago
Comment by RealityVoid 2 days ago
Comment by DecentShoes 3 days ago
Comment by serial_dev 3 days ago
Comment by avazhi 3 days ago
Bit of an understatement. I don’t think there are any active passenger airliners in the first world today that aren’t fly-by-wire. The MD-80 was the last of its kind and it’s been out of passenger operation for what, 10 years now?
Comment by Stevvo 3 days ago
However, that doesn't eliminate the possibility of these errors. Whilst the flight controls are mechanically linked, the autopilot/trim is electric, so it is still susceptible to bit flips.
Comment by drob518 3 days ago
Comment by buckle8017 2 days ago
The plane is fundamentally unstable because of the huge engines (which they have to improve fuel efficiency).
The only way to correct that is software with an angle of attack sensor.
They only installed one sensor though.
Does that sound like a software error or a fundamental physical design flaw?
Comment by xeonmc 2 days ago
Boeing wanted to pretend there is no difference at all, to skip retraining.
Comment by dehrmann 2 days ago
I'll leave the googling to you, but this isn't true. The plane isn't fundamentally unstable, and certainly not like a modern fly-by-wire fighter.
Comment by Stevvo 2 days ago
It might help to read what aerodynamic instability actually means before making such a claim: https://en.wikipedia.org/wiki/Stability_derivatives
Comment by buckle8017 2 days ago
That's why there's an angle of attack sensor, to keep the plane outside of that failure range.
Comment by SoftTalker 3 days ago
Comment by BurningFrog 3 days ago
Comment by RealityVoid 3 days ago
Comment by neko_ranger 3 days ago
Comment by charcircuit 3 days ago
Comment by 0manrho 2 days ago
What if there is reason to consider it as it is actually a known, proven, observable phenomenon, especially one with greater likelihood/intensity as you climb in altitude, like planes do, and that likelihood/intensity also scales with solar cycle intensity, which we are currently experiencing the peak of?
Or perhaps you think the Aurora Borealis are because of Aliens too?
Comment by charcircuit 2 days ago
The article rebutted that claim, saying that day was average. There are other things that can make bit flips more likely, like heat.
Comment by 0manrho 2 days ago
It did not. The article itself acknowledged that there is certainly reason to consider it a possibility, predicated on the fact that the people that make the thing stated as such and that experts in the field agree it's also a risk in general, but wasn't particularly high that day.
Average activity is not no activity. Average risk is not No risk.
And even if it wasn't the issue in that instance, it's not hard to reason why it's worth hardening against such a possibility in the absence of any other explanation, given that just days later "sensors mounted on UK weather balloons at 40,000ft (12km) measured one of the largest radiation events to hit Earth in roughly two decades."
Airbus wouldn't have grounded these planes if there were "no reason to believe" that a known, proven, and observed phenomenon might have been the culprit, and it is not on the level of something we as yet have no proof of, to be generous in characterizing your comparison of it to aliens.
Comment by XorNot 2 days ago
"If the signal looks amazingly strong but unexpected and sharp, it's probably a cosmic ray" was what I was trained for.
Comment by charcircuit 2 days ago
Comment by on_the_train 2 days ago
Random bit flips are pretty much always bad hardware. That's what the literature says when you actually study it. And that's also what we find at work: we wrote a program that occupied most of the free RAM and checked it for bit flips. We deployed it on a sizeable fleet of machines and found exactly that: yes, there were bit flips, but they were highly concentrated on specific machines and disappeared after changing hardware.
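A checker of the kind described above can be sketched roughly as follows. This is not the commenter's program: the fill pattern `PATTERN`, the function names, and the buffer size are assumptions for illustration. A real tool would also need to pin pages, defeat caches, and run for days, none of which plain Python does.

```python
from typing import List, Tuple

PATTERN = 0xAA  # assumed fill byte (alternating bits); any known pattern works


def allocate_canary(size_bytes: int) -> bytearray:
    """Fill a buffer with a known pattern so later changes are detectable."""
    return bytearray([PATTERN]) * size_bytes


def find_flips(buf: bytearray) -> List[Tuple[int, int]]:
    """Return (offset, xor_mask) for every byte that no longer matches."""
    if buf == bytes([PATTERN]) * len(buf):  # fast path: nothing changed
        return []
    return [(i, b ^ PATTERN) for i, b in enumerate(buf) if b != PATTERN]


if __name__ == "__main__":
    buf = allocate_canary(1024)
    buf[100] ^= 0x04          # simulate a single-bit upset at offset 100
    print(find_flips(buf))    # [(100, 4)]
```

Logging the offsets per machine is what would let a fleet operator see whether flips cluster on particular hosts, which is the observation reported above.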
Comment by ExoticPearTree 2 days ago
This is actually a thing. Cisco had issues with cosmic radiation in some of their equipment a few years back. Same symptoms: random memory corruption, and when they would test the memory everything would check out, but once in a blue moon, the routers would behave erratically.
Comment by financetechbro 2 days ago
Comment by charcircuit 2 days ago
Comment by adrian_b 2 days ago
So shielding is not a solution that can be applied in a vehicle. You need something like an underground bunker to be sure that no cosmic radiation can penetrate it.
The only reason events caused by cosmic radiation are rare is that particles able to pass through shielding will, in most cases, also pass through the electronic devices without being absorbed and causing a malfunction.
Comment by burnt-resistor 2 days ago
0. Always use a) SECDED hardware ECC and b) checksums on network links and I/O everywhere.
1. When unable to do 0.a), add a (72,64) Hamming code (8 check bits per 64 data bits) or N>2 redundant copies on physically separate silicon for critical data and code. This is a significant performance hit, but safety is more important in some uses. (Don't neglect the integrity and reliability of code storage, loading, and execution paths either.)
2. Consider Space Shuttle-style high-availability, high-reliability "voting" among N system control elements with identically designed behavior, possibly from different manufacturers.
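The software SECDED option in the list above can be sketched as an extended Hamming code: 7 Hamming check bits plus one overall parity bit protecting 64 data bits. The bit layout below (overall parity at index 0, check bits at power-of-two positions) is one textbook arrangement chosen for illustration; real ECC hardware may use a different layout, and the names `encode` and `decode` are hypothetical.

```python
from typing import List, Optional, Tuple


def encode(data: int, data_bits: int = 64) -> List[int]:
    """Encode data into an extended Hamming codeword (72 bits for 64 data bits)."""
    r = 0
    while (1 << r) < data_bits + r + 1:   # number of Hamming check bits
        r += 1
    n = data_bits + r
    code = [0] * (n + 1)                  # index 0 holds the overall parity bit
    bit = 0
    for pos in range(1, n + 1):
        if pos & (pos - 1):               # not a power of two: data position
            code[pos] = (data >> bit) & 1
            bit += 1
    for i in range(r):                    # check bit p covers positions with bit p set
        p = 1 << i
        code[p] = sum(code[pos] for pos in range(1, n + 1) if pos & p) & 1
    code[0] = sum(code) & 1               # makes total parity even
    return code


def decode(code: List[int]) -> Tuple[str, Optional[int]]:
    """Return (status, data): corrects one flipped bit, detects two."""
    n = len(code) - 1
    word = list(code)
    syndrome = 0
    for pos in range(1, n + 1):
        if word[pos]:
            syndrome ^= pos               # XOR of set positions is 0 when valid
    overall_odd = sum(word) & 1
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd and syndrome <= n:
        word[syndrome] ^= 1               # syndrome 0 means the parity bit itself
        status = "corrected"
    else:
        return "uncorrectable", None      # even parity with nonzero syndrome
    data, bit = 0, 0
    for pos in range(1, n + 1):
        if pos & (pos - 1):
            data |= word[pos] << bit
            bit += 1
    return status, data


if __name__ == "__main__":
    cw = encode(0xDEADBEEFCAFEF00D)
    cw[37] ^= 1                           # single upset
    print(decode(cw)[0])                  # corrected
    cw[5] ^= 1                            # second upset in the same word
    print(decode(cw)[0])                  # uncorrectable
```

Doing this in software costs a full encode and decode per protected word, which is the performance hit mentioned in item 1.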
Comment by Borrible 2 days ago
Comment by djmips 2 days ago
Comment by who-shot-jr 2 days ago
Comment by SwiftyBug 3 days ago
Comment by procflora 3 days ago
There was still hardware redundancy though. Operation of the plane's elevator switched to a secondary computer. Presumably it was also running the same vulnerable software, but they diverted and landed early in part to minimize this risk.
So not just redundancy but layers of redundancy.
Comment by willis936 3 days ago
Comment by p_l 3 days ago
Comment by 15155 2 days ago
Comment by p_l 1 day ago
Comment by bdangubic 3 days ago
if (cosmic_ray) {
    do_not_flip_bits()
} else {
    flip_away()
}
Comment by rjp0008 3 days ago
Comment by sunrunner 3 days ago
try {
    do_action()
} catch (BitFlipError e) {
    logger.critical("Shouldn't get here")
}
Ask-for-forgiveness as an error detection pattern avoids these kinds of errors entirely.
Comment by terminalshort 2 days ago
int cosmic_ray = 0
if (bool(cosmic_ray)) {
    throw cosmicRayException()
}
Comment by wavemode 3 days ago
Comment by air7 2 days ago
Comment by djmips 2 days ago
Comment by MarkusQ 3 days ago
Comment by RealityVoid 3 days ago
Comment by AlotOfReading 3 days ago
Comment by MarkusQ 3 days ago
Comment by adrian_b 2 days ago
Despite that, most of the existing planes have an older model, which has not been upgraded.
Comment by AlotOfReading 3 days ago
Comment by preommr 3 days ago
Comment by nomel 3 days ago
Comment by aruametello 3 days ago
We had "historically bad solar weather" a bunch of years ago, and I told a cyber cafe operator that "you could have more computers bluescreen this week than usual".
To me it got really weird when he said later that he really did, but honestly it's 50/50 that it could have been just incidental.
On another note, there are some "rather intense" discussions when someone speedrunning a game gets an "unreproducible glitch" in their favor: some claim it's a flaw from ageing DRAM hardware, but some always point out that it could be a cosmic ray flipping the right bit. (https://tildes.net/~games/1eqq/the_biggest_myth_in_speedrunn...)
Comment by mikestew 3 days ago
Oooh, in that case I have another xkcd you might like, involving mint candies and soft drinks…
Comment by jessriedel 3 days ago
Comment by pengaru 3 days ago
What do you think this adds? These things are sycophantic, confident idiots; at the slightest challenge they will agree that they're incorrect within the same interaction.
Comment by jessriedel 2 days ago
Comment by pengaru 2 days ago
Comment by jessriedel 1 day ago
Comment by RealityVoid 3 days ago
Comment by jessriedel 2 days ago
> In any case, the software updates rolled out by the company appear to be quick and easy to install. Many airlines completed them within hours. The software works by inducing "rapid refreshing of the corrupted parameter so it has no time to have effect on the flight controls", Airbus says. This is, in essence, a way of continually sanitising computer data on these aircraft to try and ensure that any errors don't end up actually impacting a flight.
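The quoted description is thin on detail, so the following is only a toy model of why refreshing a parameter more often limits the damage from an upset, not a description of the actual Airbus fix. It assumes the source the parameter is refreshed from stays intact; the function name `exposure_cycles` and all numbers are invented for illustration.

```python
def exposure_cycles(refresh_period: int, total_cycles: int, upset_cycle: int,
                    source_value: int = 1000, upset_mask: int = 1 << 7) -> int:
    """Count control cycles that would consume a corrupted working copy."""
    working = source_value
    bad = 0
    for cycle in range(total_cycles):
        if cycle % refresh_period == 0:
            working = source_value      # refresh the working copy from its source
        if cycle == upset_cycle:
            working ^= upset_mask       # simulated bit flip in the working copy
        if working != source_value:
            bad += 1                    # a control law would see bad data this cycle
    return bad


if __name__ == "__main__":
    print(exposure_cycles(refresh_period=50, total_cycles=100, upset_cycle=10))  # 40
    print(exposure_cycles(refresh_period=1, total_cycles=100, upset_cycle=10))   # 1
```

In this model the corrupted value survives only until the next refresh, so shortening the refresh period shrinks the window in which it can influence anything downstream.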
Comment by RealityVoid 1 day ago
I take my analysis on this back; it was true only for the other incident. I can't edit my answers anymore. I'm not sure what is going on with this failure, and I would love to read a detailed analysis report like the one I went through for the other incident.