Bit flips: How cosmic rays grounded a fleet of aircraft

Posted by signa11 7 days ago

Comments

Comment by chris_va 3 days ago

I highly recommend finding a cloud chamber (various science museums have them) to visualize just how much radiation is flying around.

Part of my work touches high power switches. I am going to do a bad job relating this story, but one of the power engineers was talking about how electric train switches in EU (Switzerland?) were having triggering issues. These were big MW scale IGBTs, not something you want to false trigger. Anyway, they eventually traced the problem to cosmic rays, and just turned the entire package vertical so the die was end-on to space (the mountains around were shielding the horizontal direction), and the problem went away.

Comment by Neywiny 2 days ago

Always good to support the IGBT community

Comment by bloomingeek 2 days ago

LMAO!

Comment by SkiFire13 2 days ago

> just turned the entire package vertical so the die was end-on to space (the mountains around were shielding the horizontal direction), and the problem went away.

That's a pretty cool solution! For some reason I was expecting something a lot more elaborate.

Comment by actionfromafar 3 days ago

That's very P. K. Dick. Or maybe more Heinlein.

Comment by CamperBob2 2 days ago

Actually it's very "We actually have no idea what's causing this 50 kW load switch to flake out, but turning it on its side seems to help."

Comment by RankingMember 3 days ago

It's important to note that this is just Airbus's best guess as to the cause, as there's no smoking gun: they simply exhausted their troubleshooting and were left scratching their heads so this was the "least unlikely" cause they could come up with given the circumstances.

Comment by RealityVoid 3 days ago

I thought the same, but after a deeper dive into the postmortem, I think it's not a cop-out on their side. The report is actually really well done (I personally was impressed). The reason it probably was a bit flip is that the CPU did not have EDAC in this instance, so bit flips are expected. The consensus mechanism failed in this case and that is what they are updating, because even though the module gave wrong data, presumably because of bit flips, the consensus should have prevented the dive.

Comment by RachelF 2 days ago

I would argue that designing avionics without EDAC is negligent design by Airbus.

Most modern servers at least implement ECC on their RAM. I would expect flight electronics to be designed to a higher standard.

Comment by 15155 2 days ago

Multi-module consensus is a form of EDAC - it's exceptionally unlikely that multiple units will fail identically simultaneously.
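
To make the voting idea concrete, here is a minimal sketch of a bitwise 2-of-3 majority voter, the textbook form of triple modular redundancy. The function names are made up for illustration, and this is not a claim about how any real flight computer votes.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority: each output bit is the value held by at least two inputs."""
    return (a & b) | (a & c) | (b & c)


def vote_with_status(a: int, b: int, c: int) -> tuple[int, bool]:
    """Return the voted word plus a flag saying whether the units disagreed."""
    voted = majority_vote(a, b, c)
    disagreement = not (a == b == c)
    return voted, disagreement


good = 0b1011_0010
flipped = good ^ (1 << 5)  # one unit suffers a simulated single-bit upset
value, disagreement = vote_with_status(good, flipped, good)
```

A single upset is outvoted, but two units that fail identically win the vote, which is why independence between the units matters.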

Comment by skylurk 2 days ago

Sure, until management sells a version with reduced redundancy to Ethiopia and Indonesia. Swiss cheese model and all that.

Comment by ahartmetz 2 days ago

>version with reduced redundancy

Not going to happen. The potentially huge cost to their reputation alone makes it not worth it, the modification would cost money and make logistics more difficult, and the plane couldn't be used (or sold) worldwide anymore.

Comment by skylurk 2 days ago

I think you are being sarcastic, but just in case:

https://www.nytimes.com/2019/05/05/business/boeing-737-max-w...

https://www.boeing.com/737-max-updates/mcas/

Comment by serial_dev 2 days ago

The links are Boeing and this article and thread are about Airbus.

Two different companies.

Boeing had tons of failures recently, flight search services started adding filters for the airplane because people didn’t want to fly with them.

Airbus is doing better for now, hopefully it will stay that way.

Comment by skylurk 2 days ago

Sorry, I didn't mean to be taking shots at any airplane company. I just disagree that multi-module consensus is a reliable form of EDAC. I gave a human factor example, but there are technical reasons too.

Comment by RealityVoid 2 days ago

> I just disagree that multi-module consensus is a reliable form of EDAC.

I wonder why you disagree about this? The only reasons I can think of are:

- The same SW on the same HW with the same lifecycle would probably have the same issue (vendor diversity would fix this).
- The consensus-building unit is still a possible single point of failure.

Any other reasons you might doubt it as a methodology? It seems to have worked pretty well for Airbus and the failure rate is pretty low, so... It obviously is functional.

Modern units, I'm sure, have ECC AND redundancy as well.

Comment by skylurk 1 day ago

Yes exactly, birds of a feather fail together... an A380 has three primary flight control computers, but still carries another entirely dissimilar set of three flight control computers as backup.

Comment by RealityVoid 1 day ago

Well, the diversity would cover the issue with random HW failures, not the case where your SW has a bug in it. As to the SW, they _sometimes_ have vendor diversity.

Regardless, there are multiple fronts you need to tackle to have high reliability so you should use all techniques at your disposal.

Comment by p_l 1 day ago

Until relatively recently, ECC on server RAM was there because of chip failures and, to a lesser extent, electromagnetic interference.

Good part selection and a different EMI environment meant the calculated risks from not having ECC were considered too low to care about, and the idea that they might have to deal with radiation outside of flying near a nuclear explosion arrived after the specific devices were designed.

Comment by thegrim33 2 days ago

Isn't a major feature of consensus algorithms for them to be tolerant to failures? Even basic algorithms take error handling into account and shouldn't be taken out by a bit flip in any one component.

Comment by RealityVoid 2 days ago

Yes. To clarify, my understanding of _this_ particular incident was wrong because it was based on reading the report of a previous incident.

But for the 2008 incident, for which I read and linked the report, that was what happened. The ADIRU unit probably did get an SEU event and that should have been mitigated by the design of the ELAC unit. The ELAC unit failed to mitigate it, so that's the part that they probably fixed.

Comment by N19PEDL2 3 days ago

Do you happen to have a link to that report?

Comment by RealityVoid 3 days ago

Sure.

https://www.atsb.gov.au/sites/default/files/media/3532398/ao...

My reaction was initially that it was a cop out, but looking a bit in the report and thinking things through, I think that, yes, it's most likely a bit flip.

Comment by meatmanek 2 days ago

This is for a similar incident that happened in 2008, not the Jetblue incident from October of this year.

Comment by RealityVoid 2 days ago

Oh my god, you are correct. I read the technical details and did not bother to check it's the same issue. I am mortified. Apologies.

Comment by DecentShoes 3 days ago

Just like that Mario 64 speedrunner! People say it like it's gospel, but it's really just a bunch of people's best guess. No proof.

Comment by serial_dev 3 days ago

…but if I respond with this to a user’s bug report, I’m “not taking this seriously”…

Comment by avazhi 3 days ago

“ The increasing reliance of computers in fly-by-wire systems in aircraft, which use electronics rather than mechanical systems to control the plane in the air, also mean the risk posed by bit flips when they do occur is higher.”

Bit of an understatement. I don’t think there are any active passenger airliners in the first world today that aren’t fly-by-wire. The MD-80 was the last of its kind and it’s been out of passenger operation for what, 10 years now?

Comment by Stevvo 3 days ago

Any Boeing other than 777/787 does not use fly-by-wire.

However, that doesn't eliminate the possibility of these errors. Whilst the flight controls are mechanically linked, the autopilot/trim is electric, so it is still susceptible to bit flips.

Comment by drob518 3 days ago

Still a lot of software involved in controlling the aircraft. The 737 Max incidents were eventually tracked to software quality issues, IIRC. All those old designs are being upgraded with modern avionics, so even if the airframe and linkages are old-school, the inputs are being driven by digital computers. At least that’s my understanding. I confess to not being a “plane guy,” though I have spent a lot of time traveling in planes, and I have stayed at a few Holiday Inn Express hotels.

Comment by buckle8017 2 days ago

The MAX issues are not software.

The plane is fundamentally unstable because of the huge engines (which they have to improve fuel efficiency).

The only way to correct that is software with an angle of attack sensor.

They only installed one sensor though.

Does that sound like a software error or a fundamental physical design flaw?

Comment by xeonmc 2 days ago

From what I've read, the plane was not unstable; it just handles differently, but stably. Pilots just need to do the aircraft-specific retraining, as they usually do whenever they encounter a different aircraft with different handling characteristics.

Boeing wanted to pretend there was no difference at all, to skip the retraining.

Comment by dehrmann 2 days ago

> The plane is fundamentally unstable because of the huge engines

I'll leave the googling to you, but this isn't true. The plane isn't fundamentally unstable, and certainly not like a modern fly-by-wire fighter.

Comment by Stevvo 2 days ago

The 737 Max is unstable in the pitch axis. There is no debate about that.

It might help to read what aerodynamic instability actually means before making such a claim: https://en.wikipedia.org/wiki/Stability_derivatives

Comment by buckle8017 2 days ago

Outside of the typical flight envelope it absolutely is like a fly by wire fighter.

That's why there's an angle of attack sensor, to keep the plane outside of that failure range.

Comment by SoftTalker 3 days ago

Boeing 717 is still in service and it's essentially an MD-80. Many 737s are in service and flight controls are hydraulic-boosted cable-and-pulley operated; the type design dates to the 1960s.

Comment by BurningFrog 3 days ago

Don't passenger aircraft have redundant systems, so if one computer flips, the backup takes over?

Comment by RealityVoid 3 days ago

Not to mention, the system affected by the bit flips was designed in the 90's AND newer systems have EDAC, so they are not susceptible to the same kind of issue. Honestly, if you look into the thing, the press coverage of the event is atrocious.

Comment by neko_ranger 3 days ago

I swear to god I've been got by cosmic rays modifying a bit before when my boot order changes for random reasons

Comment by charcircuit 3 days ago

I feel like using "Cosmic Rays" as a reason is equivalent to "Aliens". It makes for good clickbait so everyone is fast to point at it as the reason even if there is no reason to actually believe that the bitflip was due to cosmic rays.

Comment by 0manrho 2 days ago

> even if there is no reason to actually believe that the bitflip was due to cosmic rays.

What if there is reason to consider it as it is actually a known, proven, observable phenomenon, especially one with greater likelihood/intensity as you climb in altitude, like planes do, and that likelihood/intensity also scales with solar cycle intensity, which we are currently experiencing the peak of?

Or perhaps you think the Aurora Borealis are because of Aliens too?

Comment by charcircuit 2 days ago

>also scales with solar cycle intensity

The article rebuked that claim, saying that day was average. There are other things that can make bitflips more likely, like heat.

Comment by 0manrho 2 days ago

> The article rebuked that claim

It did not. The article itself acknowledged that there is certainly reason to consider it a possibility, predicated on the fact that the people that make the thing stated as such and that experts in the field agree it's also a risk in general, but wasn't particularly high that day.

Average activity is not no activity. Average risk is not No risk.

And even if it wasn't the issue in that instance, it's not hard to reason why it's worth hardening against such a possibility in the absence of any other explanation given just days later "sensors mounted on UK weather balloons at 40,000ft (12km) measured one of the largest radiation events to hit Earth in roughly two decades."

Airbus didn't ground these planes because there was "No reason to believe" that a known, proven, and observed phenomenon might have been the culprit, or because that phenomenon is on a level with something we as yet have no proof of, to be generous in characterizing your comparison to aliens.

Comment by XorNot 2 days ago

When you do Raman spectroscopy in a lab the software literally has an automatic cosmic ray rejection mode because for autonomy you are very likely to get cosmic ray initiated return signals over the course of a couple of hours.
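
One simple way such a rejection mode can work (a sketch, not necessarily what any particular instrument vendor does) is to take several exposures of the same spectrum and keep the per-channel median. The data below is made up for illustration.

```python
from statistics import median


def reject_spikes(acquisitions: list[list[float]]) -> list[float]:
    """Combine repeated exposures channel by channel using the median.

    A cosmic ray hit usually lands in one exposure only, so the median
    of three or more exposures discards it.
    """
    return [median(channel) for channel in zip(*acquisitions)]


# Three hypothetical exposures of the same 5-channel spectrum.
exposures = [
    [10.0, 12.0, 40.0, 12.0, 10.0],
    [10.0, 12.0, 40.0, 900.0, 10.0],  # sharp, unexpected spike in channel 3
    [10.0, 12.0, 40.0, 12.0, 10.0],
]
clean = reject_spikes(exposures)
```

The real peak at channel 2 survives because it shows up in every exposure, while the one-off spike does not.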

"If the signal looks amazingly strong but unexpected and sharp, it's probably a cosmic ray" was what I was trained for.

Comment by charcircuit 2 days ago

I should have clarified that I was talking about software bugs.

Comment by on_the_train 2 days ago

Thank you for bringing reason to this topic, where everyone loses their mind when it comes up. Cosmic rays are sexy, they're sciency, but they're not a good explanation when you actually run the math.

Random bit flips are pretty much always bad hardware. That's what the literature says when you actually study it. And that's also what we find at work: we wrote a program that occupied most of the free RAM and checked it for bit flips. Deployed on a sizeable fleet of machines. We found exactly that: yes, there were bit flips, but they were highly concentrated on specific machines and disappeared after changing hardware.
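
For anyone curious what such a checker looks like, here is a toy sketch of the idea: fill a buffer with a known pattern, then rescan it later and report any byte that changed. The names and the 0xAA pattern are assumptions for illustration; a real fleet tool would be native code, pin its pages, and run for days.

```python
PATTERN = 0xAA  # alternating bits 10101010 (arbitrary choice)


def fill(size: int, pattern: int = PATTERN) -> bytearray:
    """Allocate a buffer of `size` bytes, every byte set to the known pattern."""
    return bytearray([pattern]) * size


def scan(buf: bytearray, pattern: int = PATTERN) -> list[tuple[int, int]]:
    """Return (offset, flipped_bits_mask) for every byte that no longer matches."""
    return [(i, b ^ pattern) for i, b in enumerate(buf) if b != pattern]


buf = fill(64 * 1024)
buf[1234] ^= 0x08  # simulate one flipped bit; a real run would just wait
findings = scan(buf)
```

Logging the offsets per machine is what lets you see whether flips cluster on specific hardware.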

Comment by ExoticPearTree 2 days ago

> I feel like using "Cosmic Rays" as a reason is equivalent to "Aliens".

This is actually a thing. Cisco had issues with cosmic radiation in some of their equipment a few years back. Same symptoms: random memory corruption, and when they would test the memory everything would check out, but once in a blue moon, the routers would behave erratically.

Comment by financetechbro 2 days ago

What evidence do you have that this wasn’t due to solar radiation?

Comment by charcircuit 2 days ago

In practice such an event is rare, and I would expect there to be enough shielding to keep it from interfering with the electronics. The fact that there is no hard evidence is why it's hard to argue against this clickbait claim, since technically they could be right.

Comment by adrian_b 2 days ago

The cosmic radiation that reaches Earth's surface consists mainly of particles that cannot be stopped by thin shields (e.g. muons or other particles with very high energies), otherwise they would not have passed through the atmosphere.

So shielding is not a solution that can be applied in a vehicle. You need something like an underground bunker to be sure that no cosmic radiation can penetrate it.

The only reason events caused by cosmic radiation are rare is that particles able to pass through shields will, in most cases, also pass through electronic devices without being absorbed and causing a malfunction.

Comment by burnt-resistor 2 days ago

PSA:

0. Always use a) SECDED hardware ECC and b) checksums on network links and I/O everywhere.

1. When unable to do 0.a), add an 8-bit (72,64) Hamming code per 64 bits, or N>2 redundant copies on physically separate silicon, for critical data and code. This is a significant performance hit, but safety is more important in some uses. (Don't neglect the integrity and reliability of code storage, loading, and execution paths either.)

2. Consider using Space Shuttle-style high-availability, high-reliability "voting" among N system control elements designed to behave identically, possibly from different manufacturers.
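
For item 1, here is a rough software sketch of SECDED (single-error-correct, double-error-detect) using an extended Hamming code. With 64 data bits it needs 7 Hamming check bits plus 1 overall parity bit, which is where (72,64) comes from. The bit-list layout and function names are my own choices for illustration; hardware ECC does this in logic, not in Python.

```python
def _check_bits(k: int) -> int:
    """Smallest r such that 2**r >= k + r + 1."""
    r = 0
    while (1 << r) < k + r + 1:
        r += 1
    return r


def _syndrome(code: list[int]) -> int:
    """XOR of the (1-based) positions of all set bits; 0 for a valid word."""
    s = 0
    for pos in range(1, len(code)):
        if code[pos]:
            s ^= pos
    return s


def encode(data: list[int]) -> list[int]:
    """Index 0 is overall parity; power-of-two indexes are Hamming check bits."""
    n = len(data) + _check_bits(len(data))
    code = [0] * (n + 1)
    bits = iter(data)
    for pos in range(1, n + 1):
        if pos & (pos - 1):  # not a power of two, so a data position
            code[pos] = next(bits)
    s = _syndrome(code)
    for i in range(_check_bits(len(data))):
        code[1 << i] = (s >> i) & 1  # force the syndrome to zero
    code[0] = sum(code) & 1  # make the total number of ones even
    return code


def decode(code: list[int]) -> tuple[list[int], str]:
    code = list(code)
    s, odd = _syndrome(code), sum(code) & 1
    if s == 0 and not odd:
        status = "ok"
    elif odd and s < len(code):
        code[s] ^= 1  # single error; s == 0 means the parity bit itself
        status = "corrected"
    elif odd:
        status = "uncorrectable"
    else:
        status = "double-error"  # even parity but nonzero syndrome
    data = [code[p] for p in range(1, len(code)) if p & (p - 1)]
    return data, status


word = [(0xDEADBEEFCAFEF00D >> i) & 1 for i in range(64)]
stored = encode(word)  # 72 bits
hit = list(stored)
hit[37] ^= 1  # simulated single-bit upset
recovered, status = decode(hit)
```

On a "double-error" result the returned data should not be trusted; the point is only that the corruption is detected rather than silently used.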

Comment by Borrible 2 days ago

That reminds me of how the manufacturer's customer service department for my car some thirty years ago tried to convince me that the problems with the ignition electronics could also be caused by solar flares. Which could have been the case, of course, but then it would surely have affected other vehicle owners as well. Though maybe the sun did shine just for me back then; you can never be sure, can you? I briefly considered consulting an astronomer.

Comment by djmips 2 days ago

There was a funny story about how the sun shining on a UV-sensitive electronic component was the root cause of a mysterious failure that depended on the time and day.

Comment by who-shot-jr 2 days ago

The Universe is Hostile to Computers - https://www.youtube.com/watch?v=AaZ_RSt0KP8

Comment by SwiftyBug 3 days ago

I thought planes had insane redundancy exactly so stuff like that doesn't happen. How can a bit flip cause the system that controls altitude to malfunction like that?

Comment by procflora 3 days ago

From what I've heard (FWIW), Airbus released a version of the software for one of the flight computers that removed SEU protections (hence grounding affected models until they could be downgraded to the previous version).

There was still hardware redundancy though. Operation of the plane's elevator switched to a secondary computer. Presumably it was also running the same vulnerable software, but they diverted and landed early in part to minimize this risk.

So not just redundancy but layers of redundancy.

Comment by willis936 3 days ago

Why would you ever expect one bit flip? You have a flip rate and you design your system to tolerate a certain bit flip rate. Assumptions made during requirements establishment were wrong and nature eventually let them know they had negative margin.
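
As a back-of-the-envelope sketch of what "design to a flip rate" means: treat upsets as a Poisson process and ask how likely at least one is in a given window, then how likely two of three redundant units are hit in the same window. Every number below is HYPOTHETICAL and picked only to make the arithmetic visible; real rates depend on the part, altitude, and latitude.

```python
import math


def upset_probability(rate_per_bit_hour: float, bits: int, hours: float) -> float:
    """P(at least one upset) for a Poisson process with the given rate."""
    expected_upsets = rate_per_bit_hour * bits * hours
    return -math.expm1(-expected_upsets)  # 1 - exp(-x), stable for tiny x


def at_least_two_of_three(p: float) -> float:
    """P(two or more of three independent units are upset in the same window)."""
    return 3 * p * p * (1 - p) + p ** 3


# HYPOTHETICAL numbers, for illustration only.
RATE = 1e-12        # upsets per bit per hour (assumed)
BITS = 8 * 2**20    # 1 MiB of unprotected state (assumed)
WINDOW_HOURS = 1.0  # exposure window before the state is refreshed (assumed)

p_unit = upset_probability(RATE, BITS, WINDOW_HOURS)
p_outvoted = at_least_two_of_three(p_unit)
print(f"one unit: {p_unit:.3e}, two of three: {p_outvoted:.3e}")
```

If the assumed rate is too low, the margin disappears quadratically in the voted case, which is one way to end up with "negative margin".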

Comment by p_l 3 days ago

The possibility of bit flips from cosmic radiation only really came to the fore in the 1990s, and some aircraft and parts predate that.

Comment by 15155 2 days ago

Smaller semiconductor feature size greatly increases the likelihood of these types of errors.

Comment by p_l 1 day ago

For a long time, ECC's main benefit was as a hedge against failing silicon and local EMI. Aviation had the benefit of careful EMI design and appropriately selected chips, so ECC was seen as less of a benefit...

Comment by bdangubic 3 days ago

  if (cosmic_ray) {
     do_not_flip_bits()
  } else {
     flip_away()
  }

Comment by rjp0008 3 days ago

What if in the time between initialization of cosmic_ray to False, and the time this if statement executes, a legitimate cosmic ray flips the bool bit representing cosmic_ray?

Comment by sunrunner 3 days ago

This is a really good point and a common error in bit flip detection code. To avoid this kind of look-before-you-leap hazard the following is recommended:

    try {
        do_action()
    } catch (BitFlipError e) {
        logger.critical("Shouldn't get here")
    }

Ask-for-forgiveness as an error detection pattern avoids these kinds of errors entirely.

Comment by terminalshort 2 days ago

Simple! Make it an int.

  int cosmic_ray = 0
  if (bool(cosmic_ray)) {
     throw cosmicRayException()
  }

Comment by wavemode 3 days ago

ah, a classic TORTOF bug (time-of-ray, time-of-flip)

Comment by air7 2 days ago

Naive question, but can't this be solved with device-level error correction?

Comment by djmips 2 days ago

It could be like ECC RAM but you'd have to make new hardware.

Comment by MarkusQ 3 days ago

This is silly. Rapidly refreshing the data that was (presumably) flipped by a cosmic ray last time won't do anything to prevent an error in whatever it hits next time. Unless the theory is that cosmic rays are somehow more likely to hit these particular bits compared to all the millions (billions?) of others in the system...in which case I have a different objection.

Comment by RealityVoid 3 days ago

What is silly is media coverage of this. The error was in the ADIRU. They are updating the ELAC. The ELAC takes the decision based on multiple data streams from 3 ADIRU units and the issue being fixed is that it took the wrong decision. The ADIRU will probably continue having SEU but it will be fine.

Comment by AlotOfReading 3 days ago

Not all circuits are equally sensitive. The parts that are known to be sensitive or critical are protected by redundancies and error checking, which are probabilistic protection. You haven't completely eliminated the possibility of corruption, just made it incredibly unlikely. Refreshing your inputs is another form of probabilistic protection focused on mitigating the consequences.
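
Here is a toy simulation of why refreshing helps, under the assumption that the consumer works from a latched copy of a parameter. It does not stop the flip; it only bounds how many cycles a corrupted copy can be used. The class and function names are hypothetical and this is not Airbus's implementation.

```python
class LatchedParameter:
    """Holds a local copy of a value and re-reads the source every N ticks."""

    def __init__(self, source, refresh_every: int):
        self.source = source  # callable returning the authoritative value
        self.refresh_every = refresh_every
        self.value = source()
        self._age = 0

    def tick(self) -> int:
        """Advance one control cycle and return the value the consumer would use."""
        self._age += 1
        if self._age >= self.refresh_every:
            self.value = self.source()  # overwrite whatever is latched
            self._age = 0
        return self.value


def corrupted_cycles(refresh_every: int, flip_at_tick: int, total_ticks: int) -> int:
    """Count cycles in which the consumer sees a wrong value after one upset."""
    param = LatchedParameter(lambda: 42, refresh_every)
    bad = 0
    for t in range(total_ticks):
        if t == flip_at_tick:
            param.value ^= 1 << 3  # simulated upset in the latched copy
        if param.tick() != 42:
            bad += 1
    return bad


exposure = {n: corrupted_cycles(n, flip_at_tick=3, total_ticks=50) for n in (1, 10, 100)}
```

The shorter the refresh interval, the fewer cycles the bad value can influence, which matches the "no time to have effect" framing; an upset in the source itself is a separate problem.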

Comment by MarkusQ 3 days ago

Why not ECC though? Unless this is a latched output of a robust system being held for use by another robust system I guess?

Comment by adrian_b 2 days ago

In another HN thread it was said that a new model of the affected module, one that has ECC, has been in production for several years.

Despite that, most of the existing planes have an older model, which has not been upgraded.

Comment by AlotOfReading 3 days ago

ECC is one of the probabilistic protections I was talking about.

Comment by preommr 3 days ago

I had no idea this was a real thing - I always thought that xkcd comic[0] was just a random joke.

[0]https://xkcd.com/378/

Comment by nomel 3 days ago

It's literally one of the reasons ECC RAM exists.

Comment by aruametello 3 days ago

To dial up the weirdness, sometimes the solar flare activity has spikes (https://www.spaceweatherlive.com/en/solar-activity/solar-fla...) and these have a mild relationship with the odds of having "bitflips" in that timeframe.

We had "historic bad solar weather" a bunch of years ago, and I told a cyber cafe operator that "you could have more computers bluescreen this week than usual".

To me it got really weird when he said later that he really did, but honestly it's 50/50 that it could have been just incidental.

On another note, there are some "rather intense" discussions when someone speedrunning a game gets an "unreproducible glitch" in their favor. Some claim it's a flaw from ageing DRAM hardware, but some always point out that it could be a cosmic ray flipping the right bit. (https://tildes.net/~games/1eqq/the_biggest_myth_in_speedrunn...)

Comment by mikestew 3 days ago

> I had no idea this was a real thing

Oooh, in that case I have another xkcd you might like, involving mint candies and soft drinks…

Comment by jessriedel 3 days ago

I thought some combination of error correction and redundant systems was already widespread in airplanes to prevent cosmic-ray induced errors. (GPT agrees.) What am I missing? I've read multiple articles on this, and none of them address the fact that the problem, at the level of detail described in the article, should have been prevented by technology available and widely deployed for decades.

Comment by pengaru 3 days ago

> GPT agrees

What do you think this adds? These things are sycophantic, confident idiots; they will agree, and then agree they're incorrect at the slightest challenge in the same interaction.

Comment by jessriedel 2 days ago

I'm quite aware of the limitations. That's why I bothered to post a comment. But it's definitely better to do due diligence by asking first, since many responses can then be checked. Mentioning it in the comment shows the effort, similar to "Google turned up nothing".

Comment by pengaru 2 days ago

"my sycophant agrees" simply isn't adding anything of substance

Comment by jessriedel 1 day ago

If that's your honest impression, it's incorrect and I urge you to spend more time working with frontier models.

Comment by RealityVoid 3 days ago

You're missing that the systems were designed in the 90's and had no EDAC on them but instead relied on redundancy and a consensus system. The fact that bit flips happened is not why they grounded the things and updated the SW; they grounded them to address the consensus algorithm in the other CPU, the one that did not get the bit flips.

Comment by jessriedel 2 days ago

Do you have a source on that? The current article describes the software very differently:

> In any case, the software updates rolled out by the company appear to be quick and easy to install. Many airlines completed them within hours. The software works by inducing "rapid refreshing of the corrupted parameter so it has no time to have effect on the flight controls", Airbus says. This is, in essence, a way of continually sanitising computer data on these aircraft to try and ensure that any errors don't end up actually impacting a flight.

Comment by RealityVoid 1 day ago

Yes, my understanding of this was wrong and based on reading the failure analysis of another issue that was related to the ELAC but the SEU failure happened in the ADIRU.

I take my analysis on this back; it was true only for the other incident. I can't edit my answers anymore now. Not sure what is going on with this failure; I would love to read a detailed analysis report like the one I went through for the other incident.