Ghosts in the code: the near crash of Qantas flight 72

Qantas flight 72 sits on the tarmac in Learmonth after the emergency landing. Image source: The Daily Mail

On the 7th of October 2008, Qantas flight 72 was cruising high over the Indian Ocean on its way to Perth in Western Australia when it suddenly pitched down without warning. Before the pilots could figure out what was going on, it happened again. It seemed as though the plane had a mind of its own, as though the computer at the heart of the Airbus A330 had gone rogue. Although the pilots managed to make a safe emergency landing, the violent pitch downs injured more than 100 people, some of them seriously, and caused significant damage to the cabin furnishings. Investigators tasked with finding the cause traced the problem back to bad data put out by an onboard computer called the Air Data/Inertial Reference Unit, which triggered a series of software malfunctions that culminated in an automatic 10-degree nose down elevator command during cruise flight. How was it possible that ghosts in the code could injure so many people and threaten to bring down a plane on one of the world’s safest airlines? The ultimate source of the problem proved elusive, but investigators believed that Qantas flight 72 held valuable lessons regarding the type of safety risk that will become increasingly common as airplanes become more complex.

The Qantas A330 involved in the accident. Image source: Wikipedia

Qantas flight 72 was a regularly scheduled service with Australia’s national airline from Singapore to Perth, Western Australia. Operated by a wide-body Airbus A330, the flight left Singapore at 9:32 a.m. local time with 303 passengers and 12 crew on board, headed south across the Indian Ocean. In command were Captain Kevin Sullivan and First Officer Peter Lipsett, both of whom had more than 10,000 flying hours. A third pilot, Second Officer Ross Hales, was also on board so that the pilots could alternate rest breaks during the flight. About halfway through the journey, First Officer Lipsett gave up his seat to Second Officer Hales and went on his rest break. It was 12:39 p.m.

Deep in the A330’s avionics bay, a fault appeared in a device called the number one Air Data/Inertial Reference Unit, or ADIRU 1 for short. The A330 has three ADIRUs, each of which is connected to an independent set of sensors that measure a wide range of parameters, including airspeed, altitude, and angle of attack (AOA), the angle between the wing and the oncoming airflow.

The ADIRUs process that information and feed it to the flight computers in the form of 32-bit “words” encoded in binary. Each “bit” is a unit of information with two binary states, one or zero, that are assigned different meanings depending on their position in the 32-bit word. A word sent from the ADIRU to the flight computer contains an 8-bit label that signifies what type of information is being conveyed (airspeed, altitude, etc.); a 2-bit source/destination identifier that signifies where the information is coming from and where it is going; up to 19 bits of actual measured data; a 2-bit status indicator that signifies whether or not the data is valid; and a 1-bit parity indicator that causes the destination computer to reject the word if it contains the wrong number of zeroes and ones.

ADIRU Data Export Format. Image source: the ATSB
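As a rough illustration of the word layout described above, the sketch below packs and unpacks a 32-bit word in Python. The field positions (label in bits #1 to #8, source/destination in #9 and #10, data in #11 to #29, status in #30 and #31, parity in #32), the odd-parity convention, and the example label and status values are assumptions borrowed from the ARINC 429 avionics bus standard, not details given in this article.

```python
def pack_word(label, sdi, data, status):
    """Assemble a 32-bit word from its fields (assumed layout, see above)."""
    if not (0 <= label < 2**8 and 0 <= sdi < 2**2
            and 0 <= data < 2**19 and 0 <= status < 2**2):
        raise ValueError("field out of range")
    word = label | (sdi << 8) | (data << 10) | (status << 29)
    # Assumed odd parity: set bit #32 if the other 31 bits hold an even
    # number of ones, so that the finished word always has an odd count.
    if bin(word).count("1") % 2 == 0:
        word |= 1 << 31
    return word


def unpack_word(word):
    """Split a 32-bit word back into its fields."""
    return {
        "label": word & 0xFF,
        "sdi": (word >> 8) & 0b11,
        "data": (word >> 10) & 0x7FFFF,
        "status": (word >> 29) & 0b11,
    }


def parity_ok(word):
    """True if the word contains an odd number of ones."""
    return bin(word).count("1") % 2 == 1


ALTITUDE_LABEL = 0o203   # illustrative label value, not taken from the article
VALID = 0b11             # illustrative "valid" status code

# Bit #12 is worth one foot, so a whole-foot altitude sits one place above
# the bottom of the 19-bit data field (which is assumed to start at bit #11).
word = pack_word(ALTITUDE_LABEL, sdi=1, data=37012 << 1, status=VALID)
fields = unpack_word(word)
print(fields["data"] >> 1, parity_ok(word))
print(parity_ok(word ^ (1 << 20)))   # the same word with one bit flipped
```

A single bit flipped after the word is assembled makes the parity test fail, which is the kind of transmission error the parity bit exists to catch. A word that is assembled from bad inputs in the first place still gets a correct parity bit.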

Of particular interest is the 19-bit data section. Each bit in the 19-bit sequence is assigned a particular value, always twice that of the preceding bit, on a scale that depends on the parameter being measured. For example, in the altitude parameter, bit #12 is always one foot, bit #13 is always two feet, bit #14 is always four feet, and so on. An altitude value is encoded as a sum of these values; the ones used in the sum are indicated by changing the binary value of the associated bit from zero to one. For example, flight 72’s cruising altitude of 37,012 feet can be indicated with a binary value of one on bits #27 (32,768 feet), #24 (4,096 feet), #19 (128 feet), #16 (16 feet), and #14 (4 feet), with all other bits in the data section set to a binary value of zero.

How angle of attack is encoded. Image source: the ATSB

What exactly happened inside ADIRU 1 on board flight 72 at precisely 12:40 p.m. is unknown to this day. But while the triggering event is a mystery, the effect it had on the data being put out by this ADIRU was remarkable. As soon as the error occurred, the ADIRU started to send out bursts of mislabeled data: words in which altitude information carried the 8-bit label sequence corresponding to airspeed or AOA. Because the value encoded in a word depends on what type of data it is labeled as, the information became corrupted. The particular bits that were set to a binary value of one to sum to the aircraft’s altitude remained set as such, but now represented the corresponding values on the scale of a different parameter. Consider the previous example with a measured altitude of 37,012 feet. To sum to 37,012 feet, bits #27, #24, #19, #16, and #14 were assigned a binary value of one. However, on the scale used for AOA data, that same pattern of bits was read as an angle of 50.625 degrees.

How altitude data was encoded as angle of attack. Image source: the ATSB
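The reinterpretation described above can be sketched in a few lines of Python. The altitude scale (bit #12 worth one foot, doubling with each bit) comes from the article. The AOA scale used here is a hypothetical one: it assumes bit #28 is worth 90 degrees, each lower bit half as much, and nothing below bit #20 is used. Those numbers were chosen only because they reproduce the 50.625-degree figure quoted above.

```python
ALTITUDE_UNIT_BIT = 12   # per the article: bit #12 is worth one foot


def altitude_bits(feet):
    """Return the word bit numbers set to one for a whole-foot altitude."""
    return [ALTITUDE_UNIT_BIT + i for i in range(17) if (feet >> i) & 1]


def decode(bits, top_bit, top_value, lowest_bit):
    """Sum the weights of the set bits on a given scale.

    top_bit is worth top_value and each lower bit is worth half the one
    above it; bits below lowest_bit are ignored.
    """
    return sum(top_value / 2 ** (top_bit - b)
               for b in bits if lowest_bit <= b <= top_bit)


bits = altitude_bits(37012)

# Altitude scale: bit #12 = 1 ft, which puts bit #28 at 65,536 ft.
as_altitude = decode(bits, top_bit=28, top_value=65536, lowest_bit=12)

# HYPOTHETICAL AOA scale (an assumption, see the note above the code).
as_aoa = decode(bits, top_bit=28, top_value=90.0, lowest_bit=20)

print(as_altitude, as_aoa)
```

The bit pattern never changes between the two calls. Only the scale applied to it does, which is all a wrong label needs to accomplish.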

As soon as the error occurred, ADIRU 1 started intermittently sending this mislabeled data to the flight computers. But this wasn’t the only problem. Some of the false data was used as a reference point to calculate the next batch, corrupting future “words” as well. Parameters derived from the corrupted values were thus corrupted themselves, as were the periodic “status reports” put out by the ADIRU, which indicated whether various systems were working or not. Although no single mechanism that would conclusively explain all the types of corrupted data was ever found, the origin of the problem might have been the ADIRU CPU making errors when reading values stored in its random access memory.

The built-in feature that labeled data as valid or invalid didn’t catch the problem because the corruption occurred during the word assembly process, after the checks were performed. Much of the corrupted data either passed the additional checks or caused those checks themselves to fail. For example, the computer always checked AOA data to ensure that it was compatible with the plane’s measured airspeed and pitch angle, but because those parameters were also corrupted, the check couldn’t do its job.

Examples of checks that failed to catch the bad data. Image source: the ATSB

On the other end, the flight computer received data from all three ADIRUs, including the two that were functioning normally, and constantly compared their outputs to ensure consistency and detect false data. Over every one-second period, the computer made 25 comparisons of the AOA values put out by the three ADIRUs, calculated the median value at every sampling interval, and discarded AOA data from any ADIRU whose outputs were consistently too far from the median over the course of the one-second period. In the event that an AOA value differed significantly from the median at the beginning of the one-second interval, the computer would “remember” the last valid data sent from that ADIRU and use that in its calculations for 1.2 seconds before sampling again.

But there was a hidden flaw in this process. If a “spike” of bad AOA data occurred at the beginning of the one-second comparison period, disappeared, and then returned within 0.2 seconds after the end of the comparison period, the 1.2-second memorization period would be triggered, but the computer would not reject the ADIRU’s AOA outputs because they had not been invalid over the whole one-second period. Then, when the memorization period ended and the computer re-sampled the data, the output was invalid again, but it was treated as valid because the ADIRU had just passed the comparison test. The computer in effect assumed that if the test had passed, whatever value it received after the end of the test was necessarily valid, and used this value in its next calculation of the plane’s actual angle of attack.

In this way, the flood of bad data from ADIRU 1, and in particular the bad AOA data, made it through every single protection meant to filter it out and was used by the flight computer in its calculations.

How the data got past the computer’s comparison check of the ADIRU readings. Image source: the ATSB
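The timing flaw is easier to see in a toy simulation. The sketch below follows the description in this article, not the real flight computer software: it tracks only ADIRU 1, counts time in ticks of 1/25 second, and uses a made-up deviation threshold and made-up AOA values. It also assumes the first sample is good.

```python
MONITOR_TICKS = 25    # one-second comparison period at 25 samples per second
MEMORY_TICKS = 30     # 1.2-second memorization period
THRESHOLD_DEG = 5.0   # hypothetical deviation limit, not from the article


def aoa1_as_used(samples):
    """Return the ADIRU 1 AOA value acted on at each tick.

    samples is a list of (aoa1, aoa2, aoa3) tuples, one per tick.
    None in the result means ADIRU 1 has been rejected.
    """
    used = []
    last_valid = None
    start = None         # tick at which the current deviation was first seen
    persistent = False   # True while every monitored sample has deviated
    rejected = False
    for tick, (a1, a2, a3) in enumerate(samples):
        if rejected:
            used.append(None)
            continue
        median = sorted((a1, a2, a3))[1]
        deviates = abs(a1 - median) > THRESHOLD_DEG
        if start is None:
            if deviates:
                # Start monitoring and fall back on the remembered value.
                start, persistent = tick, True
                used.append(last_valid)
            else:
                last_valid = a1
                used.append(a1)
            continue
        elapsed = tick - start
        if elapsed < MONITOR_TICKS:
            persistent = persistent and deviates
            used.append(last_valid)
        elif elapsed == MONITOR_TICKS and persistent:
            rejected = True          # bad for the whole second: discard
            used.append(None)
        elif elapsed < MEMORY_TICKS:
            used.append(last_valid)  # test passed, still memorizing
        else:
            # The flaw: the current value is accepted without a new check.
            start = None
            last_valid = a1
            used.append(a1)
    return used


def scenario(spike_ticks, length=40, normal=2.0, spike=50.625):
    """Build samples in which only ADIRU 1 spikes, at the given ticks."""
    return [(spike if t in spike_ticks else normal, normal, normal)
            for t in range(length)]


brief = aoa1_as_used(scenario({5, 6} | set(range(32, 37))))
sustained = aoa1_as_used(scenario(set(range(5, 40))))
print(brief[35], sustained[30])
```

In the `sustained` case the spike lasts the whole second and ADIRU 1 is rejected, as intended. In the `brief` case the spike vanishes, returns just before the memorization period runs out, and is accepted at tick 35.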

In the cockpit, the pilots noticed the effects of the bad data within seconds of its creation. First, the autopilot disconnected as it proved unable to reconcile the differences in the data it was receiving from the three ADIRUs. Captain Sullivan immediately announced that he had manual control. Less than five seconds later, the pilots found themselves bombarded by a sudden cascade of warnings triggered by the mislabeled and corrupted data. Fault messages flooded onto the computer screen in the central console, and the “stall” and “overspeed” warnings both started going off intermittently — an obviously impossible combination, considering that one indicated they were flying too slow and the other indicated they were flying too fast!

Captain Sullivan tried engaging the A330’s second, backup autopilot. At the same time, the airspeed and altitude values on Sullivan’s flight display, which sources its data from ADIRU 1, appeared to go haywire, fluctuating wildly in a manner completely inconsistent with the aircraft’s level and docile trajectory. A fault message and warning light associated with the number one inertial reference unit (part of ADIRU 1) also went off. In response to the unreliable airspeed indications, Sullivan switched the autopilot back off and flew the plane manually using the standby instruments on the center console. Utterly baffled by the cascade of apparently false warnings, Captain Sullivan and Second Officer Hales called First Officer Lipsett back to the cockpit to help figure out what was going on.

But before Lipsett made it to the cockpit, the sequence of events unfolding in the realm of information suddenly broke through into the real world. A spike of altitude data mislabeled as AOA data and accepted as valid by the flight computer triggered two separate functions of the A330’s flight envelope protections. These protections, a central part of Airbus’ design philosophy, are limits imposed on the pitch, angle of attack, airspeed, and bank angle that trigger automatic corrective actions when exceeded. They normally prevent pilots from making control inputs that could put the plane into a dangerous attitude, and correct a dangerous attitude if one develops. But the faulty data incorrectly triggered two of them even though the aircraft was in a normal attitude for cruise flight. A system called “high AOA protection” detected an excessively high angle of attack (sourced from the faulty ADIRU 1) and applied a 4-degree nose down elevator input, the maximum it could command, to help bring the AOA back within limits. At exactly the same time, the same bad data triggered a separate system called “anti-pitch up compensation” that is intended to counteract the A330’s tendency to pitch up when flying at a high speed and high angle of attack. This system applied a 6-degree nose down elevator input, which also happened to be the maximum it could command. The two nose-down commands were additive, together applying a sudden 10-degree nose down elevator movement.

The two control mechanisms triggered by the bad AOA data. Image source: the ATSB

The effect of a 10-degree nose down command while in cruise flight was sudden and catastrophic. The plane entered an immediate dive, flinging into the ceiling anyone and anything that wasn’t tied down. At least 60 seated passengers weren’t wearing their seat belts, and the negative G-forces slammed them head-first into the passenger service units on the bottom of the overhead bins. Several others, including most of the crew and some 20 passengers, were out of their seats carrying out various duties or making their way to the toilets. They too found themselves thrown against the ceiling with great force. Luggage compartments burst open, spilling suitcases and backpacks into the aisles. Drinks, food, laptops, books, and other loose items flew in every direction.

Reenactment and simulation of the dive. Video source: Mayday

In the cockpit, the pilots were pulled up and out of their seats, restrained only by their lap belts. Captain Sullivan reached for his side stick to pull the aircraft out of the dive, but when he tried to bring the nose up, there was no response; the automatic systems had locked him out. He let go and then tried again. This time, because the data spike was over, the elevators responded and the plane started to level out.

As the negative G-forces subsided, everyone in the cabin who was pinned to the ceiling came crashing back down again. People slammed into the floor, the seats, and other passengers, falling back down amid a chaotic flurry of random objects. Still recovering from the shock of the upset, passengers and crew alike took stock of the situation. The violent maneuver had caused widespread injuries — there were broken bones, concussions, serious lacerations, and more. All of the flight attendants were injured to various degrees. One person broke a leg, several suffered serious spinal injuries, and many were bleeding profusely. First Officer Lipsett, who had been on his way to the cockpit, broke his nose.

Now back in control, Sullivan and Hales, who were not hurt, set about trying to clear all the error messages on the computer screen. The fault notifications affected a wide variety of systems, and many of them required no action, but the one that kept coming up again no matter what they did was the same “NAV IR 1” fault that they received earlier. And as they worked, stall and overspeed warnings continued to blare. Second Officer Hales made an announcement over the public address system calling for all passengers and crew to sit down and fasten their seat belts immediately.

Suddenly, another spike of bad AOA data made it through to the flight computer. Although the disconnection of the autopilot had changed the protection logic, removing the high AOA protection, the anti-pitch up compensation system remained active and was triggered again. This time the dive wasn’t as steep and most people had fastened their seat belts, but some who had been injured or were trying to help others had not, and they were thrown into the ceiling again. Just like the first time, Sullivan’s initial efforts to pull up had no effect; and just like the first time, the resistance abated after several seconds and he was able to level the plane.

A sudden pitch down was one thing, but two sudden pitch downs was quite another. With all kinds of alarms going on and off in the background and new error messages appearing constantly, the crew were unsure what was happening and feared they could dive again at any moment. An immediate landing at Learmonth in Western Australia seemed like the best option.

Lipsett, despite his broken nose, at last made it to the cockpit and took over for Hales. He reported that there were injuries among the passengers as well. At this time, Sullivan noted that the automated stabilizer trim wasn’t working; the trim would have to be adjusted manually. The navigation equipment was also not functioning and they couldn’t interact with the computer interface at all. Sullivan declared a pan-pan-pan, one level short of a mayday, and informed controllers that flight 72 was headed to Learmonth with “flight computer problems.” After receiving word from the flight attendants that there were numerous broken bones, lacerations, and other injuries, he upgraded this to a full mayday and requested that ambulances meet the aircraft after landing.

The pilots flew the remainder of the flight in full manual mode, trying to ignore the constant spurious alarms that refused to turn off. First Officer Lipsett called Qantas maintenance in Sydney over the satellite communication system to try to get help resolving the situation, but they were also unable to figure out what was wrong. However, the sudden pitch downs never returned, and flight 72 landed safely at Learmonth at 1:32 p.m.

Emergency services work on the plane after the emergency landing. Image source: news.com.au

All told, at least 119 of the 315 passengers and crew were injured, 12 of them seriously. The interior of the cabin was utterly trashed. Ceiling panels were broken, passenger service units destroyed, overhead bins wrenched out of alignment. Trash, food, blood, and spilled drinks littered the floor. And while the plane would fly again and no one was killed, many people suffered injuries that will be with them for the rest of their lives, all because of some “ghosts in the code.” Investigators with the Australian Transport Safety Bureau (ATSB) had to ask: how could such a thing happen?

As it turned out, it wasn’t the first time this type of error had occurred. Another Qantas A330 had experienced a similar data problem in 2006, also off the coast of Western Australia. And in December of 2008, it happened again on yet another Qantas flight off Western Australia. Neither of these other two cases involved an uncommanded pitch down, but the failure mode of the ADIRU in all three incidents was similar, and two of them even involved the exact same ADIRU. The fact that these failures all occurred within a small geographical region seemed too strange to be a coincidence, but despite a variety of theories, and a call from the Australian and International Pilots Association to ban flights over the area, investigators could find nothing inherent to Western Australia that could have caused the malfunctions.

Damage to overhead bins and passenger service units. Image source: the ATSB

In fact, the ATSB was never able to conclusively find what caused the ADIRU to start sending out false and mislabeled data. Only one theory could not be ruled out: a Single Event Effect, or SEE for short. A SEE occurs when a high-energy particle from outer space, such as a neutron, strikes a computer chip and randomly changes a binary switch from one to zero or zero to one. If a SEE occurred at a critical location within the ADIRU CPU’s memory module, it could, just maybe, have triggered everything that followed. The ATSB was unable to find evidence to prove or disprove the theory, but the fact that the two ADIRUs that experienced this type of malfunction were close to one another in serial number suggested that there might have been some minute hardware flaw in that batch of ADIRUs that made them more susceptible to a SEE.
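To give a sense of how much damage one flipped bit can do, here is a minimal Python sketch. It is purely illustrative: the stored value is the altitude example from earlier in the article, the choice of bit is arbitrary, and nothing here is meant to reproduce what actually happened inside the ADIRU.

```python
def flip_bit(value, bit_index):
    """Invert one bit of an integer, as a single event effect might."""
    return value ^ (1 << bit_index)


stored = 37012                 # example altitude in feet
upset = flip_bit(stored, 15)   # bit 15 (counting from 0) is worth 32,768
print(stored, upset)           # 37012 4244
```

The more significant the affected bit, the larger the error, and a flip in a pointer or a label can be far more disruptive than one in a measured value.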

Damage to the ceiling in the aisles. Image source: the ATSB

What made the failure of the ADIRU dangerous was not that it failed per se, but that the invalid data passed through the many layers of cross-checks without being flagged as such. Had the data spikes been flagged as invalid at some point in the process, the computer would have disregarded them and the safety of the flight would never have been compromised. The investigation found a hitherto unknown failure mode in which data spikes occurring approximately every 1.2 seconds could trick the computer into thinking bad data was real. This was where the real safety problem lay. It might not be possible to prevent a few ones and zeroes from becoming corrupted every now and then, but if the layered protections couldn’t always detect the corrupted data, that represented a safety risk. Those protections were good — the ADIRU itself could weed out 93.5% of invalid data on its own before the computer even did its cross-checking — but this wasn’t enough to prevent a bit of mismatched code from injuring 119 people. In principle, however, the ADIRU remained completely safe. This type of failure occurred only three times in 128 million hours of service for this model of ADIRU, well within the probability zone that regulators consider “extremely remote.”
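The rate implied by those figures can be checked with a quick calculation. The numeric band used below for “extremely remote” (between one in ten million and one in a billion per flight hour) is an assumption taken from common transport-aircraft certification guidance, not from this article, and the calculation treats unit service hours as if they were flight hours.

```python
failures = 3
service_hours = 128_000_000

rate_per_hour = failures / service_hours

# Assumed "extremely remote" band, per flight hour (see note above).
EXTREMELY_REMOTE = (1e-9, 1e-7)

print(f"{rate_per_hour:.2e} failures per hour")
print(EXTREMELY_REMOTE[0] < rate_per_hour <= EXTREMELY_REMOTE[1])
```

That works out to roughly one occurrence every 43 million hours of operation.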

A man receives medical attention after the accident. Image source: the Sydney Morning Herald

One final angle that the ATSB pursued was the rate of seat belt usage among airline passengers. During the two in-flight upsets, unrestrained passengers crashed into the ceiling and into other passengers, causing injuries not just to themselves but also to others who were wearing their seat belts and otherwise wouldn’t have been hurt. While a few factors could be correlated with lower seat belt use, there was no universal reason why people chose not to wear them. Getting people to wear seat belts when the seat belt sign isn’t on is a challenge that airlines have grappled with for decades. Turning the seat belt sign on all the time isn’t a practical solution because people would grow complacent about its presence and ignore the sign at higher rates than before. Investigators decided that more research would have to be done to find the most effective ways to get around this paradox.

Damage to the aisle ceiling. Image source: NZHerald

In its final report, the ATSB wrote that the investigation was extremely difficult and touched on numerous areas where no air accident investigation had ventured before. The authors of the report were also keenly aware that the Qantas flight 72 incident could be representative of the sort of case that will become more and more common in the modern era. “Given the increasing complexity of [aircraft] systems,” they wrote, “this investigation has offered an insight into the types of issues that will become relevant for future investigations.”

Just days after the accident, Airbus issued a bulletin to all A330 operators instructing pilots to immediately shut off the indicated ADIRU when receiving a “NAV IR” fault. This advice might have prevented a similar accident in December of that year, when the pilots of Qantas flight 71 experienced an identical ADIRU malfunction but switched off the affected unit after just 28 seconds. Regulatory authorities worldwide re-issued this Airbus bulletin as an airworthiness directive, making it an official rule. Airbus also redesigned the logic used by the flight computer to verify AOA data, removing the possibility that well-timed data spikes could make it through the cross-check. And furthermore, Airbus began including novel ways of testing its data verification software, including testing with intermittent data spikes, which had not previously been attempted.

VH-QFA, the aircraft involved in the accident, photographed in 2018. Image source: Masakatsu Ukon

However, the ATSB ran into a problem: although the event that precipitated this failure was so rare that the ADIRU still met all reasonable safety guidelines, it represented only one example of corruption within the vast quantities of information being processed inside an airplane’s many computers. What other loopholes might exist that could cause a software bug, a SEE, or other sources of bad data to manifest in dangerous ways? How could these events ever be predicted?

One way was to tackle one of the suspected sources of errors: SEEs. After the Qantas accident, the European Aviation Safety Agency started asking manufacturers of aircraft computers to take into account SEEs during the design phase to make their products less susceptible. At the time of the report’s publication, the US Federal Aviation Administration was still researching the best ways to approach the problem. Today, understanding of the safety implications of this phenomenon is still developing. Nevertheless, Qantas flight 72 stands out as the first case where investigators delved deeply into a serious software failure — and serves as a reminder of the importance of keeping one’s seat belt fastened at all times.

______________________________________________________________


Analyzer of plane crashes and author of upcoming book (soon™). Contact me via @Admiral_Cloudberg on Reddit or by email at kylanddempsey@gmail.com.