The Dark Side of Logic: The near crash of SmartLynx Estonia flight 9001
On the 28th of February 2018, two pilots, four trainees, and an inspector departed Tallinn, Estonia for a routine training flight aboard an Airbus A320. But as the last student pilot carried out a touch-and-go landing, the plane was wracked by a series of confusing mechanical failures. The pilots lost all control over the elevators; the plane lost height and scraped the runway on its engines before careening back into the air. The elevators were jammed, the other flight controls were degraded, the flaps weren’t working, the right engine was on fire. Every conceivable warning blared simultaneously. Faced with impending catastrophe, the pilots used what little control they had to line up with the runway for a desperate emergency landing. On final approach both engines failed, but by squeezing out the last dregs of speed, the captain managed to nurse the plane to the very threshold of the runway, where it touched down hard and rolled out in the snow. Through nerves of steel, all aboard had been saved. But what had led to this near-disaster? The answer was a surprise to everyone: there was nothing wrong with the flight controls. The sequence of events actually began when someone used the wrong oil to lubricate an obscure piston deep inside the horizontal stabilizer — a subtle mistake which led to an escalating series of computer failures that nearly cost seven people their lives.
SmartLynx Airlines Estonia is a wholly-owned subsidiary of SmartLynx Airlines, an independent Latvian carrier specializing in charter flights from the Baltic states to holiday destinations. As the Estonian branch of the company, SmartLynx Estonia operates a fleet of three Airbus A320s based out of the country’s main airport in Tallinn, the capital city.
But SmartLynx Estonia flight 9001 was not scheduled to leave the country: in fact, it wouldn’t even get far from Tallinn. The purpose of the flight was to give student pilots working their way through the airline’s in-house A320 training program practice with takeoffs, landings, and go-arounds. During the approximately three-hour session, four trainee pilots with limited flying experience would each take off, perform a go-around, execute five touch-and-go landings, then land for real, under the supervision of an instructor. Also on board were a safety pilot, qualified to fly the A320 in case of an emergency, and an inspector from the Estonian Civil Aviation Authority, who was there to monitor SmartLynx Estonia’s training program. The plane they were flying was 18 years old, but it was new to the airline: SmartLynx had purchased it earlier that month.
Under the watchful eye of the instructor — a 63-year-old veteran pilot with over 24,000 flight hours — the first three trainees carried out their exercises without incident. There was only one small hiccup, an annoying caution message that kept coming back over and over again: “ELAC 1 PITCH FAULT.” The pilots knew this meant there was something wrong with one of the plane’s computers, but the manual said all they needed to do was turn the computer off and back on again, so that’s what they did.
The Airbus A320 is a fly-by-wire aircraft, meaning that pilot control inputs are fed to a bank of computers, which in turn augment those inputs before commanding the control surfaces to move. This allows the plane to fly more smoothly, helps the pilots extract maximum performance and efficiency from the airplane, and prevents them from making inputs that could lead to a loss of control. To ensure redundancy, each set of control surfaces is attached to a different computer, each of which has multiple backups that can kick in if the primary computer fails. Among these computers are the two ELACs, short for Elevator Aileron Computers, which transmit inputs from the pilot and autopilot to the elevators, ailerons, and horizontal stabilizer.
The ELACs are part of a multi-layered system intended to ensure the integrity of the flight controls at all times. Normally, ELAC 2 is responsible for the elevators, but if it encounters a problem, this responsibility can be transferred to ELAC 1. If both ELACs fail, still not all is lost! At this point a different pair of computers, called the Spoiler Elevator Computers or SECs, which normally control the state of the aircraft’s spoilers, can step in to control the elevators as well. Thus, to lose control over the elevators, four computers must fail at the same time. It’s easy to see why the crew of flight 9001 were not worried when the airplane kept telling them about a problem with ELAC 1.
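The fallback order described above can be pictured as a simple priority list. The short Python sketch below only illustrates that idea and is not Airbus’s actual reconfiguration logic; the function name and the list are assumptions made for this example, and the order is taken from the description in this article.

```python
# Hypothetical priority order for elevator control, following the text above:
# ELAC 2 first, then ELAC 1, then the two SECs.
ELEVATOR_PRIORITY = ["ELAC 2", "ELAC 1", "SEC 1", "SEC 2"]


def active_elevator_computer(failed):
    """Return the first computer in the priority list that has not failed.

    Returns None when every computer is down, meaning the elevators are no
    longer under computer control.
    """
    for computer in ELEVATOR_PRIORITY:
        if computer not in failed:
            return computer
    return None


print(active_elevator_computer(set()))                    # ELAC 2
print(active_elevator_computer({"ELAC 1"}))               # ELAC 2
print(active_elevator_computer({"ELAC 1", "ELAC 2"}))     # SEC 1
print(active_elevator_computer(set(ELEVATOR_PRIORITY)))   # None
```

In this simplified picture, a lone ELAC 1 fault changes nothing about who is flying the elevators, which is why the message looked harmless; only the empty result at the end corresponds to a loss of elevator control.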
Unknown to anyone on board, the caution message that they were getting was only the tip of the iceberg. The origin of the problem lay with the Trimmable Horizontal Stabilizer Actuator (THSA). This is the hydraulic actuator which physically moves the horizontal stabilizer, the control surface which the pilots and autopilot use to adjust the angle at which the plane is stable. As opposed to the elevators, which are meant for one-time inputs, the stabilizer trim is for longer-term adjustments that ensure the plane stays at the pitch angle where the pilots or the autopilot want it. Also unlike the elevators, the stabilizer trim is connected mechanically to the pilots’ flight controls, and can still be used even without any functioning computers.
Whenever flight 9001 came in for a touch-and-go landing, the wheels would touch the ground, and after five seconds the THSA would enter what is known as the “ground setting.” In order to make the landing rollout easier, upon entering the ground setting the ELAC trims the stabilizer in the nose down direction, assisting the pilot in planting the aircraft on the runway. But in a touch-and-go landing, where the plane will remain on the ground for only a few seconds before taking off again, this automatic adjustment to the stabilizer is undesirable. Thus, in order to assist with the imminent takeoff, the instructor would use the manual trim wheel in the cockpit to override the computer and hold the stabilizer in a nose up position for takeoff. Using the manual trim wheel disconnects the pitch trim actuator (PTA), which sends electrical signals from the computers to the THSA, and in its place connects a device called the override mechanism, which inserts itself downstream of the PTA and transfers pilot inputs to the THSA instead of computer inputs.
Meanwhile, the ELAC constantly compares the commanded stabilizer position with its actual position in order to detect any mechanical failure in the system. It accomplishes this by matching values from two data channels: the command channel, which transmits the commanded position, and the monitoring channel, which transmits the actual position. If a discrepancy between the two lasts more than one second, a fault is triggered. In its default state, this very simple function would be unable to tell the difference between a pilot overriding the autopilot using the trim wheel and failure of the horizontal stabilizer, because the computer will continue to send signals to the PTA despite the fact that it is no longer connected to the stabilizer. To rectify this, whenever the pilot applies torque to the trim wheel in the cockpit, an override piston inside the PTA moves downwards and contacts three microswitches, which transmit a signal telling the computers to stop comparing the command channel with the monitoring channel.
However, this system on this particular A320 contained a tiny flaw: the oil used on the override piston was twice as viscous as the oil called for in the manual. As a result, the friction on the override piston was too high, and it would sometimes fail to extend far enough to contact the microswitches. This meant that during many of the touch-and-go landings, the instructor would override the computers by using the manual pitch trim wheel, but the microswitches wouldn’t make contact with the piston, and the computer would continue comparing the command and monitoring channels. Detecting that its commands were not affecting the stabilizer position, ELAC 1, which normally controls the stabilizer, would register a fault and shut down. Its duties would be transferred to ELAC 2, and an “ELAC 1 PITCH FAULT” caution message would appear on the screen between the two pilots.
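To make the role of the microswitches concrete, here is a minimal Python sketch of a command/monitoring comparison with a one-second persistence timer and an override inhibit. It is an illustration under stated assumptions rather than the real ELAC software: the class name, the 0.5-degree tolerance, the stabilizer angles, and the 100-millisecond step are all invented for the example. Only the one-second persistence and the inhibit signal come from the description above.

```python
PERSISTENCE_MS = 1000  # discrepancy must last more than one second (from the text)
TOLERANCE_DEG = 0.5    # hypothetical tolerance, chosen only for this example


class StabilizerMonitor:
    """Toy model of a command-versus-actual position check."""

    def __init__(self):
        self.mismatch_ms = 0
        self.faulted = False

    def step(self, commanded_deg, actual_deg, override_signal, dt_ms):
        """Advance the monitor by dt_ms; return True once a fault is latched."""
        if self.faulted:
            return True
        mismatch = abs(commanded_deg - actual_deg) > TOLERANCE_DEG
        if override_signal or not mismatch:
            # Either the pilot is known to be overriding, or the positions agree.
            self.mismatch_ms = 0
        else:
            self.mismatch_ms += dt_ms
            if self.mismatch_ms > PERSISTENCE_MS:
                self.faulted = True
        return self.faulted


def pilot_holds_trim(microswitches_closed, duration_ms=2000, dt_ms=100):
    """Pilot holds the stabilizer nose up while the computer commands nose down."""
    monitor = StabilizerMonitor()
    for _ in range(duration_ms // dt_ms):
        monitor.step(commanded_deg=-2.0, actual_deg=1.0,
                     override_signal=microswitches_closed, dt_ms=dt_ms)
    return monitor.faulted


print(pilot_holds_trim(microswitches_closed=True))   # False: comparison inhibited
print(pilot_holds_trim(microswitches_closed=False))  # True: sticky piston, fault latched
```

The only difference between the two runs is whether the override signal arrives. With the piston sticking, the same pilot action that is normally ignored looks to the computer like a stabilizer that refuses to follow its commands.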
Because the failure of one ELAC is considered an advisory matter without major safety implications, this message would only appear after the plane climbed through 1,500 feet after each touch-and-go landing, so as not to distract the crew during takeoff. Every time this message appeared, the pilots simply turned ELAC 1 off and back on again and the problem would go away. But on one of the touch-and-go landings with the third trainee, the pilots failed to notice the caution message and never reset ELAC 1.
As the fourth trainee came in for their third touch-and-go some minutes later, ELAC 1 was still off, and the stabilizer trim was being controlled by ELAC 2. As the instructor grasped the trim wheel to prevent the stabilizer from moving nose down, the same sequence of events happened again: the microswitches didn’t connect, the computer detected a discrepancy between the commanded position and the actual position, and ELAC 2 registered a fault and shut down.
With both ELACs inoperative, control over the elevators and the horizontal stabilizer was transferred to SEC 1 (Spoiler Elevator Computer 1). But a remarkable coincidence was about to plunge everyone on board into much greater danger.
The main role of the SECs is to facilitate the automatic deployment of the spoilers upon touchdown. The spoilers are panels that lift up from the wings to reduce lift and force the plane into the runway. In order to determine whether the spoilers should be deployed, the SECs constantly monitor sensors in the landing gear that detect whether the plane is on the ground or in the air. Although the pilots did not arm the spoilers for the touch-and-go landings, the SECs all the same continued to monitor these sensors.
The SECs get their landing gear information from a pair of computers called the Landing Gear Control Interface Units (LGCIUs). Like the ELACs, the SECs have a command channel and a monitoring channel in order to detect failures in either LGCIU. The command channel is supplied from LGCIU 1, while the monitoring channel receives data from LGCIU 2; thus, if one unit says the plane is on the ground while the other says it’s in the air, the SECs will detect this discrepancy and shut off, preventing the spoilers from automatically deploying in flight due to a faulty indication by one of the LGCIUs.
At almost the exact moment that control over the elevators and horizontal stabilizer was transferred to the SECs, the plane bounced off the runway and became airborne for approximately one second. By sheer coincidence, this event poked a hole in the logic used by the SECs to determine whether there was a fault with the LGCIUs. Although both LGCIUs send data about the plane’s air/ground status to the SECs every 120 milliseconds, they are not synchronized, meaning they don’t send this data at the same time.
If the SECs receive indications from both LGCIUs that the plane is in the air for a period of one second or more, the SECs switch to “flight mode.” Flight mode remains active for a minimum of 20 seconds. But because the command and monitoring channels do not sample data synchronously, it was possible for one channel to detect that the plane was airborne while the other channel did not, as long as the plane was in the air for only slightly more than one second — equivalent to about nine 120-millisecond sampling intervals, after rounding. Therefore, if the plane became airborne for, say, 1.15 seconds, it was possible for either nine or eight complete sampling intervals to fall within that 1.15-second period, depending on the exact timing of the intervals (for a more detailed explanation of this example, refer to the diagram below). So if the intervals line up such that the command channel detects an airborne status over nine intervals and the monitoring channel detects an airborne status over eight intervals, the command channel will see that the one-second threshold has been met and will switch to flight mode for 20 seconds, while the monitoring channel detects an airborne period less than one second and stays in ground mode. With one channel in flight mode and the other in ground mode, the SECs believe that there has been a failure of one of the LGCIUs, and they both shut off (since they both get their data from the same source).
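The arithmetic behind the eight-versus-nine split can be checked with a few lines of Python. This is a simplified model, not the actual SEC software: it assumes each channel measures the airborne time as the number of complete 120-millisecond intervals between the samples that fall inside the airborne window, and the phase offsets (0 ms and 100 ms) are hypothetical values chosen to show the effect.

```python
SAMPLE_PERIOD_MS = 120      # each LGCIU sends data every 120 ms (from the text)
FLIGHT_THRESHOLD_MS = 1000  # airborne for one second or more -> flight mode


def complete_intervals(phase_ms, airborne_ms):
    """Count complete sampling intervals inside the window [0, airborne_ms].

    phase_ms is the time of the channel's first sample after liftoff.
    """
    first = phase_ms % SAMPLE_PERIOD_MS
    if first > airborne_ms:
        return 0
    samples = (airborne_ms - first) // SAMPLE_PERIOD_MS + 1
    return samples - 1


def sees_flight(phase_ms, airborne_ms):
    """True if this channel measures at least one second of airborne time."""
    measured_ms = complete_intervals(phase_ms, airborne_ms) * SAMPLE_PERIOD_MS
    return measured_ms >= FLIGHT_THRESHOLD_MS


def miscompare(command_phase_ms, monitor_phase_ms, airborne_ms):
    """True if the two channels disagree about flight versus ground."""
    return (sees_flight(command_phase_ms, airborne_ms)
            != sees_flight(monitor_phase_ms, airborne_ms))


print(complete_intervals(0, 1150))    # 9 intervals = 1080 ms -> flight mode
print(complete_intervals(100, 1150))  # 8 intervals = 960 ms  -> ground mode
print(miscompare(0, 100, 1150))       # True: channels disagree
print(miscompare(0, 100, 5000))       # False: a long flight is unambiguous
print(miscompare(0, 100, 500))        # False: a short hop is unambiguous
```

In this model, only a bounce lasting just over one second can leave the two channels on opposite sides of the threshold. Anything clearly shorter or clearly longer produces agreement, which is part of why the flaw stayed hidden for so long.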
This is precisely what occurred when flight 9001 bounced for about one second during the touch-and-go landing in Tallinn. With both ELACs already having failed, SEC 1 was responsible for the plane’s pitch control surfaces when it detected a flight/ground law miscompare and shut itself off. Control should have been transferred to SEC 2, but this computer had already shut down for the exact same reason. Remarkably, all four computers capable of controlling the elevators and horizontal stabilizer had now failed, and the elevators were locked in the neutral position!
At that moment, the trainee pilot attempted to pull back on his side stick in order to climb away from the runway, but to his surprise, the plane didn’t respond to his inputs.
Seeing that they were not climbing, the instructor commanded, “Rotate! Rotate!”
“I am rotating!” the student replied.
Suddenly, a pair of red warning messages appeared on the screen: “F/CTL L+R ELEV FAULT,” and “USE MAN PITCH TRIM ONLY,” along with a loud, continuous chime. These messages meant that both elevators had failed and the pilots would only be able to control pitch using the manual trim wheels, which were mechanically connected to the stabilizer.
As yet unaware of the nature of the problem, the instructor announced that he had control and attempted to pull back with the side stick, but he too was unable to effect any response. The plane began to climb ever so slowly, under the influence of the slightly nose up stabilizer position.
At this point the instructor might have briefly considered attempting to abort the takeoff, as he reduced engine power to idle for four seconds. But it quickly became apparent that they would not have enough room to stop on the runway, so he reversed his input, applying takeoff power and initiating the normal takeoff procedures, including retracting the landing gear and flaps.
Normally the loss of lift triggered by retracting the flaps after takeoff is countered by the fact that the plane is pointed nose up and is climbing with a high angle of attack. But in this case, with the plane flying almost nose-level, the retraction of the flaps in combination with the brief reduction in thrust was sufficient to put the plane into a descent. From its maximum height of 48 feet, the A320 slowly dropped to the ground until it struck the runway 200 meters before the end. As the landing gear was partially stowed at the moment of impact, the plane touched down on its engines, which scraped along the runway, throwing up a shower of sparks and causing major damage to critical engine accessories. Nevertheless, the plane’s speed continued to increase until it became airborne again two seconds later, at which point it rapidly pitched upward and began to climb at a rate of 6,000 feet per minute.
As the plane zoomed steeply upwards, the instructor tried to push the nose down, but again, his inputs had no effect. Simultaneously, several new warnings appeared, indicating that the flaps had jammed and engine #2 was on fire. The master caution alarm now began blaring on top of the continuous chime.
“Manual pitch trim only, manual pitch trim only!” the safety pilot exclaimed from one of the observer seats. Realizing that he needed to use the stabilizer trim to pitch down, the instructor grabbed the manual trim wheel and started turning it to the nose down position, but he overcorrected. The plane pitched sharply downward, entering a dive of 7,200 feet per minute from a height of just 1,500 feet above the ground. The ground proximity warning blared, “SINK RATE! TERRAIN! TERRAIN! WHOOP WHOOP, PULL UP,” further adding to the cacophony of alarms already filling the cockpit. As the rapidly approaching ground filled his windscreen, the instructor furiously cranked the trim wheel in the opposite direction, and the plane pulled out of the dive at a height of 596 feet, pulling 2.4 G’s as it swooped into another precipitous climb.
After the plane climbed sharply to a height of 1,200 feet, the instructor finally managed to halt the rollercoaster ride using a combination of stabilizer trim, roll inputs, and engine thrust. Noting all the alarms, he asked, “Do we have engines?”
“We have engine two fire!” said the safety pilot.
The instructor shouted “Mayday, mayday, mayday,” but forgot to key his microphone to broadcast to air traffic control.
The safety pilot now began to read off the list of warning messages. “So we have flaps lock, flight control law, left/right elevator fault, maximum speed 320, manual pitch trim use, do not use speed brakes,” he said. As he listed off all the failures, the plane continued to oscillate between nose up and nose down as the instructor struggled to maintain control.
As the instructor nursed the plane into a 180-degree turn to return to the runway, the safety pilot called the tower and said, “Mayday, mayday, mayday, we have flight control fail!” The controller cleared them to make the right turn and approach the runway visually; meanwhile, the safety pilot took over the first officer’s seat, while the student and the ECAA inspector returned to the cabin.
The instructor pressed the master caution button to silence the loud alarm. “What is the heading of the runway?” he asked.
“262,” said the safety pilot. “Tallinn tower,” he added over the radio, “we are going for runway 26!” He also announced that they had an engine fire and requested that fire trucks respond to the plane after landing.
By now they were most of the way through the turn, coming in obliquely with the runway in sight. This was no matter; in a pinch, they could line up at the last moment.
The safety pilot suggested that they shut down the burning engine. In a move that might have saved lives, the instructor refused. “If I am losing an engine and manual flying,” he said, “I prefer to land when engines are working.” Since the damaged engine was still producing power, he figured the correct move would be to milk it for as much thrust as it could give.
72 seconds from touchdown. The safety pilot lowered the landing gear.
19 seconds later, engine two gave up the ghost and quit on its own. A fire alarm began sounding, and an “ENG 2 FAIL” warning message appeared on the screen. “Engine two is shut down!” said the safety pilot. But the instructor barely reacted. The engine failure didn’t matter — the runway was dead ahead, and all he needed to do was reach it.
33 seconds from touchdown. The damaged #1 engine also failed. The plane briefly lost all electrical power; the instruments went dark and the black boxes stopped recording. The remaining computers shut down, leaving the pilots unable to use any of the flight controls except the stabilizer trim and rudder, which had mechanical backups.
Seconds later, the Ram Air Turbine, or RAT, deployed from the bottom of the fuselage to power critical systems. Some of the flight controls came back online, and the cockpit voice recorder started recording again.
“Gear is down. We don’t have engines!” said the safety pilot. “Speed 150!” Their airspeed was dropping fast. “Speed 130! Speed 120!”
The instructor milked their remaining speed for all it was worth. Slowing the plane to barely 100 knots, far lower than the normal approach speed, he managed to glide it almost to the runway threshold. The plane touched down hard in the snow 150 meters short of the runway, throwing up a powdery white cloud. Amid the pop-pop-pop of bursting tires, the A320 skidded out onto the asphalt, rolled off the left side of the runway, and ground to a halt.
The impact was rough enough that the safety pilot and one of the students suffered minor injuries, including a concussion, but no one was seriously hurt. With the right engine possibly still on fire, the instructor ordered an evacuation. As fire trucks sprayed down the plane with foam, all seven people on board exited via the escape slides, walking away into the mid-afternoon twilight. Against all odds, they had saved their plane, and their lives.
As investigators from the Estonian Safety Investigation Bureau (ESIB) arrived at the scene, they found that the damage to the plane was much greater than anyone had previously realized. Both engines were damaged beyond repair after striking the runway. All the tires had burst and the wheel rims were worn flat on the bottom. The fuselage was warped and dented in multiple places, and there were indentations and punctures in the lower body skin and the wing fairings. The landing gear was extensively damaged, and the landing gear doors were in a sorry state — at least, those which were still attached to the plane. Some of the landing gear doors had actually separated in flight and were found several kilometers from the airport. In fact, the damage was so widespread that SmartLynx had to write off the plane as a total loss, and it was eventually sold to the German army to use in training simulations for the special forces.
Despite the extensive damage, however, investigators quickly found that all of it occurred as a result of the loss of control: prior to the unexpected impact with the runway on takeoff, there was no damage to the plane whatsoever. Tests showed that all the flight controls were working normally, and so were the computers that controlled them. Investigators eventually realized that the faults with the Elevator Aileron Computers (ELACs) corresponded to the points at which the instructor used the manual trim wheel to override the autopilot, which led them to the discovery that the override piston which was supposed to tell the computer that the pilot is in control was not consistently making contact with the microswitches. This in turn led to the discovery that the wrong type of oil had been used to grease the override piston, causing the piston to stick.
The origin of the incorrect oil proved impossible to trace. The last recorded overhaul on the stabilizer actuators occurred in the United States in 2017, but the records showed that the correct type of oil was used and that the override piston performed normally afterwards. Were the records wrong, or had someone else replaced the oil in the months since? The ESIB was never able to answer this question.
Nevertheless, it seemed remarkable that such a tiny error could lead to a near-disaster. Could the wrong oil on an obscure piston have really caused the failure of four computers, the loss of multiple flight controls, and the failure of both engines? Incredibly, the answer was yes.
The failure of the piston to make contact with the microswitches led both ELACs to erroneously read the instructor’s manual inputs as a fault with the stabilizer, causing one to trip off, followed by the other sometime later. When the SECs kicked in as the backup pitch control computers, a design flaw in their flight/ground law logic caused the computers to falsely detect a failure of one or more Landing Gear Control Interface Units when the plane bounced into the air for around one second. This caused both SECs to shut down simultaneously, leading to a loss of control over the elevators. Automatic centering units kicked in to lock the elevators in the neutral position, preventing them from making random, uncommanded movements, while warning messages informed the crew that they would need to control pitch using the stabilizer only. But before the pilots fully comprehended what was going on, the plane went into a descent which they could not figure out how to counter, causing the engines to strike the runway as the landing gear was in transit. The damage to the engines further complicated an already precarious situation.
Despite almost losing control while trying to figure out how to fly without elevators, the instructor pilot managed to wrangle the plane into something like level flight, thanks in part to timely callouts by the safety pilot. From that point onward, the instructor and the safety pilot worked together as an effective team, managing their priorities amid a sea of failures and warnings that might have overwhelmed a less experienced crew. The instructor’s 24,000 flying hours gave him the judgment he needed to decide what he should focus on and what he could ignore. Lining up with the runway from the usual distance? Not necessary. Shutting down the malfunctioning engines, per proper procedure? A bad idea, as long as they were still generating power. He knew precisely where he needed to bend the rules to save his plane, and it’s a good thing he did: if he had tried to make a stabilized final approach while shutting down the malfunctioning engines, they wouldn’t have reached the runway, and the plane probably would have crashed into a forest.
Investigators also looked into why the training flights were continued despite the repeated “ELAC 1 PITCH FAULT” messages. They found that the manual did not provide a limit on the number of times this warning could appear before it would be considered prudent to stop flying. Therefore, the crew had no reason to believe that they couldn’t just keep turning the computer off and back on again indefinitely. Theoretically this was true, but when they forgot to reset it after a touch-and-go with the third student, a layer of redundancy was removed for a relatively long period. Furthermore, the fact that the caution messages were inhibited until the plane reached 1,500 feet prevented the instructor from making the connection between his trim inputs and the warning, lending the messages a certain abstractness that might have led him to take them less seriously. And finally, taking a plane out of service for a long period to conduct training flights is expensive, and he might have felt pressure to ensure that the training session was completed within the allotted timeframe.
As a result of the near-disaster, several changes were made. Airbus modified its procedures for training on touch-and-go landings, directing pilots to consult the minimum equipment list should a failure occur during such a flight, and specifying that for this purpose each touch-and-go landing should be considered the start of a new flight. Airbus also changed its procedures to instruct pilots not to reset the ELACs multiple times in flight, which in turn makes an ELAC pitch fault message reasonable grounds to halt flying activities until the problem is fixed. In addition, Airbus improved the software design of the ELACs to mitigate the consequences of the type of failure that occurred on flight 9001, and corrected the design flaw which led to the shutdown of both SECs. SmartLynx Estonia made a number of changes as well: airplanes with less than one year of SmartLynx maintenance records can no longer be used for training flights, and instructors must halt training flights if there is an error message relating to the flight controls (or any other critical system). The airline also emphasized that commanders have the right to stop flight activity at any time if they feel that there is a threat to their safety.
The near crash of SmartLynx Estonia flight 9001 represents one of the most serious malfunctions of Airbus’s fly-by-wire technology since its introduction in 1988. At the same time, this sequence of events was wildly improbable. A dizzying number of specific events with specific timing had to come together to override all the layers of protection built into the A320’s advanced computer systems. Even without changing anything about the airplane, the odds are not necessarily in favor of this ever happening again.
Perhaps the best takeaway from this accident is just how safe modern airliners have become. Four flight control computers failed consecutively, both engines quit, and electrical power was lost. And yet, even after so many things went wrong, there were still more backup systems. The A320 was designed to be flyable using only the mechanical backups for the stabilizer trim and rudder in the event of the complete failure of all its computers. And in practice, it turned out that this worked. But we should not take for granted the true last line of defense: the pilots themselves. In the present day it is rare that enough failures occur to place the pilot’s skill as the only barrier between salvation and disaster. But on that cold afternoon over Estonia, it happened — a reminder of why we should still expect airline pilots to be the best of us, the ones who, when all layers of protection fall down defeated and death is on the line, can rise to meet the moment.