The Accident to Qantas Flight 72, VH-QPA, in October 2008

21 12 2011

The Airbus A330-303 VH-QPA experienced uncommanded nose-down pitch commands while in cruise at FL370. Many unrestrained occupants were thrown to the ceiling, and some were severely injured. The crew declared an emergency and landed as soon as practicable, at Learmonth, where the injured were treated and several hospitalised. It has been known for a while that the accident was caused by data anomalies from an air data inertial reference unit (ADIRU) which were not filtered out by the primary flight control computers (Flight Control Primary Computers, FCPC, also known as PRIM). However, it has been a mystery – and remains so – how the anomalous data values were generated. The anomaly has occurred three times: twice with the unit on VH-QPA, and once with another unit on another aircraft, also Qantas, also in Western Australia, within a couple of months of this accident.

The fix is apparently to modify the built-in test equipment (BITE) of the ADIRU specifically to look for such anomalies, and to modify the data-filtering algorithms of the A330’s Flight Control Primary Computers.

The Final Report is now available on the ATSB WWW site.

There was a note from Andrew Heasley in Risks 26-67 with a title saying the accident was “Blamed on Software”, pointing to a newspaper article. I find this claim misleading. The problem which arose had nothing to do with anything for which any software engineer would have been responsible.

The fixes were implemented in both SW and HW, but fixes to non-SW problems are very often implemented in SW.

The PRIMs ran a data-assurance algorithm for data received from three different ADIRUs, which are electronic boxes built by a different manufacturer. This data assurance algorithm had a specific vulnerability to spiky angle-of-attack (AoA) data presented in a particular time-sequential manner, which was exploited during the occurrence. The algorithm, which uses AoA data from three ADIRUs, filters out multiple data spikes from a unit which occur within a specific time frame. Spikes on the culprit ADIRU occurred with similar values just over the boundary of this time frame, and were thus taken as veridical by the PRIMs. The resolution algorithms for the AoA data (with that from the other ADIRU units) in the PRIMs let these values through, and the PRIMs reacted accordingly by commanding sudden nose-down pitch.
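To make the time-window vulnerability concrete, here is a toy model in Python. It is a sketch under stated assumptions, not the FCPC code. The 1.2-second hold and the averaging of AOA 1 and AOA 2 come from the report's executive summary; the deviation threshold, the sample interval, the spike values and, above all, the rule that the first sample after the hold is accepted without a fresh check are hypothetical choices made only to reproduce the kind of behaviour described.

```python
from statistics import median

HOLD_MS = 1200          # hold period from the report summary (1.2 seconds)
THRESHOLD_DEG = 5.0     # hypothetical deviation threshold


def aoa_used(samples, hold_ms=HOLD_MS, threshold_deg=THRESHOLD_DEG):
    """Return the AoA value used at each sample.

    samples: list of (t_ms, aoa1, aoa2, aoa3) tuples in time order.
    """
    used = []
    memorised = None    # last value accepted after a successful check
    hold_end = None     # end of the current hold period, if any
    for t_ms, a1, a2, a3 in samples:
        mid = median((a1, a2, a3))
        deviates = (abs(a1 - mid) > threshold_deg
                    or abs(a2 - mid) > threshold_deg)
        mean12 = (a1 + a2) / 2.0
        if hold_end is not None and t_ms < hold_end:
            # Inside the hold period: keep using the memorised value.
            used.append(memorised)
        elif hold_end is not None:
            # ASSUMPTION: the first sample after the hold is used unchecked.
            hold_end = None
            used.append(mean12)
        elif deviates and memorised is not None:
            # Spike detected: start a hold and use the memorised value.
            hold_end = t_ms + hold_ms
            used.append(memorised)
        else:
            memorised = mean12
            used.append(mean12)
    return used


def make_samples(spike_times_ms, n=20, step_ms=200, base=2.0, spike=50.0):
    """Build n samples; AOA 1 carries a spike at each listed time."""
    return [(i * step_ms,
             spike if i * step_ms in spike_times_ms else base,
             base, base)
            for i in range(n)]


single = aoa_used(make_samples({1000}))          # one isolated spike
inside = aoa_used(make_samples({1000, 1600}))    # second spike within hold
boundary = aoa_used(make_samples({1000, 2200}))  # second spike 1.2 s later

print(max(single), max(inside), max(boundary))
```

In this toy model an isolated spike, and a second spike inside the hold period, are both filtered out, whereas a second spike arriving just as the hold expires passes straight through to the value used for control.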

Responsibility for the design of such algorithms lies clearly with those who are experts on the engineering of electronic data generation and transmission equipment, not with any software engineers.

To give a similar example with which I have recently been involved, it turns out that signals of certain frequencies in AC electric circuits can bypass the Type A and Type B circuit protection equipment (circuit breakers) that are required in most electric circuits (household and industrial) in Germany. A committee on which I sit has recently considered attaching equipment which is, as far as we know, theoretically capable of generating such frequencies to such circuits. It is a similar situation, namely how to handle anomalous signals, but with no SW in sight. Pure electrical engineering.

Concerning my earlier note here on Certification Requirements for Commercial Airplanes, I find it interesting and commendable that the Bureau considered likelihoods of events in their summary (quoted below). However, I don’t believe they formulated it in quite the words I would have liked to have read.

They give reason to classify the event as “hazardous”, and with a fleet operating experience of 28 million flight hours this occurrence fits within the expected value (a technical term) of the operating time within which the effects of a hazardous event may occur (defined to be less than or equal to one occurrence within ten million operating hours), according to the acceptable means to determine compliance with certification criteria (now known as AMC 25). Notice it is not the event itself of which they assess the occurrence – that has occurred three times – but the deleterious effects upon safety of the event, which have only occurred once.
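The arithmetic behind that statement is short enough to write down. The following Python lines are only an illustrative sketch: the figures are the ones quoted above (one occurrence with hazardous effects, 28 million fleet flight hours, at most one occurrence per ten million hours), and the variable names are made up for the example.

```python
FLEET_HOURS = 28e6            # A330/A340 fleet flight hours cited in the report
HAZARDOUS_MAX_RATE = 1e-7     # at most one occurrence per ten million hours
observed_hazardous = 1        # occurrences with hazardous effects so far

# Point estimate of the rate of hazardous effects per flight hour.
observed_rate = observed_hazardous / FLEET_HOURS

# Expected number of occurrences if the true rate sat exactly at the limit.
expected_at_limit = HAZARDOUS_MAX_RATE * FLEET_HOURS

print(f"observed rate:     {observed_rate:.2e} per hour")
print(f"expected at limit: {expected_at_limit:.1f} occurrences")
print("within limit:", observed_rate <= HAZARDOUS_MAX_RATE)
```

One occurrence against an expectation of about 2.8 at the limit is what "fits within the expected value" amounts to here.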

They speak of “certification requirements”. Strictly speaking, this is incorrect. The certification requirements are expressed in CS 25 and do not involve probabilities. The severity classification terms “catastrophic”, “hazardous”, etc., and their associated acceptable/unacceptable frequencies occur in risk-matrix-type form in the Acceptable Means of Compliance document which accompanies the certification requirements (AMC 25), not in the requirements themselves. (I note that these documents were called something slightly different at A330 certification time, 1993.)

The certification requirements themselves are quite clear: the airplane shall behave in such-and-such a manner. If a wing falls off, or a flight control computer sends it into a loop, it is obviously not behaving in that manner, and thus violates the certification requirements. However, it is accepted that one cannot provide proof that such untoward things will never ever happen (Will the sun rise tomorrow? Will your steering wheel come off in your hands? Will your control sidestick come out of its holder in your hand?), so a less strenuous regime based on arguing likelihoods is defined as an “Acceptable Means of Compliance” with the regulations for purpose of certification.

This is not hair-splitting. It has consequences, in particular in this case, for how anomalies are dealt with, as follows.

If the requirement were that, say, “hazardous effects shall only occur on average once in between 10^7 and 10^9 operating hours”, which is what the AMC says you have to show to demonstrate compliance acceptably, then it would have been open to the manufacturer to do nothing in reaction to the QF72 event: the hazardous effects occurred only within the expected time value of their occurrence. If you think about it, it would also be open to a manufacturer to do nothing until the second occurrence of any hazardous or indeed catastrophic effects, even if the problem occurred first within the early experience of flying the aircraft! This is simply a consequence of the meaning of the probabilistic concepts used.

Whereas, as things now stand, with the requirements (which are absolute) kept separate from acceptable means of compliance (which may be based on occurrence frequency), any in-flight anomalous behavior must be fixed or the airworthiness certificate will be withdrawn. This is because such behavior violates the written requirements, that the aircraft shall not behave that way. To repeat, the conditions on behavior are absolute, not likelihood-based.

And that is how one wants things: The requirements are absolute, but it is accepted that in science and engineering you are often only convinced to some degree, so it is regarded as acceptable to argue your conviction up to a certain degree, and not to have to prove it, which would likely be impossible. But if something does go wrong, you want it fixed right away.

One can argue that any given set of occurrences is compatible with any probability requirement whatever, and thus that probabilistic requirements are inappropriate to determine airworthiness in any case. However, I don’t think such an argument works. Say these three events had occurred within 3 million operating hours, each with damage. One could estimate the likelihood that a piece of equipment fulfilling the condition of an expected value of at most once in 10 million operating hours exhibits three events within 3 million operating hours. One would conclude that it is unlikely, say with small probability P. It follows that the situation that the aircraft fulfills the acceptable-compliance criterion has the same probability P. The small probability P that the aircraft acceptably complied with certification requirements would provide good reason for withdrawing the airworthiness certificate.
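For a rough idea of how small P would be, one can put numbers on the hypothetical. The sketch below assumes, purely for illustration, that events arrive as a Poisson process at a constant rate sitting exactly at the limit of once in 10 million operating hours; neither the report nor the AMC prescribes this model, and the function name is made up for the example.

```python
import math


def poisson_tail(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam)."""
    below_k = sum(math.exp(-lam) * lam**i / math.factorial(i)
                  for i in range(k))
    return 1.0 - below_k


RATE_LIMIT = 1e-7        # one event per 10 million operating hours
HOURS = 3e6              # hypothetical 3 million operating hours

lam = RATE_LIMIT * HOURS             # expected number of events: 0.3
p = poisson_tail(3, lam)             # about 0.0036

print(f"expected events: {lam:.1f}")
print(f"P(at least 3 events): {p:.4f}")
```

Under these assumptions three or more events in 3 million hours would be expected in fewer than four cases out of a thousand, which is the kind of figure that makes the hypothesis of acceptable compliance hard to sustain.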

Concerning the data anomaly itself stemming from the ADIRU, its cause remains a mystery. The report says:


Some of the potential triggering events examined by the investigation included a software ‘bug’, software corruption, a hardware fault, physical environment factors (such as temperature or vibration), and electromagnetic interference (EMI) from other aircraft systems, other on-board sources, or external sources (such as a naval communication station located near Learmonth). Each of these possibilities was found to be unlikely based on multiple sources of evidence. The other potential triggering event was a single event effect (SEE) resulting from a high-energy atmospheric particle striking one of the integrated circuits within the CPU module. There was insufficient evidence available to determine if an SEE was involved, but the investigation identified SEE as an ongoing risk for airborne equipment.

The report says that the manufacturer is developing a modification to the BITE to detect such failure modes:


Without knowing the exact failure mechanism, there was limited potential for the ADIRU manufacturer to redesign units to prevent the failure mode. However, it will develop a modification to the BITE to improve the probability of detecting the failure mode if it occurs on another unit.

Here is the executive summary. It is well and concisely written. I include the three paragraphs about seat belts and the investigative process for completeness.

Executive Summary

At 0132 Universal Time Coordinated (0932 local time) on 7 October 2008, an Airbus A330-303 aircraft, registered VH-QPA and operated as Qantas flight 72, departed Singapore on a scheduled passenger transport service to Perth, Western Australia. At 0440:26, while the aircraft was in cruise at 37,000 ft, ADIRU 1 started providing intermittent, incorrect values (spikes) on all flight parameters to other aircraft systems. Soon after, the autopilot disconnected and the crew started receiving numerous warning and caution messages (most of them spurious). The other two ADIRUs performed normally during the flight.

At 0442:27, the aircraft suddenly pitched nose down. The FCPCs commanded the pitch-down in response to AOA data spikes from ADIRU 1. Although the pitch-down command lasted less than 2 seconds, the resulting forces were sufficient for almost all the unrestrained occupants to be thrown to the aircraft’s ceiling. At least 110 of the 303 passengers and nine of the 12 crew members were injured; 12 of the occupants were seriously injured and another 39 received hospital medical treatment. The FCPCs commanded a second, less severe pitch-down at 0445:08.

The flight crew’s responses to the emergency were timely and appropriate. Due to the serious injuries and their assessment that there was potential for further pitch-downs, the crew diverted the flight to Learmonth, Western Australia and declared a MAYDAY to air traffic control. The aircraft landed as soon as operationally practicable at 0532, and medical assistance was provided to the injured occupants soon after.

FCPC design limitation

AOA is a critically important flight parameter, and full-authority flight control systems such as those equipping A330/A340 aircraft require accurate AOA data to function properly. The aircraft was fitted with three ADIRUs to provide redundancy and enable fault tolerance, and the FCPCs used the three independent AOA values to check their consistency. In the usual case, when all three AOA values were valid and consistent, the average value of AOA 1 and AOA 2 was used by the FCPCs for their computations. If either AOA 1 or AOA 2 significantly deviated from the other two values, the FCPCs used a memorised value for 1.2 seconds. The FCPC algorithm was very effective, but it could not correctly manage a scenario where there were multiple spikes in either AOA 1 or AOA 2 that were 1.2 seconds apart.

Although there were many injuries on the 7 October 2008 flight, it is very unlikely that the FCPC design limitation could have been associated with a more adverse outcome. Accordingly, the occurrence fitted the classification of a ‘hazardous’ effect rather than a ‘catastrophic’ effect as described by the relevant certification requirements. As the occurrence was the only known case of the design limitation affecting an aircraft’s flightpath in over 28 million flight hours on A330/A340 aircraft, the limitation was within the acceptable probability range defined in the certification requirements for a hazardous effect.

As with other safety-critical systems, the development of the A330/A340 flight control system during 1991 and 1992 had many elements to minimise the risk of a design error. These included peer reviews, a system safety assessment (SSA), and testing and simulations to verify and validate the system requirements. None of these activities identified the design limitation in the FCPC’s AOA algorithm.

The ADIRU failure mode had not been previously encountered, or identified by the ADIRU manufacturer in its safety analysis activities. Overall, the design, verification and validation processes used by the aircraft manufacturer did not fully consider the potential effects of frequent spikes in data from an ADIRU.

ADIRU data-spike failure mode

The data-spike failure mode on the LTN-101 model ADIRU involved intermittent spikes (incorrect values) on air data parameters such as airspeed and AOA being sent to other systems as valid data without a relevant fault message being displayed to the crew. The inertial reference parameters (such as pitch attitude) contained more systematic errors as well as data spikes, and the ADIRU generated a fault message and flagged the output data as invalid. Once the failure mode started, the ADIRU’s abnormal behaviour continued until the unit was shut down. After its power was cycled (turned OFF and ON), the unit performed normally.

There were three known occurrences of the data-spike failure mode. In addition to the 7 October 2008 occurrence, there was an occurrence on 12 September 2006 involving the same ADIRU (serial number 4167) and the same aircraft. The other occurrence on 27 December 2008 involved another of the same operator’s A330 aircraft (VH-QPG) but a different ADIRU (serial number 4122). However, no factors related to the operator’s aircraft configuration, operating practices or maintenance practices were found to be associated with the failure mode.

Many of the data spikes were generated when the ADIRU’s central processor unit (CPU) module intermittently combined the data value from one parameter with the label for another parameter. The exact mechanism that produced this problem could not be determined. However, the failure mode was probably initiated by a single, rare type of trigger event combined with a marginal susceptibility to that type of event within the CPU module’s hardware. The key components of the two affected units were very similar, and overall it was considered likely that only a small number of units exhibited a similar susceptibility.

Some of the potential triggering events examined by the investigation included a software ‘bug’, software corruption, a hardware fault, physical environment factors (such as temperature or vibration), and electromagnetic interference (EMI) from other aircraft systems, other on-board sources, or external sources (such as a naval communication station located near Learmonth). Each of these possibilities was found to be unlikely based on multiple sources of evidence. The other potential triggering event was a single event effect (SEE) resulting from a high-energy atmospheric particle striking one of the integrated circuits within the CPU module. There was insufficient evidence available to determine if an SEE was involved, but the investigation identified SEE as an ongoing risk for airborne equipment.

The LTN-101 had built-in test equipment (BITE) to detect almost all potential problems that could occur with the ADIRU, including potential failure modes identified by the aircraft manufacturer. However, none of the BITE tests were designed to detect the type of problem that occurred with the air data parameters.

The failure mode has only been observed three times in over 128 million hours of unit operation, and the unit met the aircraft manufacturer’s specifications for reliability and undetected failure rates. Without knowing the exact failure mechanism, there was limited potential for the ADIRU manufacturer to redesign units to prevent the failure mode. However, it will develop a modification to the BITE to improve the probability of detecting the failure mode if it occurs on another unit.

Use of seat belts

At least 60 of the aircraft’s passengers were seated without their seat belts fastened at the time of the first pitch-down. Consistent with previous in-flight upset accidents, the injury rate, and injury severity, was substantially greater for those who were not seated or seated without their seat belts fastened.

Passengers are routinely reminded every flight to keep their seat belts fastened during flight whenever they are seated, but it appears some passengers routinely do not follow this advice. This investigation provided some insights into the types of passengers who may be more likely not to wear seat belts, but it also identified that there has been very little research conducted into this topic by the aviation industry.

Investigation process

The Australian Transport Safety Bureau investigation covered a range of complex issues, including some that had rarely been considered in depth by previous aviation investigations. To do this, the investigation required the expertise and cooperation of several external organisations, including the French Bureau d’Enquêtes et d’Analyses pour la sécurité de l’aviation civile, US National Transportation Safety Board, the aircraft and FCPC manufacturer (Airbus), the ADIRU manufacturer (Northrop Grumman Corporation), and the operator.



Dealing With Nuclear Waste

2 12 2011

The Independent reports today on a written statement by UK Energy Minister Hendry to Parliament on what the Government is deciding to do with its radioactive waste from nuclear power generation.

The British government has decided on a project to convert plutonium waste into MOX fuel, maybe for “a new generation of nuclear power plants”.

The decision, which ends decades of uncertainty on how to deal with a growing stockpile of more than 112 tonnes of plutonium waste, was presented as a written Parliamentary statement by the energy minister, Charles Hendry.

Indeed for half a century Britain, like many other countries with nuclear power plants, has not known what to do with nuclear power’s most toxic waste product.

Nuclear power relies on highly radioactive “fuel”, formed usually in the shape of rods, which engage in a chain reaction in the core of a nuclear reactor and produce heat. The chain reaction converts substances eventually into other substances which are no longer suitable for purpose; the fuel is “spent” and must be replaced. But the “spent fuel” remains highly radioactive. It is very toxic, must be carefully shielded from the environment and people, and this must go on with current spent fuel for (the most optimistic minimum estimate) 10,000 years (the level at which radioactivity has reduced to that of the originally-mined uranium and the original basis for US standards).

What do you do with it? Where do you put it?

It is not clear that anyone has come close to solving this problem. Nuclear power has been around for half a century, this waste has been accumulating, and the nation with the most plants, the US, has no solution. There are and have been many proposals, but so far none has turned out to be workable. Most of the spent fuel is still stored on-site in pools filled with water (water is pretty good at absorbing the radiation given off by spent fuel rods, and it also carries away their decay heat. You only need a few meters of it to reduce the radiation to near background levels). No one thinks that is a solution for more than a few decades, let alone a minimum of 10,000 years. There is a movement to store as much as possible in so-called “dry casks”: sealed physical containment vessels which are self-cooling after the spent fuel has been sitting around for some number of years. But you still have to put the casks somewhere where they will be safe for a minimum of 10,000 years. Yucca Mountain in Nevada was for many years the preferred prospective location. One wonders, however, about the stability of any structure in a seismically active area of recent volcanism. Eight volcanoes have erupted within 50km of the site in the last million years (op. cit.), but maybe it’s OK for 10,000 years? That is the main point: nobody really knows. No one with a decent set of choices could reasonably choose a place in a seismically and volcanically active area. That says, correctly in my view, that there is no decent set of choices. That is the way it has been for half a century.

It is a problem in Germany also. Germany processes spent fuel in France (and soon in GB) and transports the processed product in dry casks (called “Castor”) by rail back into Germany. The transport has been regularly plagued by protests which block the rail lines, and a transport typically takes days to weeks. Protesters used to aim for Germany’s withdrawal from nuclear power. Now that the German Government has committed to that, what is the latest protest (ongoing at time of writing) about? The protesters are apparently not content with the “temporary” storage site at Gorleben in Lower Saxony (it is in an underground salt deposit, which they claim with some reason is geologically unstable over the long term) and apparently want it to be stored at a reactor site at Philippsburg, near Karlsruhe. That is unlikely to be long term (in the sense of 10,000 years) either, since most authorities judge that any long-term site must be underground, in geologically stable ground. The storage issue has not been solved in Germany, either.

What about Britain? The Independent speaks of

……..decades of uncertainty on how to deal with a growing stockpile of more than 112 tonnes of plutonium waste, was presented as a written Parliamentary statement by the energy minister, Charles Hendry.
Plutonium waste has been a headache for successive governments because it is a highly dangerous radioactive material that can be converted into weapons-grade material, making it a security risk. It’s also expensive to store.

So Britain doesn’t have a long-term solution either. Who does? (Maybe France or Japan?) What to do with the waste is a major unsolved issue with nuclear power.

According to the Independent, the “uncertainty” has gone. It’s going to be converted into “mixed oxide” (MOX) fuel. Fuel? Yes, for reactors which have not yet been built. So you solve the waste problem by building new reactors – which, um, then don’t create waste? Of course they do. You are thus using the present waste in a process which will ultimately generate even more waste, as well of course as some electricity. So, problem solved? Obviously not.

Suppose one just wants to store MOX fuel, not use it. Is it, say, less toxic than spent fuel? No. Can it be stored more easily? Not as far as I know. Can it be used somehow? Yes, in those new nuclear power plants; we’ve just been down that route.

Does this solve the nuclear-waste-product problem in any reasonable way? No. Since the UK government is full of clever people who can think at least this far, it could be that there is another explanation for this decision.

One thought. Somebody will be paid £3bn for doing it, if it happens. Money goes somewhere, and I imagine the prospective recipients might be rather keen on their share. The new waste generated by the new reactors that use the MOX fuel that came from the old waste is, well, a problem for someone who comes along later. Science will solve everything, won’t it?

But it’s not going to be clear sailing. The Independent continues:

But although Mr Hendry made it clear that the Government sees the “Mox option” as a priority, it is not certain that a new £3bn plant to convert the plutonium into Mox fuel will ever be built.

Mindful of the financial and technological disaster of the current Mox fuel plant at Sellafield in Cumbria, which has cost £1.34bn and produced a tiny fraction of the fuel it was scheduled to make, Mr Hendry said that a clear case has still to be made for a second Mox plant at Sellafield.

Oh. So the first, smaller attempt to do this kind of thing failed?

Well, let me qualify that. £1.34bn went somewhere, somebody got it for doing something, so that all went OK. But it apparently didn’t go into the ostensible goal of processing X amount of plutonium into MOX.

And on the basis of that experience apparently the best option is to try again, more and bigger?

I am sure the mistakes made in building the first reprocessing plant will all have been cataloged. I am also sure that attempts will be made assiduously to avoid them when building the second, bigger plant. I have also studied troubled large projects, indeed giving evidence before a UK Parliamentary committee on one. Many big projects fail to deliver on the goals set at the time of commencement. Indeed, it’s a first for me to see someone suggest a larger second project on the back of a failed, smaller first one. Surely it should be received wisdom by now that any serious, careful estimate of the cost of such a second, bigger plant be accompanied by an equally serious, careful estimate of the likelihood of success or failure?

Given that this plan for apparently “dealing with” nuclear waste leaves all the questions open about how one ultimately deals with the waste, could something else be going on? What could it be?

First, contractors earn money for building the plant, whether it works or not, so they would be happy. Second, a current government can be seen to be “doing something” about the problem, no matter how superficial. Third, by processing and reusing fuel, the issue of what finally to do about the nuclear waste is put off into the future. (That strategy has clearly worked for governments in the past!)

Let us, though, be clear what the situation is. There is a real scientific and social problem of what on earth one can do with the highly toxic waste products of fission reactors. One cannot expect the current UK government, indeed any government at all, to implement a true solution when none is known yet to exist.

So maybe the Independent is being inappropriately forthright when it claims that uncertainty is at an end. Here is what Energy Minister Hendry actually wrote, as reported by the Independent:

“Only when the Government is confident that its preferred option could be implemented safely and securely, that is affordable, deliverable, and offers value for money, will it be in a position to proceed with a new Mox plant,” Mr Hendry said. In its response to a public consultation on Britain’s plutonium problem, the Government has not rejected other options. One is to convert the 112 tonnes of plutonium dioxide powder stored at Sellafield into glass or concrete blocks that could be buried permanently in a deep waste repository. Another is to use the plutonium directly as fuel for fast reactors, if these can be developed commercially in the coming decade.

“While converting the plutonium into Mox is the most credible and technologically mature option, the Government remains open to any alternative proposals for plutonium management that offer better value to the taxpayer, and will seek to gather more details on all options,” Mr Hendry said.

That seems less than certain to me. According to this, the UK government has set priorities on the “viable” options. It has not actually decided to do anything.

So am I (and the Independent) making a lot of fuss about not very much? Here’s a thought. We all agree that something does indeed need to be done about nuclear waste. Suppose somebody “does something”, what is it going to be? Well, it’s going to be starting to implement this “plan”, since, as the priority option, it is obviously the thing to pick if anything is to be done.

But options remain open. In case a detractor says “why on earth are you doing this? It makes no sense”, the Energy Minister can reply “only when we are confident, etc, etc, the Government remains open to any alternative proposals, etc.”

And when a sufficient amount of money has been spent, someone can say “oh look, we’ve got half a MOX plant! Well, better get on and finish it, then! Don’t like to waste money…”

Maybe it’s just the time of year. I haven’t hung my Christmas lights either. Or maybe the UK government has been reading its seasonal literature and the nuclear contractors hired a lobbyist name of Bob Cratchit.