The Accident to Qantas Flight 72, VH-QPA, in October 2008

21 12 2011

The Airbus A330-303 VH-QPA experienced uncommanded pitch-downs while in cruise at FL370. Many unrestrained occupants were thrown to the ceiling, and some were severely injured. The crew declared an emergency and landed as soon as practicable, at Learmonth, where the injured were treated and several hospitalised. It has been known for a while that the accident was caused by data anomalies from an air data inertial reference unit (ADIRU) which were not filtered out by the primary flight control computers (Flight Control Primary Computers, FCPC, also known as PRIM). However, it has been a mystery – and remains so – how the anomalous data values were generated. It has happened three times: twice with the unit on VH-QPA, and once on another unit on another aircraft, also Qantas, also in Western Australia, within a couple of months of this incident.

The fix is apparently to modify the BITE test of the ADIRU specifically to look for such anomalies, and to modify the data-filtering algorithms of the Flight Control Primary Computers (FCPC, also known as PRIM) of the A330.

The Final Report is now available on the ATSB WWW site.

There was a note from Andrew Heasley in Risks 26-67 with a title saying the accident was “Blamed on Software“, pointing to a newspaper article. I find this claim misleading. The problem which arose had nothing to do with anything for which any software engineer would have been responsible.

The fixes were implemented in both SW and HW, but fixes to non-SW problems are very often implemented in SW.

The PRIMs ran a data-assurance algorithm for data received from three different ADIRUs, which are electronic boxes built by a different manufacturer. This data assurance algorithm had a specific vulnerability to spiky angle-of-attack (AoA) data presented in a particular time-sequential manner, which was exploited during the occurrence. The algorithm, which uses AoA data from three ADIRUs, filters out multiple data spikes from a unit which occur within a specific time frame. Spikes on the culprit ADIRU occurred with similar values just over the boundary of this time frame, and were thus taken as veridical by the PRIMs. The resolution algorithms for the AoA data (with that from the other ADIRU units) in the PRIMs let these values through, and the PRIMs reacted accordingly by commanding sudden nose-down pitch.
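To make the shape of this vulnerability concrete, here is a toy sketch in Python. It is not the Airbus algorithm. The 1.2-second memorisation window and the averaging of AOA 1 and AOA 2 come from the executive summary quoted further down; the 0.1-second sample period, the 5-degree threshold, the comparison against the median, and above all the rule that the first sample after the window is accepted unchecked are my assumptions, chosen only to show how a second spike arriving just as the window expires could get through.

```python
from statistics import median

HOLD_STEPS = 12       # 1.2 s memorisation window at an assumed 0.1 s sample period
THRESHOLD_DEG = 5.0   # hypothetical deviation threshold, not taken from the report


def filter_aoa(samples, hold_steps=HOLD_STEPS, threshold=THRESHOLD_DEG):
    """Return the AOA value a toy flight computer would use at each time step.

    `samples` is a sequence of (aoa1, aoa2, aoa3) readings in degrees.
    """
    used = []
    last_good = None        # most recent value accepted as consistent
    hold_left = 0           # remaining steps of the current memorisation window
    just_released = False   # True for the single step right after a window ends
    for a1, a2, a3 in samples:
        mid = median((a1, a2, a3))
        deviates = abs(a1 - mid) > threshold or abs(a2 - mid) > threshold
        if hold_left > 0:
            # Inside the window: keep using the memorised value.
            hold_left -= 1
            just_released = hold_left == 0
            used.append(last_good)
        elif deviates and last_good is not None and not just_released:
            # Spike detected: start memorising.
            hold_left = hold_steps - 1
            used.append(last_good)
        else:
            # Normal path, and (ASSUMED flaw) the first sample after a window,
            # which is accepted without a consistency check.
            value = (a1 + a2) / 2
            if not deviates:
                last_good = value
            used.append(value)
            just_released = False
    return used


def with_spikes(n, spike_steps, normal=2.0, spike=50.0):
    """Build n readings in which ADIRU 1 spikes at the given step indices."""
    return [(spike if i in spike_steps else normal, normal, normal) for i in range(n)]


single = filter_aoa(with_spikes(30, {5}))        # one isolated spike
inside = filter_aoa(with_spikes(30, {5, 10}))    # second spike inside the window
boundary = filter_aoa(with_spikes(30, {5, 17}))  # second spike 1.2 s after the first
print(max(single), max(inside), max(boundary))
```

In this sketch the isolated spike and the spike inside the window are both masked, while the spike that arrives exactly as the window expires is averaged into the value used. Whatever the real mechanism, the point stands: the filter is defined by its timing assumptions, and data that violate those assumptions pass as valid.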

Responsibility for the design of such algorithms lies clearly with those who are experts on the engineering of electronic data generation and transmission equipment, not with any software engineers.

To give a similar example with which I have recently been involved, it turns out that signals of certain frequencies in AC electric circuits can bypass the Type A and Type B circuit protection equipment (circuit breakers) that are required in most electric circuits (household and industrial) in Germany. A committee on which I sit has recently considered attaching equipment which is, as far as we know, theoretically capable of generating such frequencies to such circuits. A similar situation, how to handle anomalous signals, but no SW in sight. Pure electrical engineering.

Concerning my earlier note here on Certification Requirements for Commercial Airplanes, I find it interesting and commendable that the Bureau considered likelihoods of events in their summary (quoted below). However, I don’t believe they formulated it in quite the words I would have liked to have read.

They give reason to classify the event as “hazardous”, and with a fleet operating experience of 28 million flight hours this occurrence fits within the expected value (a technical term) of the operating time within which the effects of a hazardous event may occur (defined to be less than or equal to one occurrence within ten million operating hours), according to the acceptable means to determine compliance with certification criteria (now known as AMC 25). Notice it is not the event itself of which they assess the occurrence – that has occurred three times – but the deleterious effects upon safety of the event, which have only occurred once.
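The arithmetic behind "fits within the expected value" can be sketched in a few lines of Python. The rate bound and the fleet hours are the figures quoted above; treating occurrences as a Poisson process with a constant rate is my simplifying assumption, not something the Bureau states.

```python
import math

RATE_BOUND_PER_HOUR = 1e-7   # hazardous effect: at most once per ten million hours
FLEET_HOURS = 28e6           # A330/A340 fleet experience quoted by the ATSB


def expected_occurrences(rate_per_hour, hours):
    """Expected number of occurrences at a constant rate."""
    return rate_per_hour * hours


def prob_at_least_one(rate_per_hour, hours):
    """P(N >= 1), assuming occurrences follow a Poisson process."""
    return 1.0 - math.exp(-expected_occurrences(rate_per_hour, hours))


print(round(expected_occurrences(RATE_BOUND_PER_HOUR, FLEET_HOURS), 2))
print(round(prob_at_least_one(RATE_BOUND_PER_HOUR, FLEET_HOURS), 3))
```

At the upper bound of the acceptable range one would expect just under three hazardous effects in 28 million hours, and under the Poisson assumption at least one with probability of about 0.94. A single occurrence is thus entirely compatible with the bound.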

They speak of “certification requirements“. Strictly speaking, this is incorrect. The certification requirements are expressed in CS 25 and do not involve probabilities. The severity classification terms “catastrophic”, “hazardous” etc. and their associated acceptable/unacceptable frequencies occur in risk-matrix-type form in the Acceptable Means of Compliance document which accompanies the certification requirements (AMC 25), not in the requirements themselves. (I note that these documents were called something slightly different at A330 certification time, 1993.)

The certification requirements themselves are quite clear: the airplane shall behave in such-and-such a manner. If a wing falls off, or a flight control computer sends it into a loop, it is obviously not behaving in that manner, and is thus violating certification requirements. However, it is accepted that one cannot provide proof that such untoward things will never ever happen (will the sun rise tomorrow? Will your steering wheel come off in your hands? Will your control sidestick come out of its holder in your hand?), so a less strenuous regime based on arguing likelihoods is defined as an “Acceptable Means of Compliance” with the regulations for purpose of certification.

This is not hair-splitting. It has consequences, in particular in this case, for how anomalies are dealt with, as follows.

If the requirement were that, say, “hazardous effects shall only occur on average once in between 10^7 and 10^9 operating hours“, which is what the AMC says you have to show to demonstrate compliance acceptably, then it would have been open to the manufacturer to do nothing in reaction to the QF72 event: the hazardous effects occurred only within the expected time value of their occurrence. If you think about it, it would also be open to a manufacturer to do nothing until the second occurrence of any hazardous or indeed catastrophic effects, even if the problem occurred first within the early experience of flying the aircraft! This is simply a consequence of the meaning of the probabilistic concepts used.

Whereas, as things now stand, separating requirements, which are absolute, from acceptable compliance (which may be based on occurrence frequency) any in-flight anomalous behavior must be fixed or the airworthiness certificate will be withdrawn. This is because such behavior violates the written requirements, that the aircraft shall not behave that way. To repeat, the conditions on behavior are absolute, not likelihood-based.

And that is how one wants things: The requirements are absolute, but it is accepted that in science and engineering you are often only convinced to some degree, so it is regarded as acceptable to argue your conviction up to a certain degree, and not to have to prove it, which would likely be impossible. But if something does go wrong, you want it fixed right away.

One can argue that any given set of occurrences is compatible with any probability requirement whatever, and thus that probabilistic requirements are inappropriate to determine airworthiness in any case. However, I don’t think such an argument works. Say these three events had occurred within 3 million operating hours, each with damage. One could estimate the likelihood that a piece of equipment fulfilling the condition of an expected value of at most once in 10 million operating hours would exhibit three events within 3 million operating hours. One would conclude that it is unlikely, say with small probability P. It follows that the situation that the aircraft fulfills the acceptable-compliance criterion has the same probability P. The small probability P that the aircraft acceptably complied with certification requirements would provide good reason for withdrawing the airworthiness certificate.

Concerning the data anomaly itself stemming from the ADIRU, its cause remains a mystery. The report says:


Some of the potential triggering events examined by the investigation included a software ‘bug’, software corruption, a hardware fault, physical environment factors (such as temperature or vibration), and electromagnetic interference (EMI) from other aircraft systems, other on-board sources, or external sources (such as a naval communication station located near Learmonth). Each of these possibilities was found to be unlikely based on multiple sources of evidence. The other potential triggering event was a single event effect (SEE) resulting from a high-energy atmospheric particle striking one of the integrated circuits within the CPU module. There was insufficient evidence available to determine if an SEE was involved, but the investigation identified SEE as an ongoing risk for airborne equipment.

The report says that the manufacturer is developing a modification to the BITE to detect such failure modes:


Without knowing the exact failure mechanism, there was limited potential for the ADIRU manufacturer to redesign units to prevent the failure mode. However, it will develop a modification to the BITE to improve the probability of detecting the failure mode if it occurs on another unit.

Here is the executive summary. It is well and concisely written. I include the three paragraphs about seat belts and the investigative process for completeness.

Executive Summary

At 0132 Universal Time Coordinated (0932 local time) on 7 October 2008, an Airbus A330-303 aircraft, registered VH-QPA and operated as Qantas flight 72, departed Singapore on a scheduled passenger transport service to Perth, Western Australia. At 0440:26, while the aircraft was in cruise at 37,000 ft, ADIRU 1 started providing intermittent, incorrect values (spikes) on all flight parameters to other aircraft systems. Soon after, the autopilot disconnected and the crew started receiving numerous warning and caution messages (most of them spurious). The other two ADIRUs performed normally during the flight.

At 0442:27, the aircraft suddenly pitched nose down. The FCPCs commanded the pitch-down in response to AOA data spikes from ADIRU 1. Although the pitch-down command lasted less than 2 seconds, the resulting forces were sufficient for almost all the unrestrained occupants to be thrown to the aircraft’s ceiling. At least 110 of the 303 passengers and nine of the 12 crew members were injured; 12 of the occupants were seriously injured and another 39 received hospital medical treatment. The FCPCs commanded a second, less severe pitch-down at 0445:08.

The flight crew’s responses to the emergency were timely and appropriate. Due to the serious injuries and their assessment that there was potential for further pitch-downs, the crew diverted the flight to Learmonth, Western Australia and declared a MAYDAY to air traffic control. The aircraft landed as soon as operationally practicable at 0532, and medical assistance was provided to the injured occupants soon after.

FCPC design limitation

AOA is a critically important flight parameter, and full-authority flight control systems such as those equipping A330/A340 aircraft require accurate AOA data to function properly. The aircraft was fitted with three ADIRUs to provide redundancy and enable fault tolerance, and the FCPCs used the three independent AOA values to check their consistency. In the usual case, when all three AOA values were valid and consistent, the average value of AOA 1 and AOA 2 was used by the FCPCs for their computations. If either AOA 1 or AOA 2 significantly deviated from the other two values, the FCPCs used a memorised value for 1.2 seconds. The FCPC algorithm was very effective, but it could not correctly manage a scenario where there were multiple spikes in either AOA 1 or AOA 2 that were 1.2 seconds apart.

Although there were many injuries on the 7 October 2008 flight, it is very unlikely that the FCPC design limitation could have been associated with a more adverse outcome. Accordingly, the occurrence fitted the classification of a ‘hazardous’ effect rather than a ‘catastrophic’ effect as described by the relevant certification requirements. As the occurrence was the only known case of the design limitation affecting an aircraft’s flightpath in over 28 million flight hours on A330/A340 aircraft, the limitation was within the acceptable probability range defined in the certification requirements for a hazardous effect.

As with other safety-critical systems, the development of the A330/A340 flight control system during 1991 and 1992 had many elements to minimise the risk of a design error. These included peer reviews, a system safety assessment (SSA), and testing and simulations to verify and validate the system requirements. None of these activities identified the design limitation in the FCPC’s AOA algorithm.

The ADIRU failure mode had not been previously encountered, or identified by the ADIRU manufacturer in its safety analysis activities. Overall, the design, verification and validation processes used by the aircraft manufacturer did not fully consider the potential effects of frequent spikes in data from an ADIRU.

ADIRU data-spike failure mode

The data-spike failure mode on the LTN-101 model ADIRU involved intermittent spikes (incorrect values) on air data parameters such as airspeed and AOA being sent to other systems as valid data without a relevant fault message being displayed to the crew. The inertial reference parameters (such as pitch attitude) contained more systematic errors as well as data spikes, and the ADIRU generated a fault message and flagged the output data as invalid. Once the failure mode started, the ADIRU’s abnormal behaviour continued until the unit was shut down. After its power was cycled (turned OFF and ON), the unit performed normally.

There were three known occurrences of the data-spike failure mode. In addition to the 7 October 2008 occurrence, there was an occurrence on 12 September 2006 involving the same ADIRU (serial number 4167) and the same aircraft. The other occurrence on 27 December 2008 involved another of the same operator’s A330 aircraft (VH-QPG) but a different ADIRU (serial number 4122). However, no factors related to the operator’s aircraft configuration, operating practices or maintenance practices were found to be associated with the failure mode.

Many of the data spikes were generated when the ADIRU’s central processor unit (CPU) module intermittently combined the data value from one parameter with the label for another parameter. The exact mechanism that produced this problem could not be determined. However, the failure mode was probably initiated by a single, rare type of trigger event combined with a marginal susceptibility to that type of event within the CPU module’s hardware. The key components of the two affected units were very similar, and overall it was considered likely that only a small number of units exhibited a similar susceptibility.

Some of the potential triggering events examined by the investigation included a software ‘bug’, software corruption, a hardware fault, physical environment factors (such as temperature or vibration), and electromagnetic interference (EMI) from other aircraft systems, other on-board sources, or external sources (such as a naval communication station located near Learmonth). Each of these possibilities was found to be unlikely based on multiple sources of evidence. The other potential triggering event was a single event effect (SEE) resulting from a high-energy atmospheric particle striking one of the integrated circuits within the CPU module. There was insufficient evidence available to determine if an SEE was involved, but the investigation identified SEE as an ongoing risk for airborne equipment.

The LTN-101 had built-in test equipment (BITE) to detect almost all potential problems that could occur with the ADIRU, including potential failure modes identified by the aircraft manufacturer. However, none of the BITE tests were designed to detect the type of problem that occurred with the air data parameters.

The failure mode has only been observed three times in over 128 million hours of unit operation, and the unit met the aircraft manufacturer’s specifications for reliability and undetected failure rates. Without knowing the exact failure mechanism, there was limited potential for the ADIRU manufacturer to redesign units to prevent the failure mode. However, it will develop a modification to the BITE to improve the probability of detecting the failure mode if it occurs on another unit.

Use of seat belts

At least 60 of the aircraft’s passengers were seated without their seat belts fastened at the time of the first pitch-down. Consistent with previous in-flight upset accidents, the injury rate, and injury severity, was substantially greater for those who were not seated or seated without their seat belts fastened.

Passengers are routinely reminded every flight to keep their seat belts fastened during flight whenever they are seated, but it appears some passengers routinely do not follow this advice. This investigation provided some insights into the types of passengers who may be more likely not to wear seat belts, but it also identified that there has been very little research conducted into this topic by the aviation industry.

Investigation process

The Australian Transport Safety Bureau investigation covered a range of complex issues, including some that had rarely been considered in depth by previous aviation investigations. To do this, the investigation required the expertise and cooperation of several external organisations, including the French Bureau d’Enquêtes et d’Analyses pour la sécurité de l’aviation civile, US National Transportation Safety Board, the aircraft and FCPC manufacturer (Airbus), the ADIRU manufacturer (Northrop Grumman Corporation), and the operator.



Concorde, Ten Years On, Part 2

9 12 2010

The Concorde accident to F-BTSC on 25 July 2000 is about as well understood as to causes as any accident can be. There is also, unusually, a more or less linear connection of causes from an exceptionally rare event: the deposition of a particularly hard and sharp strip of metal, which shouldn’t have been mounted in the first place exactly because of such possibilities, on exactly the part of the runway at which Concorde’s tires bear the greatest load – and the aircraft indeed running over it, and it’s not a big strip. The Concorde’s ground run goes up to just about 200 kts at rotation, I understand, compared with that of a Boeing 747 at about 160 kts. Furthermore, the delta wing generates some negative load, putting even more weight on the tires, at rotation, before it changes to positive and the aircraft lifts off. The sequence of events that then ensued was, as far as I know, not anticipated by anyone in the development or certification or analysis of the aircraft. To my mind, it is hard to see how it could have been. To me, this is a freak accident, the «not expected to occur during the operational lifetime of the aircraft», which is the strictest category of likelihood contemplated in civil aeronautical certification.

But some differ, for example Tom Ferrell in this note to the York Safety-Critical Systems Mailing List. Tom thinks the accident had precursors, which showed, in advance of the accident, that

Regardless of causal agent, the Concorde was susceptible to severe damage from a relatively common occurrence.

He means there had been tire burst incidents, which indicated problems with the design. So is this just a matter of personal taste, say, like wine? Ladkin tastes “freak” and Ferrell tastes “foreseeable” in the same glass, and that’s it? Or is there, as I would prefer to believe, an objective way of evaluating the views, such that one can be shown to be right (or more accurate) and the other wrong (or misleading) in some way?

I think it is partly a matter of what you lump together, and what you don’t. Do you lump together all tire bursts, including this one, and all damage, including this damage, or don’t you? Is this lumping arbitrary, a matter of individual perception? I don’t think so. I think there are objective principles, on which so far I have only an intuitive handle.

How to indicate these principles? I try to show them here by means of a hypothetical cross-examination of Ferrell’s claim. Here goes.

M’lud, regardless of causal agent, the Concorde aircraft was susceptible to severe damage from a relatively common occurrence.

I see, thank you, counsel. What was that common occurrence?

A burst tire, m’lud.

Thank you, counsel. And what was that severe damage?

A 32cm square hole in the lower wing skin, m’lud, which also served as the fuel tank skin.

I see. Had that ever happened before in the history of the airplane?

No, m’lud.

You say “susceptible”. Had damage ever occurred to the lower wing skin, except in this case?

Six times, m’lud.

And how many times was that due to your “common occurrence“, a tire burst?

The lower wing skin was punctured on five occasions when a tire burst, m’lud.

But that is not what I asked you, counsel. I asked you in which of these events the damage to the lower wing skin was due to the tire burst.

It is supposed, three times, m’lud.

You say “supposed“, counsel. Why so?

As far as we know, in those cases, m’lud, the damage sequence was causally initiated by a tire burst. It is conceivable, although very unlikely, that a contemporary but independent damaging event caused the lower-wing-skin penetration, but there was no evidence for that.

I see. Thank you for your care in phrasing this, counsel. And what were the two other events?

In one, on 29 January 1988, the tie bolts holding the two wheel halves together sheared, and in the resulting sequence one of the bolts penetrated the Number 7 tank, leaving a half-inch hole. In the other, on 15 July 1993, there was a braking-system jam, and the Number 8 tank was punctured as a result of the damage sequence.

So, if I understand you, counsel, you tell me that, before the fatal accident at Gonesse, three times it had occurred that the lower wing skin was punctured due to your “common occurrence“, a tire burst.

Yes, m’lud.

And how many years did the Concorde fly in service before the Gonesse accident?

Just over 24 years, m’lud. The first revenue flight was 21 January, 1976.

And how many flight cycles?

About 84,000, m’lud.

I see. That is quite a long time. And, to me, quite a large number of flights, although of course by no means so large as with most aircraft in commercial use nowadays. So are those three occasions a lot or a little, counsel?

With respect, m’lud, I offer no opinion on that question.

So there are these “common occurrences“, which had occurred – how many times, counsel?

Aviation Safety Network has a record of 55 occasions after service introduction in which tires burst, m’lud.

Common enough, I suppose. And these common occurrences had caused damage other than to the tire on – how many occasions, counsel?

Aviation Safety Network has a record of 28 occasions on which other damage occurred, m’lud.

Does that include the two above in which the damage was not initiated by a tire burst, counsel?

Yes, m’lud.

So there were 26 occasions on which, as far as we know, a tire burst initiated damage to other parts of the aircraft?

Yes, m’lud.

So I think you have established, counsel, that a common occurrence, a tire burst, could cause damage, and thus that the aircraft was susceptible to damage from this common occurrence. But you want to establish more than that, don’t you, counsel. You wish to say that the aircraft was susceptible to severe damage.

Yes, m’lud.

Is “severe damage” a technical term used in aviation, counsel?

No, m’lud.

So it is your term, counsel. What do you mean by it?

I mean that the safety of the flight is affected by the damage, m’lud.

Thank you, counsel. Is there any similar term used in aviation?

The U.S. National Transportation Safety Board Part 830 defines an “incident” to be an occurrence other than an accident, associated with the operation of an aircraft which affects or could affect the safety of operations. The same regulation defines an “accident” to be an occurrence [associated with the operation of an aircraft] in which any person suffers death or serious injury, or in which the aircraft receives substantial damage.

I see. Is there a definition of “substantial damage“, counsel?

Yes, m’lud. “…..damage or failure which adversely affects the structural strength, performance, or flight characteristics of the aircraft, and which would normally require major repair or replacement of the affected component. Engine failure or damage limited to an engine if only one engine fails or is damaged, bent fairings or cowling, dented skin, small punctured holes in the skin or fabric, ground damage to rotor or propeller blades, and damage to landing gear, wheels, tires, flaps, engine accessories, brakes, or wingtips are not considered ‘substantial damage’ for the purpose of this part.” This definition is similar to other definitions of significant damage, used in definitions of accidents and incidents in, say, the International Civil Aviation Organisation Annex 13, which defines reporting requirements for its member states.

Thank you, counsel. And in which of those 26 tire-burst incidents you enumerated above was “substantial damage“, according to this definition, incurred?

In the incident at Washington Dulles airport on 14 June 1979, m’lud. The performance of the aircraft was affected in that fuel was lost through the debris penetrations of the tank at a rate of up to 4 kg per second. It was unable to continue its flight to London. The aircraft lost 7 tonnes of fuel before it landed again at Washington Dulles.

And in others, counsel?

In no others, according to the definition, m’lud.

I see. Are there incidents in which a fuel tank was penetrated, in which the performance of the aircraft, its structural strength, or its flight characteristics were not substantially affected?

Yes, m’lud. On 29 January 1988, the incident in which the wheel-half tie-bolts broke and a bolt punctured the tank on take-off from London, the flight continued to its destination, New York.

I see. How large was this puncture?

The hole was half an inch, so about 1.3 cm, in diameter, m’lud.

So it appears that a puncture in a fuel tank, even a fairly large hole, does not necessarily count as “substantial damage“?

No, m’lud, it does not necessarily count so.

Are there any other common technical meanings of “severe damage” or “significant damage” which we might want to consider, counsel?

I think so, m’lud. For example, damage which could affect the safety of flight, the definition I suggested.

“Could affect“, counsel, or “does affect“? For example, during the 29 January 1988 incident, was the safety of the flight affected?

Apparently no, m’lud.

Was the safety of flight affected in any of the other tank-penetration incidents besides the 14 June 1979 incident at Washington Dulles?

I don’t believe so, m’lud.

Could it have been?

I believe so, m’lud.

How?

Maybe fuel streaming from a hole can catch fire when it meets engine exhaust, m’lud.

I see. Does it commonly do so, counsel? Do you know of any other incident in commercial aviation when fuel streaming from a smallish hole, such as this, caught fire?

Actually, m’lud, I don’t.

Are there any other ways in which safety of the flight could be affected by such a leak?

When the aircraft lands, m’lud, the brakes heat up, and leaking fuel could fall onto hot brakes and catch fire.

Has this happened, counsel?

Yes, m’lud.

Are there ways to prevent it happening?

Yes, m’lud. If a crew knows they have a leak – and if the leak is substantial you can usually see the stream behind the wing from the rear passenger seats during flight – then they can have fire services meet the plane on landing and cover the brakes and ground under the leak with fire-suppressant foam. This mostly suffices.

Thank you, counsel. So igniting this fuel is an event for which there exist known and effective countermeasures.

Yes, m’lud.

So although such an event “could affect” the safety of flight, it mostly doesn’t do so.

It appears not, m’lud.

So it appears that penetrations of the fuel tank in themselves do not count as “substantial damage“, and they do not necessarily count as damage which affects the safety of the flight. But they might count as events which could affect the safety of flight if we are sufficiently imaginative in devising scenarios.

It seems so, m’lud.

Let us see how imaginative I may be. As far as I understand quantum mechanics, atomic particles may engage in random motion, that is, displacement of position without apparent cause.

As far as I also understand quantum mechanics, m’lud, that is so.

So it could be, counsel, that all the atomic particles in a Concorde translate 4 meters to the left all at the same time, leaving the passengers sitting, well, somewhere in space outside the fuselage.

I suppose it could be, m’lud.

And those passengers would probably fall to the ground and injure themselves or die.

I suppose so, m’lud.

So it could be, counsel, that the Concorde, indeed any aircraft, suddenly leaves its passengers sitting outside the airframe, leading to serious injury or death.

I suppose so, m’lud.

I am, counsel, as you see, sufficiently imaginative in devising scenarios. You have presented me with two partially overlapping definitions of significant damage, of which the second is indeterminate between “could be” and “is“. I don’t find the “could be” interpretation very helpful, as you see, because I am, as you also see, sufficiently imaginative. And I don’t think any objective safety property of a commercial airplane should depend so heavily on my sufficient imagination. So I am going to interpret “severe damage” as meaning damage which is either substantial in the sense of NTSB rule 830 or which does (not “could” but “does“) affect the safety of flight.

Yes, m’lud.

On which occasions, then, did your “common occurrence“, a tire burst, initiate a causal sequence in which severe damage resulted?

On 14 June 1979 at Washington Dulles, m’lud, and on 25 July 2000 resulting in the crash in Gonesse.

The damage which resulted in the Gonesse crash was then, by definition, substantial, as well as severe, wasn’t it, counsel.

Yes, m’lud.

So, since this severe damage actually happened on that occasion, we can say that, even before this occurred, the aircraft was susceptible to exactly this severe damage, in the sense that, since it did happen, it follows that the aircraft was susceptible to its happening, simply through the usual meaning of the word “susceptible“.

Yes, m’lud, that is what I claim.

Let’s look a little closer at this word “susceptible“. There are some people who claim that human beings spontaneously ignite. Not often, but occasionally. All that is left is ashes. If that is true, and I believe that this is a very, very big “if“, then human beings are “susceptible to spontaneous combustion” aren’t they, counsel?

Yes, m’lud. But I share your scepticism of the phenomenon.

The point, counsel, is this. We know whether or not human beings are susceptible to spontaneous combustion only in so far as we know actual examples of human beings spontaneously combusting.

It seems so, m’lud.

And, further, let us suppose that there are certain circumstances C in which human beings spontaneously combust, and if those circumstances do not obtain, then they don’t. Then, surely, we are obliged, by virtue of not wishing to mislead our fellow men and women, to say that human beings are susceptible to spontaneous combustion in circumstances C and to indicate that, if circumstances C do not obtain, there is nothing to worry about.

That seems to me reasonable, m’lud.

And I take it that you do not wish to mislead me, counsel!

Certainly not, m’lud!

Then when you claim that the Concorde is “susceptible to severe damage [resulting from] a common occurrence“, which we have more or less agreed is a phrase which may be able to describe the Concorde aircraft, I now want to know if there are any circumstances C which you should be telling me about, under which severe damage resulting from your common occurrence, a burst tire, may be realised. Please note the condition: you are to tell me about circumstances C in which, if they obtain, the accident sequence results, and for which, if circumstances C do not obtain, the accident sequence does not result.

Yes, m’lud. The accident sequence is as follows. A titanium strip lay edge-on on the runway at or near the rotation point of the Concorde. It cut sharply into a tire, causing a tire burst resulting in at least two chunks of tire of size approximately 4.5 kg. It is presumed that one of these expelled chunks impacted the lower wing skin, the skin of one of the fuel tanks, causing a shock wave which blew out the fuel tank skin from inside, near the impact point of the tire chunk, resulting in a hole of size about 32 cm square being formed in the fuel tank, and of course fuel streaming out. The fuel ignited, and burned from very near the fuel tank hole, causing varying loss of thrust in two engines, as a result of which the aircraft was unable to attain positive-rate-of-climb flying speed, and was also subject to thermal damage from the fire under the left wing. As the damage progressed, control was lost and the aircraft crashed in Gonesse.

Thank you, counsel. How do we know that the loss of thrust rendered the aircraft unable to attain the appropriate flying speed?

That is elementary aerodynamics of Concorde, m’lud, and is not disputed.

Thank you. How do we know that the fire caused loss of thrust?

Calculations show that, if air is ingested into the engine intakes at a temperature approximating that of the burning fuel, thrust is lost at more or less the observed and recorded rate.

Thank you. How do we know the fire was present at exactly the unfortunate point to be ingested?

Photographs of the accident, m’lud.

Thank you.

How did the fire attain the state in which it was photographed?

We don’t know, m’lud. We would expect a fuel fire to start when it has been ignited by hot gases from the turbine engines, behind the engines, which of course were in reheat at the time. As far as anyone knows, the front of such a flame cannot travel relatively forward at the speed at which the aircraft was travelling.

So we would expect the fire to remain behind the wing structure, behind the engine exhaust?

Indeed so, m’lud.

But this fire didn’t. Its front came forward underneath the wing, and you have indicated we do not know why.

That is correct, m’lud. There is speculation that it might have been ignited by an electrical spark from some wiring in the undercarriage bay.

Do we know that, counsel?

No, we do not know, m’lud. It is speculation, because we cannot otherwise understand how the flame front came forward under the wing.

Thank you. So we basically do not know the causal sequence between fuel released from the tank and the engines consequently operating at reduced power.

It seems so, m’lud. We do not know why the flame front moved forward.

But of course there would have been no flame front had fuel not been streaming out of a hole.

Indeed so, m’lud.

How big was the hole?

About 32 cm square, m’lud.

That is a big hole! Did such holes occur during any other of your “common events“, counsel?

No, m’lud. The largest was 1 inch x 1.5 inches, caused by metal debris on 15 November 1985. The second largest was the hole of size 0.5 inch diameter on 20 January 1988, which event we have already mentioned.

So this hole, in the Gonesse accident, was 160 times larger than the largest hole which had previously been caused, and 790 times larger than the second-largest hole which had previously been caused. That is an enormous difference, counsel! Why is that?

The hole in the Gonesse accident, m’lud, was not caused through tank penetration by debris, but through shock-wave convergence punching a hole through the tank skin from inside.

That is, if I understand you, counsel, a much larger hole, two to maybe three orders of magnitude larger than any that had previously occurred, made by a completely different mechanism.

That appears to be so, m’lud.

And are such kinds of events, reminding me of your phrasing “common occurrence“, counsel, common in commercial aviation?

No, m’lud. This occurrence of the phenomenon is unique in the history of civil aviation as far as we know.

Thank you for your frank answer, counsel. But people knew about this phenomenon, did they?

Military engineers knew of the phenomenon from battle-damage studies, m’lud. It is not clear if any engineer in civil aviation, if anyone involved with civil aviation, knew of this phenomenon before the Gonesse accident. After the accident, military engineers informed the accident investigators of what they knew.

And what of the tire pieces that caused this phenomenon?

It was shown by experiment, m’lud, that a piece of rubber weighing about 4.8 kg and travelling at a relative speed of about 120 m/s, that is something under 270 mph, which could in theory occur due to a Concorde tire bursting at the point in the take-off sequence at which it did, could trigger the shock-wave phenomenon with a proportionate loss of tank skin.

And you have said that two chunks of tire of about that size were found amongst the runway debris.

That is correct, m’lud.

Could any other phenomenon of which we know, say a tank penetration by debris consequent to a “common” tire burst (I use your phrasing), cause the release of a 32 cm square piece of the fuel tank wall?

Not that we know of, m’lud, no.

So we have only one explanation to hand of the known size of the hole?

That is correct, m’lud.

And this explanation, this phenomenon, is otherwise unknown in the history of civil aviation.

That seems to be so, m’lud.

Thank you. If I understand you, this phenomenon was triggered by the impact of a chunk of tire of about 4.5 kg or so?

As far as we can tell, m’lud.

And chunks of this size, of tire pieces or indeed of other material, are frequent, or usual, during your “common occurrences“, tire bursts?

Actually no, m’lud, they are not.

I see. Have they otherwise occurred in any of the tire-burst events, counsel?

Actually, m’lud, they have not.

That is, they are unique to the Gonesse accident?

It appears so, m’lud.

How did they occur?

The tire was apparently cut by a titanium strip lying on the runway near the rotation point of the aircraft, m’lud.

I see. Pieces of metal left lying on the runway cut Concorde tires into 4.5 kg chunks, apparently?

Not any pieces of metal, m’lud, according to experiments undertaken after the accident. Titanium. Titanium is unusually hard. Other metals just crush when the tires run over them.

I see. But titanium strips are to be found lying on runways every so often, I take it?

Actually no, m’lud. This is the only recorded instance ever of a sharp titanium foreign object lying on a runway with commercial operations. Of course we don’t know about the military, since they do not share their records.

Why would that be, counsel, that this is the only instance?

One reason, m’lud, might be poor record-keeping. Another reason might be that titanium is not used on aircraft in places in which it might fall off on a runway.

Oh! So why did it happen here, counsel?

A mistake, m’lud.

I imagine a very, very rare mistake, counsel?

Yes, m’lud. As I mentioned, this is the only recorded instance of such debris lying on a runway at a commercial airport.

So, if I understand you, counsel, the shock-wave phenomenon can only happen via large chunks of debris, and the only way in which large chunks of debris from a burst tire have been known to occur is in this very accident, through cutting by a titanium strip, debris of which there has not been another recorded instance, in part because the use of titanium in a way in which it might separate from the aircraft during take off or landing is proscribed?

That seems to be so, m’lud.

So the circumstances C in which your “common occurrence” can lead to “severe damage” are, as far as we can tell:
(a) a flame front in the streaming fuel from an unusually large hole “moving forward” through an unknown mechanism to burn under the wing, in front of the engine intakes;
(b) a fluid shock wave punching out the unusually large portion of the fuel tank wall to create the hole;
(c) an unusually large chunk of debris creating the shock-wave sufficient to punch out the unusually large hole;
(d) this unusually large chunk indeed impacting the fuel tank wall, rather than being ejected in another direction;
(e) a titanium metal strip lying on the runway near enough to the rotation point to cut a tire which happens to run over it into suitably, unusually large chunks.

That seems to be so, m’lud.

I conclude, counsel, in the words of your claim, that the Concorde aircraft is susceptible to severe damage resulting from a common occurrence (tire burst) under the circumstances (a)-(e) just elaborated. And that, in order not to mislead us, your claim should include the supplementary phrase “under the circumstances (a)-(e) just elaborated“. If you agree to include that wording, counsel, I shall grant your claim. If not, I shall reject it. Accordingly, do you wish to remain with your original wording, or to amend it?

Counsel’s reply is not recorded because the recorder had used up its batteries.



Concorde, Ten Years On

6 12 2010

I understand that Simon Foreman observed at a meeting of the RAeS Law Group on 28 April this year on the criminalisation of aviation accidents, reported here in Flight International by David Learmount, that the French legal system does not have a mechanism of the English legal system, the inquest, to determine what went on in an accident. It seems to follow that, in France, for the state to determine what indeed went on in an incident of public interest, there must be a criminal trial.

First point: there are at least two reasons for society to determine what went on. The first is to prevent a recurrence. This is the reason for the ICAO-mandated accident investigation bodies, here the BEA. They have long done their job.

The second reason is to apportion responsibility for compensation, an age-old and widespread human activity. Concerning this second reason, it’s a shame for all that France doesn’t have inquests. I imagine many French people might agree. It is particularly harsh for one person by the name of John Taylor.

Second: why an inquest?

Amongst other things, the results of an inquest help figure out who should ultimately pay. There is an ancient general principle of compensating victims of mishaps and this should not only follow rules but also be seen to be “fair”, adjudicating amongst competing claims, and that is what an inquest does.

Some commentators, including the BBC in their report, have spoken of “gaining closure” for the victims’ families. This notion is a US import to Europe and not one with which I sympathise, nor did I when I was living in the US. I don’t sympathise with it, in part because it gives cover to seeking revenge, an activity of which I expressly do not approve in the case of accidents.

In particular, an inquest is not a criminal trial. It doesn’t punish anyone. It assigns cause.

Third: some have speculated, as in this note on PPRuNe, that this will be a bonanza for tort lawyers.

If this follows the time scale of most major commercial airline accidents, seeking compensation for victims’ families will be mostly over by now. The airline (that is, the airline’s insurance company) will have already paid to settle most or all tort claims, as is by now the general practice in commercial aviation. The cost is reported to be in the realm of €100 million.

Fourth, the ruling is reported to contain the following apportionment: Continental 70%, EADS 30%, everyone else (Air France, DGAC, Paris Airports Authority, etc) 0%. That means that the insurance company will be negotiating with those parties to recover the relevant proportion of its costs. Since there is now a legal ruling which will act as precedent, there would be little point in disputing it in court.

So that will settle the compensation bit.

Fifth, what is this ruling based on?

The ruling is based on the obvious physical ABC of the accident occurrence.

The report said: titanium strip fell off Continental onto the runway; Concorde ran over strip; strip sliced into tire and caused tire burst of unprecedented form and strength; large tire fragment hit tank; impact shock wave caused tank to explode from within; resulting hole allowed fuel to stream out in large quantity; fuel was ignited (not completely sure how, but probably by reheat); fire engulfed critical wing structure and contributed to critical performance degradation of two engines; Concorde cannot accelerate after TO on two engines alone (BTW, there is no evidence that Concorde was overweight at TO) and went down.

That’s what the court found also, as far as I understand the verdict (not yet having read it :-) ).

People have said “missing spacer“. Our work on that said: not causally relevant.

People have said “overweight at dispatch“. Maybe, but not at takeoff, as far as anyone can tell.

People have said “airport should have swept runway better“. Maybe, but that wasn’t a direct contributing cause in the intuitive sense of the above sequence of physical events. It would be like blaming the police for Fred’s broken jaw in a street fight because they weren’t around at the time. Thousands of years of legal tradition says the person responsible for the broken jaw is the person who threw the punch. So here: the court said that the entity responsible for the burst tire is the entity that left the titanium strip on the runway; and further, as I understand it, the person who mounted that titanium instead of an aluminium part (presumably because he was judged to have made a professional error: he should have known to mount a softer metal); as well as, to some degree, the people responsible for the aircraft design, even though (and others will agree with me here loudly) the airplane was a triumph of aeronautical technology, as well as the most beautiful artifact ever to have taken to the skies.

Other people (Continental, apparently) said the plane was on fire before it encountered the strip. The report, as well as all of the people I know who know about Concorde, indeed, physical common sense given the undisputed evidence of what happened, have no explanation at all of how that could possibly have been the case. The evidence presented is circumstantial – eye witness testimony from witnesses who were some way away from the scene. There is no physical explanation of the accident which coheres with that testimony at all, after ten years of thinking about it. I take it that that eye-witness testimony was rejected.

Now, that all seems to me, given the system, appropriate, fair, and straightforward.

What is inappropriate, in the minds of many including myself, is that it seems to need a criminal trial, rather than an inquest, to serve this necessary legal function of apportioning the enormous costs of compensation.

It seems to be particularly inappropriate in the case of poor Mr. Taylor. He repaired an airplane. Imagine a wise-owl supervisor, or some angel with perfect foresight, going up and saying “You can’t mount that there! It might fall off in Paris, and Concorde might run over it and lose a huge chunk of tire which causes a fuel tank to explode and dump fuel into the exhaust and lose power and crash!” and him saying “oh, yes, you’re right” and changing it, as it was in Dickens’ Christmas Carol.

Dickens notwithstanding, to English minds there just doesn’t seem sufficient proximity between act and event to justify a criminal-negligence connection. Dickens’ tale was, after all, a Carol. But there he is now, poor chap, with a criminal record, and a 15-month suspended sentence. Mr. Taylor, on behalf of many, probably most, Europeans, I am very, very sorry!

And that is why people are going to shut up and instruct their lawyers, rather than telling accident investigators all about everything they know, if accidents continue to be criminalised. Just as they are already known to do in rail accidents in Germany, for example.



Simulators and Veridicality in Airline Training and Pilot Currency Checks

9 09 2010

In his note in RISKS-26.15, Peter Wayner refers to the article Simulator training flaws tied to airline crashes in USA Today, 31 August 2010 (WWW version), which claims to have shown that «Flaws in flight simulator training helped trigger some of the worst airline accidents in the past decade» and that «More than half of the 522 fatalities in U.S. airline accidents since 2000 have been linked to problems with simulators».

I like to think I keep well up to date with commercial aircraft accidents, their analyses and causes, and am aware of simulator strengths and weaknesses. This suggestion struck me as somewhat thin. But if one reads the sentences literally, with their main verbs “helped trigger” and “have been linked to“, they do not speak of causes or causal factors. I can “help trigger” an accident if some USA Today journalist is so enraged by reading this note on his/her Blackberry that he/she runs a red light. And I can link USA Today with whom I wish simply by mentioning them in the same sentence in a Risks note. I am sure the newspaper intends stronger links than this, but it would be good to know what and how, and the article gives no clue. The NTSB uses the words “probable cause” and “contributing factors” in their conclusions and these terms have more precise meanings.

The article mentions three accidents: the November 12, 2001 American Airlines Airbus A300-600 loss of control on climb-out from New York; the December 20, 2008 Continental Airlines Boeing 737-500 takeoff loss of directional control at Denver; and the February 12, 2009 Colgan Air Bombardier Q400 loss of control on approach to landing at Buffalo. The abstracts and links to the full reports are to be found on the NTSB WWW site as, respectively, DCA02MA001, NTSB Abstract AAR-10/04 and NTSB Abstract AAR-10/01. I invite readers to take a quick look at these very short synopses. These three accidents total 315 deaths and the USA Today article does not say which other accidents it counts.

Only the Denver accident causes and factors specifically mention simulators. The pilot flying lost directional control of the aircraft on the runway during takeoff, because of very high gusting crosswinds. The gust “exceeded the captain’s training and experience”, and according to the NTSB he failed effectively to use rudder to control the aircraft in the gust. The first contributing factor allows us to conclude that the crew did not receive timely and accurate information on the actual wind strength and direction. The second contributing factor is “inadequate crosswind training in the airline industry due to deficient simulator wind gust modeling“.

It is widely accepted in the industry that the most recurrent feature of most large-airplane commercial air accidents worldwide in the last few years has been loss of control. It used to be controlled flight into terrain, but it is now widely accepted that the Ground Proximity Warning System (GPWS) and its version Enhanced by terrain mapping using GPS and terrain maps (EGPWS) have reduced the incidence of such accidents considerably (although they still occur, as to an Airblue Airbus A321 on approach to Islamabad on 28 July, 2010 – see the Aviation Safety Net brief report).

The 2001 American Airlines accident was loss of control because of structural failure: the vertical fin separated from the aircraft. The NTSB found that the pilot flying had caused that separation by overstressing it through “rudder reversal” control inputs; contributing were the rudder control system design of Airbus, and American Airlines Advanced Aircraft Maneuvering [sic] Program AAMP. The NTSB heard both that AAMP discussed use of rudder to help recover from upsets, and that the FAA, Airbus and Boeing had expressed concern about this in a letter to American Airlines four years before. The pilot flying had been observed on a previous flight using rudder to control unwanted aircraft movement from environmental disturbance, and the captain on that flight, who gave evidence to the inquiry, had discussed it with him then. I refer Risks readers interested in more to the report, as well as to my paper The Crash of AA587: A Guide. The AAMP does involve simulator work, but a simulator cannot be known accurately to represent what would happen during unusual piloting rudder-reversal behavior because, well, until the accident nobody knew at what point airframe structure would fail (it turned out to be some one-third stronger than required by certification regulations)!

The pilot flying the Colgan Air accident aircraft reacted inappropriately to a stall warning, by pulling on the stick, and holding it back against the attempts of the automatic “stick pusher” system to push it forward. This resulted in the aircraft stalling at low altitude. Pushing the stick forward is the appropriate response. There was considerable discussion of the pilot’s aptitude, his level of awareness (relating to possible fatigue), and his overall Q400 training at Colgan Air. The NTSB remarked on features of that airline’s training program, which of course involves simulator work. But I don’t think it would be appropriate to conclude that there is anything much wrong with the simulators themselves.

Simulators do not necessarily accurately represent the behavior of aircraft close to the “edge” of their “flight envelope”, and they cannot be taken to do so for flight outside the envelope. Aerodynamicists study these “out of envelope” characteristics by use of wind tunnel models, but actual aircraft are not flown in flight test “out of envelope” except for certain restricted manoeuvres prescribed in the certification regulations (such as flying at “maximum operating airspeed” and initiating a 7.5° nose-down dive for 20 seconds, to mimic an overspeed excursion from cruise). For most “out of envelope” flight, aerodynamicists can make very well-educated guesses (from their wind-tunnel modelling) as to what might happen, but they are the first people to say that they are not at all certain. Nobody goes out to flight-test Boeing 747 aircraft in partially-inverted almost-vertical semi-spins, such as what happened to a China Air Lines Boeing 747 over the Pacific near San Francisco in 1985 (see the digitised version of the NTSB accident report in the entry in our Compendium. Incidentally, the human factors chair on this investigation tells me this was a watershed event for the investigation of human biorhythms and possible fatigue as potential contributors to accidents).

So there are limits to what simulators can achieve, and it is a matter for research how much “out of envelope” behavior can be usefully and veridically simulated. Since loss of control is now prominent amongst probable causal factors of accidents, it seems to me obviously worthwhile to perform this research. Where it will lead is anybody’s guess, as with most research. However, the NTSB’s concern in the Denver report is with situations that could be veridically modelled in flight simulators but currently are not. That could be, and probably should be, fixed.



Fully-Automatic Execution of Critical Manoeuvres in Airline Flying

3 09 2010

David Learmount’s semi-annual review of commercial air accidents has just appeared in Flight International (3-9 August, p34). There were three accidents to high-performance large commercial passenger jets: (1) an Ethiopian Airlines Boeing 737-800 took off from Beirut over the sea at night and ended up in the ocean (25 January); (2) an Afriqiyah Airways Airbus A330-200 impacted the ground violently on approach to Tripoli’s RWY 9 (12 May); (3) an Air India Express Boeing 737-800 overran the runway at Mangalore (22 May). Recently, not included in David’s survey, (4) an Airblue Airbus A321 impacted high terrain while on approach to Islamabad (28 July); (5) an AIRES Colombia Boeing 737 landed and broke up on RWY 6 of San Andres Island (16 August); (6) an Embraer 190 of Henan Airlines impacted short of the runway and broke up on approach to Yichun (28 August).

Taking a random six months of accidents is not a sample conducive to pointing to trends using statistical methods; it is well-known amongst students of commercial air accidents that there are “fashions”, common features which cluster at a certain time, but which then reduce, without anybody necessarily doing anything much different. However, let us start here with the question that is the theme of this note:

Which of these accidents would likely have been avoided had the aircraft been fully automatically controlled?

Unmanned aircraft such as the military Global Hawk reconnaissance aircraft routinely fly complete missions under automatic control, from full stop to full stop. Other unmanned aircraft, such as the Predator «drones» used by the US Military in Afghanistan, and for US southern-border patrol, are remotely piloted, but have had control problems with the remote-piloting regime, as for example in this analysis of a US southern-border accident by Johnson and Shea. I want to emphasise here that we are indeed in the era in which fully automated long-distance flights are routinely flown (if only at present by the US Military, and, soon, other NATO allies with the Euro Hawk).

(1) Ethiopian had taken off into a «black hole» over the ocean at night, in other words into an environment in which there were no outside visual references whatsoever. The aircraft was performing a climbing turn, when it started to descend and disappeared from radar. There were electrical storms in the vicinity. The causes are not yet known, but certain factors have been proposed as hypotheses. The accident is almost certainly loss of control (LOC): no one presumes that the pilots committed suicide/murder. First, spatial disorientation of the pilots. This is a historical factor in the records of accidents in night takeoffs and landings in «black holes», such as over oceans. Second, a weather-related upset, say windshear of some kind causing LOC. Such phenomena are also known historical factors. It is understood that no technical defects have yet been identified, but I also understand that the investigation is not yet complete.

If spatial disorientation of the pilots had been a causal factor, this would have been avoided by full automatic control of the takeoff and after-takeoff manoeuvring.

(2) Afriqiyah was approaching RWY 9 at Tripoli, in clear weather but with reported «low, hazy visibility» (Learmount, op. cit.). «Information from the FDR and CVR indicates that there were no technical faults on the aircraft and fuel starvation was not an issue» (Learmount, op. cit.). Aviation Herald confirms this in its report, see in particular the update from the investigator’s information on 14 August. It impacted the ground heavily (even violently), some vertical distance below the approach path, indicating a high rate of descent. The impact was about 900m from the runway, according to Aviation Safety Net’s report. The ground in the area of the airport is more or less flat. Although the VOR was NOTAMed unreliable, there is an NDB approach to RWY 9. The aircraft is capable through GPS equipment and NDB reference of constructing a «Continuous Descent Approach» (CDA) path, which gives a more-or-less constant rate or angle of descent to the point of touchdown, constructed by the Flight Management Systems using the exterior navigation aids, and it would have been able to do that at this airport at this time, as far as is now known. If the aircraft had been on a CDA, it would have been at about 200 feet altitude at this point (the arithmetic: assuming 3° approach path, about one-in-twenty, and a touchdown point 300m from the runway threshold, the aircraft impacted about 1200m from touchdown point, at which point it should have been at 60m above touchdown zone elevation (TDZE)).
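The arithmetic in the parenthesis above can be checked with a few lines of Python. This is only a sketch under the assumptions stated in the text (a straight 3° path, a touchdown point 300m beyond the threshold, impact 900m short of the threshold); the function and variable names are mine, chosen for illustration, and have nothing to do with how a Flight Management System actually constructs a CDA.

```python
import math

FT_PER_M = 3.28084  # feet per metre


def height_on_path_m(dist_short_of_threshold_m, touchdown_offset_m=300.0, gradient=1 / 20):
    """Height above TDZE of a straight descent path, at a point short of the threshold."""
    dist_to_touchdown_m = dist_short_of_threshold_m + touchdown_offset_m
    return dist_to_touchdown_m * gradient


# Rule of thumb used in the text: 3 degrees is about one-in-twenty.
h_rule_m = height_on_path_m(900.0)
# Same geometry, but with the exact tangent of 3 degrees as the gradient.
h_exact_m = height_on_path_m(900.0, gradient=math.tan(math.radians(3.0)))

print(f"rule of thumb: {h_rule_m:.0f} m, about {h_rule_m * FT_PER_M:.0f} ft")
print(f"exact 3 deg:   {h_exact_m:.0f} m, about {h_exact_m * FT_PER_M:.0f} ft")
```

Either way the expected height comes out close to 200 ft, so the one-in-twenty approximation makes no practical difference to the argument.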

Automatics are capable of controlling the airplane within tens of feet of a given path, and routinely do so (indeed, they must do so in certain flight phases, such as cruise in European RVSM airspace). Given that there were no technical issues identified with the aircraft by the investigation, and violent weather was not a factor, a fully-automated CDA would have landed the aircraft on the runway, or at least ensured it was not some 200 ft below where it should have been assuming a normal 3° continuous-descent approach path.

(3) Apparently the Air India Express Boeing 737 «landed on RWY 24 just beyond the touchdown zone, in fair weather with no rain. It overran the runway end and plunged into a ravine» (Learmount, op. cit.). According to the report by Aviation Herald, the runway has an ILS, required landing distance was 7500 ft and the runway length was 8100 ft. There is no word yet, to my knowledge, on possible causal factors.

This seems to have been a routine landing, with no compromising weather. Such landings are routinely accomplished fully automatically, by the Hawk UAVs.

(4) The Airblue A321 had completed an ILS approach to RWY 30 at Islamabad, had turned right at low altitude and then left, to fly parallel to the runway. The crew is supposed at this point by many (with whom I currently concur, given the information available) to have been attempting a circle-to-land (CTL) manoeuvre, likely to land on RWY 12 (the reciprocal of the approach runway). CTL is a routine instrument flight rules manoeuvre, permitted from the ILS approach to RWY 30 as shown in this snippet from an approach plate, posted by «aterpster» in the PPRuNe discussion forum. In a CTL manoeuvre, the pilot, upon «obtaining a visual with» (i.e., seeing) essential parts of the runway or its environment, manoeuvres to land the airplane, provided the visual contact is continually maintained. If visual contact is lost, a routine «missed approach» manoeuvre must be immediately initiated. During the manoeuvre, the airplane must be flown within a given radius, just over 5 nautical miles, of a specified point on the airport. A diagram of this circling radius, overlaid on a plan of the airport and environment, appears in this post by «aterpster» in the PPRuNe discussion forum. A first approximation to the crash site, overlaid on a map with some of the navigation detail, including the CTL radius from a post by «aterpster», may be seen in this post by «PJ2», who updated his estimate of the approximate crash location some time later in this post. The crash site is reported by Aviation Herald to be about 10 nautical miles away, and in this early article in FlightGlobal, the WWW site of Flight International, to be 9.66 nm. The print version of the article (Flight International, 3-9 August 2010, p7) says 9.7 nm. There were reported to have been «no technical problems» in a later article in FlightGlobal. So the impact site was at about twice the allowable CTL radius.
The CTL radius encloses only flat land; the aircraft impacted «rising terrain», in other words a hill/mountain range nearby, but not so nearby as to constitute any danger to normal IFR operations.
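
The «about twice» figure can be checked against the three reported distances. A minimal sketch follows: the radius of 5.0 nm is my rounding of the «just over 5 nautical miles» given above, and I assume the reported distances are measured from roughly the same reference point on the airport as the circling radius, which the sources quoted do not state.

```python
# Assumption: the text says "just over 5 nautical miles"; 5.0 is used here as a round figure.
CTL_RADIUS_NM = 5.0

# Distances to the crash site as reported by the three sources quoted in the text.
reported_distances_nm = {
    "Aviation Herald (about)": 10.0,
    "FlightGlobal web article": 9.66,
    "Flight International print": 9.7,
}


def ratio_to_radius(distance_nm, radius_nm=CTL_RADIUS_NM):
    """How many circling radii away from the reference point the site lies."""
    return distance_nm / radius_nm


ratios = {source: ratio_to_radius(d) for source, d in reported_distances_nm.items()}
for source, ratio in ratios.items():
    print(f"{source}: {ratio:.2f} x CTL radius")
```

Since the true radius is slightly more than 5 nm, the true ratios are slightly smaller than those printed, but on any of the three reports the site lies well outside the protected circling area.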

There is a question, currently unanswered, as to why the EGPWS terrain-warning equipment did not enable the crew to manoeuvre to avoid the terrain.

Unlike a (presumed-)straightforward approach as at Tripoli, current commercial-aircraft automatics do not assist CTL manoeuvring in any reliable manner; the procedure should be hand-flown. However, it is a straightforward manoeuvre well within the capabilities of automatic control systems such as those on the Global Hawk to follow an ILS, and circle to land on the reciprocal runway, within the given limits. Automatics could have accomplished this manoeuvre without going outside the given CTL radius and therefore without a danger of impacting high terrain.

Furthermore, systems currently in test for the USAF, and shortly to become operational, perform automatic terrain-avoidance manoeuvres, even – especially – during the kind of low-level manoeuvring performed by military pilots. The system is called Auto-GCAS and was extensively reported and flight-tested recently by Aviation Week (August 2, 2010, pp50-57). Here is a short blog on it by Stephen Trimble of FlightGlobal from last year.

Some proponents of EGPWS have suggested that avoidance manoeuvres in commercial air operations be automatically initiated and flown. This is well within current capability, as shown by Auto-GCAS.

(I have mentioned anonymous writers above. Here is what I know of them. “PJ2” is someone I know, and with whom I have discussed accidents for a decade. He is a recently-retired captain for a major airline, where he was deeply involved in setting up the airline’s FOQA program. He is expert in aviation safety matters and I value his advice considerably. I do not know “aterpster”, but have read many public contributions by him. He self-identifies as a former airline pilot who has been officially involved in accident investigations as a designated representative of pilots’ organisations.)

(5) Initial reports of the AIRES accident suggest that the aircraft landed short, for example this report in Aviation Herald. Weather is reported by FlightGlobal to have included thunderstorms in the vicinity. Some commentators on the PPRuNe thread have suggested that the main gear was torn off upon reaching the runway hard surface, which is elevated slightly above the surrounding terrain (one imagines the wheels sinking into soft ground before the runway, and then impacting the hard runway construction).

It is not possible at this point to estimate the causal influence of the weather – one notes in the above references that the aircraft was reported to have sustained a lightning strike on final approach. But a landing of this sort to the TDZ is routine, even in stormy weather, for digital flight control systems. Providing, of course, they are sufficiently well insulated from the effects of a lightning strike.

(6) The Henan incident was also a landing-short, in reportedly benign weather – see for example the report in Aviation Herald – on a non-precision approach (NPA). The weather was reported as «foggy», but of course fog is incompatible with the kinds of atmospheric disturbances which might lead to control problems, and is not an issue for automatic control. A fully automatic landing was possible in these conditions, but not necessarily in the E190 accident airplane.

At this point, there is no public information about any technical problems with the flight. NPAs have been known for decades to be more accident-prone than precision approaches (ILS), but modern automation such as on the Embraer 190 can routinely perform CDAs, as discussed above with respect to the Afriqiyah accident.

None of the final reports are out, or expected yet, for any of these accidents. As things stand at present, the Ethiopian and AIRES accidents could have had the causal involvement of atmospheric disturbance; we don’t know. But other potential causal factors would have been mitigated if the manoeuvres had been performed fully automatically. In the case of the other four accidents, it seems quite reasonable to assert that the accidents would have been avoided had the manoeuvres been performed fully automatically, which is outside the current capabilities of commercial-aircraft avionics but certainly within the routine capabilities demonstrated by Global Hawk UAVs and the USAF’s Auto-GCAS.

There are of course substantial safety issues with fully-automatic flight in civil airspace. It is correct to say that at this point it is not operationally feasible. For a recent review of some issues, see the forthcoming paper Computational Concerns About Integration….. by Johnson, to be read in two weeks at the SAFECOMP conference in Vienna.

So no one is yet suggesting, even for the medium term, pervasive fully-automatic commercial air transportation. But in light of the observations above concerning the six 2010 fatal accidents to large commercial jet aircraft, it does look as if it would be worthwhile to research whether standard approach and landing manoeuvres could be transitioned to routine fully-automatic execution.



Malware and the August 2008 Madrid Spanair Take-Off Accident

27 08 2010

On 20 August 2008, an MD-82 aircraft of the airline Spanair crashed on takeoff (TO) from Madrid-Barajas airport. The high-lift devices on the wing had not been properly configured to give the necessary lift on takeoff, and the aircraft was unable properly to lift off as planned. See Aviation Safety Net’s report of this accident for more details.

There had been a maintenance issue during a previous attempt at departure, and maintenance personnel had addressed this issue. In effecting the repair, however, the takeoff configuration warning horn, which aurally warns the crew that the high-lift devices are not appropriately configured for takeoff, had also been disabled. The crew is required, in the pre-take-off check list which they have to perform, to check that the aircraft is appropriately configured for takeoff, and it seems that they did not do so at the second departure: they performed some of the items, but not the full list.

Spanair uses a ground-based computer to process aircraft logs for maintenance issues. The fault which caused the accident aircraft to return to the gate had apparently occurred more than once the previous day, and been logged. But the press has recently reported that malware in this computer delayed the processing of reports, and so maintenance was not aware of the problem the previous day, when they would have been able to correct it, before the fated flight. The press reports have thereby connected this malware with the accident. See, for example, a summary in English of the reports by Daniel Johnson on the University of York Safety Critical Systems Mailing List.

Brian Reynolds commented on these reports that “This is totally bogus” and clarified that he meant that it is “totally bogus” “[t]hat a virus or Trojan in a ground maintenance computer is casually related to this incident.”

Reynolds seems to be denying the claim that malware in a ground-based maintenance computer is causally related to the accident. But he omitted to say what his criterion for causal-relatedness is.

I have one: the concept of necessary causal factor, proposed in 1973 by the philosophical logician David Lewis, who credits the concept to David Hume (his “second definition” of cause). I took over Lewis’s semantics 15 years ago for use in failure analysis.

According to this semi-formal, objective notion of causal factor, there is demonstrably a chain of causal factors leading from the presence of the malware to the accident. According to this concept, Reynolds is provably wrong.

So now let me show this.

Here is the Counterfactual Test:

Let A and B be events or states.

A is a necessary causal factor in the occurrence of B just in case:

If A had not occurred, then B would not have occurred.

This last sentence is called a counterfactual (or contrary-to-fact) conditional. “Conditional” comes from the “if…then…” form; “Counterfactual” from the fact that A and B did as a matter of fact happen, and one is supposing what the world would then have been like had A not occurred. In order to determine this, I adapt the Lewis semantics: suppose A had not occurred, but the world stayed otherwise as similar as possible to the actual state of affairs that pertained. Would B have occurred in this possible state of affairs? Most often, we cannot answer with absolute certitude “yes” or “no”, but it turns out that we can most often answer “most likely, yes”, or “most likely, no”. The Counterfactual Test is to ask this question I just posed. If the answer is “most likely, no”, the Counterfactual Test is “passed” and A is a necessary causal factor of B. If the answer is “most likely, yes”, then A is not a necessary causal factor of B. We have found the Counterfactual Test to be very useful in complex engineering failure analyses.

To show a causal connection between the presence of malware on the maintenance computer and the accident, here are five instances to check with the Counterfactual Test:

1. Had the malware not been present, the fault causing the phenomenon would have been noted by maintenance personnel in a timely manner (let us say: at latest, end of the previous day).
2. Had the fault causing the phenomenon been noted by maintenance personnel in a timely manner, it would have been appropriately repaired before the accident flight.
3. Had the fault been appropriately repaired before the accident flight, the TO-configuration warning would have sounded on the accident flight.
4. Had the TO-config warning sounded during TO on the accident flight, the TO would have been aborted when the warning sounded and the aircraft properly configured before subsequent TO.
5. Had the TO been aborted when the warning sounded, the aircraft would not have crashed as it did.

I consider all of these counterfactuals to be true according to the Lewis semantics. It follows:

1a. The presence of the malware was a necessary causal factor in the lack of timely awareness of the fault.
2a. The lack of timely awareness of the fault is a necessary causal factor in lack of timely repair.
3a. The lack of timely repair is a necessary causal factor in the TO-config warning inhibition.
4a. The TO-config warning inhibition is a necessary causal factor in continuing TO to loss-of-control.
5a. Continuing TO to loss-of-control is a necessary causal factor in the accident.

So, there is a chain of six causal factors, chain-length five, connecting the presence of malware to the accident. QED.
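The bookkeeping behind "six causal factors, chain-length five" can be sketched in a few lines of Python. This is only an illustration: the factor labels are my paraphrases of items 1a to 5a, and `causal_chain` is a hypothetical helper, not part of any WBA tool. The sketch records the chain; it does not itself evaluate any counterfactual.

```python
# Each effect is mapped to the necessary causal factor claimed for it in
# items 1a-5a above. The labels are paraphrases chosen for illustration.
NECESSARY_FACTOR_OF = {
    "lack of timely awareness of fault": "malware present on maintenance computer",
    "lack of timely repair": "lack of timely awareness of fault",
    "TO-config warning inhibited": "lack of timely repair",
    "TO continued to loss of control": "TO-config warning inhibited",
    "accident": "TO continued to loss of control",
}


def causal_chain(effect, factor_of):
    """Walk back from an effect through its recorded necessary causal factors."""
    chain = [effect]
    while chain[-1] in factor_of:
        chain.append(factor_of[chain[-1]])
    chain.reverse()  # earliest factor first
    return chain


chain = causal_chain("accident", NECESSARY_FACTOR_OF)
print(len(chain), "factors,", len(chain) - 1, "links")
for factor in chain:
    print(" ->", factor)
```

Counting nodes and links separately is the whole point: six factors are joined by five "necessary causal factor" relations.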

I emphasise, just to avoid misunderstanding, that these are by no means the only causal factors relevant to the accident: that the crew failed adequately to perform the pre-takeoff check list on the accident flight is most certainly a necessary causal factor in the loss of control. The reader is invited to try out the Counterfactual Test to assure him- or herself of this.

Applying the Counterfactual Test rigorously throughout the list of potentially-relevant factors, to see which ones are indeed causally relevant and which not, is the core of our analysis method Why-Because Analysis (WBA). For those interested in seeing relatively quickly how we perform WBAs nowadays, there is available a case study on how to perform a WBA using the SERAS Reporter and SERAS Analyst tools. Here is some general info concerning our experience with Why-Because Analyses. Typically, depending on the level of detail provided by the investigation, a detailed causal analysis (which we represent in graphical form as a Why-Because Graph) ends up showing a hundred to a couple of hundred individual factors, of which a quarter to a third are “root-causal factors”, that is, causal factors which are not regarded as themselves having pertinent causes. So WBA also includes a fair amount of bookkeeping, or “complexity control”, or whatever one wants to call it. For example, given a WBG with a couple hundred items, one would assemble these causal factors into a small number of subgroups, and give these subgroups appropriate titles, to provide an “executive summary” of the analysis. The SERAS Reporter and SERAS Analyst software is available as freeware from Causalis Limited.
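To make the notion of "root-causal factor" concrete, here is a minimal Python sketch of a toy Why-Because Graph. Everything in it is an assumption for illustration: the five factors are placeholders loosely modelled on the discussion above, not a real analysis, and `root_causal_factors` is a hypothetical helper rather than anything taken from the SERAS tools.

```python
# A toy Why-Because Graph: each factor maps to the list of its necessary
# causal factors. The factors are hypothetical placeholders.
WBG = {
    "accident": ["loss of control"],
    "loss of control": ["warning inhibited", "checklist incomplete"],
    "warning inhibited": ["fault not repaired"],
    "fault not repaired": [],
    "checklist incomplete": [],
}


def root_causal_factors(wbg):
    """Return the factors for which no further causes are recorded."""
    nodes = set(wbg)
    for causes in wbg.values():
        nodes.update(causes)  # include factors that appear only as causes
    return sorted(n for n in nodes if not wbg.get(n))


roots = root_causal_factors(WBG)
print("root-causal factors:", roots)
print("share of all factors:", len(roots) / len(WBG))
```

In a real graph with a couple of hundred factors, the same query is part of the bookkeeping that makes an "executive summary" possible.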

We can well expect a full WBA of the Spanair accident to contain between a hundred and a couple of hundred factors.



Understanding Aerodynamics of Stalls

28 07 2010

Recently, most commercial transport airplane manufacturers have been revisiting their FCOM procedures for “stall recovery” (actually, procedures avoiding that an approach to stall turns into a stall). This may be related to the spate of recent accidents in which commercial airplanes have been stalled: Colgan Air in Buffalo, Turkish Airlines in Amsterdam, XL Airways in Perpignan. Such a spate of loss-of-control (LOC) accidents is a sudden new development in aviation accident statistics. People are concerned it might signal a trend and are looking for possible causes of this trend, if it is one.

A discussion on such matters started on the Professional Pilot’s Forum PPRuNe on a thread with the ungrammatical title of New (2010) Stall Recovery’s @ high altitudes. I agree with the moderator, who goes by the handle of John_Tullamarine, that the discussion has been stimulating, although I had my doubts as it started, which readers of the thread may observe.

The discussion has been enlightening in a number of respects. One aspect which startled me is how limited the understanding of stall appears to be – what it is, when it happens, and its functional relationship with other aerodynamic parameters. The stall is one of the most important, if not the most important, phenomena with which pilots must cope (preferably by avoidance), as is buffet. I conclude that such understanding amongst line pilots could be easily improved. There are graphs used by aerodynamicists; they are all more or less the same shape no matter what the airplane. You can find them in any intro-aero book, say John D. Anderson Jr.’s Introduction to Flight, or Richard Shevell’s Fundamentals of Flight, without any numbers on them, as well as in many FCOMs with numbers on them. Why are they not studied in type training and the knowledge tested?

It could be – has been in the thread – argued that “pilots don’t need to know” such things. As off and on a professional educator for the last few decades, I have participated in enough discussions about what technical practitioners need, respectively don’t need, to know. I have seen what happens, at enough places. We as a profession are now – have been for a decade or two – giving computer science degrees to people who can’t really program very well, at least not according to the standards we used to have. Do I think this is – ever – a good idea? No, I don’t. Do I think people should be professionally flying complex airplanes without understanding the aerodynamics presented in the FCOM? No, I don’t. Although I imagine not everyone will agree.

My practical philosophy of education is as follows. My default answer to what people need to know is: everything. That said, there are practical limitations (of ability, of time) which entail prioritising knowledge and intellectual skills in using that knowledge (which, while related, are not the same thing).

Can we reduce such knowledge to algorithms, to operational instructions, as has been suggested in the thread: “if this happens, do Y”? I am sceptical. Choosing the correct action as a pilot requires appropriate situational awareness.

There was, for example, considerable debate about tail-plane stalls and training syllabuses following the Colgan Air upset in Buffalo. Colgan Air used a NASA video about stalls in icing as a training aid, and this video emphasised so-called tailplane stalls due to icing, for which the remedy is apparently inconsistent with the action required for a main-wing stall. The Q400 in the Buffalo crash is not at all susceptible to tailplane stalls, and is equipped with a stick pusher to prevent main-wing stalls. However, the pilot flying pulled the stick back, overpowering the stick pusher repeatedly, which is exactly the wrong thing to do if the main wing is on the point of stalling (it is, after all, the purpose of the stick pusher to do the right thing in this situation) but might be appropriate for tailplane stalls. It was therefore questioned whether the pilot had appropriate awareness of the aerodynamic situation the aircraft was in, and it has been concluded that he did not.

Having appropriate situational awareness requires understanding the phenomena, as well as understanding the limits of understanding. One needs, for example, to distinguish between what some people call “stalled” (namely, at or just beyond the maximum value C_L_Max of the coefficient of lift, C_L, when large parts of your wing may still be flying) and “fully stalled” (when none of your wing is flying). The question arose in the thread whether one may use ailerons to lift a dropped wing at stall. The obvious answer is that you can if that part of the wing is still flying, but you most definitely should not if it is not. How do you tell which situation you are in? Trying and seeing is not a wise option.

Consider, for example, FCOM procedures concerning “stall warning” on a popular large airplane after lift-off (A330, 3.04.27 P 5a):

THRUST LEVERS ….. TOGA
At the same time:
PITCH ATTITUDE …..REDUCE
BANK ANGLE………..ROLL WINGS LEVEL
SPEED BRAKES……..CHECK RETRACTED

This assumes that you are at high angle of attack (AoA) but not yet stalled, and that the ailerons are still flying. The stall warning may also go off at high altitude, in which case: “relax the back pressure on the sidestick and reduce bank angle, if necessary”. In both cases, it is assumed that the wing is flying, but that bits of it are telling you they might not for much longer, and you need to back away from that point. These procedures obviously won’t help much at all if your nose gets to be way up in the air at 45°-60° of pitch, as happened at Perpignan with a related airplane.

The answer to telling which situation you are in is probably found in a good intuitive understanding of the aerodynamics in the FCOM, and for that one needs a good basic understanding of aerodynamics in general. One illustration of this is the suggestion that was made in the thread on a potential means of discriminating stall buffet from Mach buffet: the feeling of the frequency of the buffet.

This also illustrates the limitations of simulation, a topic on which it seems not all thread contributors are clear. Many people still seem to think that flight simulators, including the expensive moving kind used for airline pilot training and recurrency, are veridical around upset scenarios. How on earth do these people suppose simulators can reproduce veridical buffet? That some aerodynamicist has sampled the frequencies of buffets in the wind tunnel, and given that to some simulator programmer to reproduce, as well as some engineer to make sure that none of it coincides with the resonant frequencies of the simulator? And most of those wind tunnel models that generate the data fly without horizontal tail pieces; what is the effect of the tail? Mostly, one doesn’t actually know, but extrapolates from one’s experience as an aerodynamicist. I feel that a basic understanding of aerodynamics would cure many illusions about the veridicality of flight-simulator behavior outside the normal flight envelope.

Whatever one thinks about what pilots should know or not know, it seems to me a good idea to clean up the vocabulary, suggested through the following examples.

“Stall” is a term of art: for example, sometimes it means the same as “at C_L_Max”, and sometimes it means “the point at which buffet is severe enough to discourage further increase in AoA” (cf. the definitions used in the airworthiness certification document, CS 25). Does being at or over the stall mean you have no lift? No, actually you may have more lift than in most other regimes of flight (just over C_L_Max) even though you might be shaking severely, or you may have much less (AoA way over that for C_L_Max).

Another terminological inexactitude resides in the terms “low-speed stall” and “high-speed stall”. The first refers to the situation in which the AoA is too high for the speed; the latter often refers to a transonic overspeed situation, in which lift is reduced because of the formation of shock waves over certain parts of the wing. Because these waves form at or near the leading edge, they reduce lift forward and thereby move the center of aerodynamic lift rearwards, leading to a nose-down moment about the center of gravity of the aircraft, which gives nose-down pitch or “Mach tuck”. Use of this terminology leads one to the anomalous-sounding phenomenon of the “low-speed” stall at “high-speed”. Maybe it would be preferable to say “high-alpha stall” or “high-AoA stall” rather than “low-speed stall”, and to use the word “transonic” rather than “high-speed” to indicate effects of shock waves on lift?

Another vocabulary hang-up occurred in the discussion on the thread of V_s1g, or stall speed at 1g. Is it a constant speed or not? If not, with which aerodynamic parameters does it vary?

V_s1g would occur when lift at C_L_Max is equal to weight (W). Lift = q x S x C_L, where q is dynamic pressure and S is an area term usually taken to be the area of the wing planform. So at V_s1g, W = q x S x C_L_Max. Given that q = ½ x density x V^2, we can solve for V: V_s1g = Sqrt((2 x W)/(density x S x C_L_Max)). S is obviously constant for a given airplane. What about C_L_Max? If you can ignore compressibility effects (i.e., below about 0.3 Mach for most wings) then C_L_Max is effectively constant, as is the AoA at which C_L_Max is achieved.
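For readers who like to see the formula run, here is a small Python sketch of V_s1g = Sqrt((2 x W)/(density x S x C_L_Max)). The airplane is hypothetical: the mass, wing area and C_L_Max below are round numbers I have chosen for illustration, not data for any real type, and `stall_speed_1g` is just a name for this sketch.

```python
import math


def stall_speed_1g(weight_n, density_kg_m3, wing_area_m2, cl_max):
    """Speed (m/s) at which lift at C_L_Max equals weight in 1g flight."""
    return math.sqrt(2.0 * weight_n / (density_kg_m3 * wing_area_m2 * cl_max))


# Hypothetical airplane; all values are illustrative assumptions.
MASS_KG = 60000.0
WEIGHT_N = MASS_KG * 9.80665   # weight in newtons at standard gravity
DENSITY = 1.225                # sea-level standard density, kg/m^3
WING_AREA = 120.0              # S, m^2
CL_MAX = 1.5                   # assumed constant (incompressible regime)

v_s1g = stall_speed_1g(WEIGHT_N, DENSITY, WING_AREA, CL_MAX)
lift_at_v = 0.5 * DENSITY * v_s1g**2 * WING_AREA * CL_MAX  # q x S x C_L_Max

print(f"V_s1g = {v_s1g:.1f} m/s")
print(f"lift at V_s1g = {lift_at_v:.0f} N, weight = {WEIGHT_N:.0f} N")
```

Substituting the result back into Lift = q x S x C_L_Max recovers the weight, which is a quick check that the rearrangement is right.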

Now consider density. Air density obviously varies with altitude, indeed with the properties of the atmosphere on the day and at the place. So if one wants V_s1g to represent the true airspeed (TAS), then this obviously varies, but with a bunch of parameters not measurable with equipment on board most commercial aircraft. However, aerodynamicists like to talk Equivalent Air Speed (EAS), in which inter alia density is defined as sea-level standard-atmospheric density, 1.225 kg per m^3 (kg.m^(-3)).

So V_s1g, as EAS, varies only with (the square root of) weight. Weight obviously varies (with load, fuel burn and so on) but it is not an aerodynamic parameter, and is usually considered constant when talking aerodynamics. It follows that V_s1g, expressed as EAS, is constant.
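The difference between the TAS and EAS views can be sketched numerically. In this Python illustration, the airplane numbers and the thin-air density of 0.4 kg per m^3 are assumptions chosen for the example, and the relation TAS = EAS x Sqrt(sea-level density / actual density) is the standard incompressible one; the function names are hypothetical.

```python
import math

RHO_0 = 1.225  # sea-level standard-atmospheric density, kg/m^3


def stall_eas(weight_n, wing_area_m2, cl_max):
    """V_s1g as equivalent airspeed (m/s): density fixed at RHO_0."""
    return math.sqrt(2.0 * weight_n / (RHO_0 * wing_area_m2 * cl_max))


def tas_from_eas(eas_m_s, density_kg_m3):
    """True airspeed giving the same dynamic pressure at the actual density."""
    return eas_m_s * math.sqrt(RHO_0 / density_kg_m3)


# Hypothetical airplane; the values are illustrative assumptions.
WEIGHT_N = 600000.0
WING_AREA = 120.0
CL_MAX = 1.5

eas = stall_eas(WEIGHT_N, WING_AREA, CL_MAX)
eas_heavy = stall_eas(2.0 * WEIGHT_N, WING_AREA, CL_MAX)
tas_thin_air = tas_from_eas(eas, 0.4)  # 0.4 kg/m^3: an illustrative thin-air value

print(f"EAS ratio for doubled weight: {eas_heavy / eas:.4f}")
print(f"stall EAS {eas:.1f} m/s corresponds to TAS {tas_thin_air:.1f} m/s in thin air")
```

The EAS figure changes only when the weight does, by the square root of the weight ratio, while the corresponding TAS grows as the air thins.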

However, V_s1g, as indicated, say, in the A330 FCOM (3.01.20 P7) is expressed in Calibrated airspeed (CAS), which is the pitot-static-measured airspeed corrected (usually digitally) for the effects of how the sensors are positioned in the air stream, and expressed in CAS there is a correction for pressure altitude, starting at about 20,000 ft for lighter weights, and going down to about 5,000 ft for heavier weights.

So, as a “practical” matter, is V_s1g (at fixed weight) constant or not? As an aerodynamicist, liking EAS, one would say yes, as a pilot, preferring CAS because that is what one sees on the airspeed indicator, one would say no. That could be a source of genuine confusion at times.

A more obvious but less insidious vocabulary hang-up is Mach number. Is it a speed? Strictly speaking, no. It has no units (it is a ratio of speeds: airspeed to the speed of sound, which varies with air temperature); whereas speeds have units of length per time unit (m. s^(-1) or ft . s^(-1) ). However, in response to a question “how fast were you going?” one might well respond “at 0.8 Mach”, and indeed Mach is used in preference to airspeed to adjust for many situations at high altitude. For example, limiting dive is expressed as both speed and Mach number, as is turbulent-air penetration, maximum operating (max. cruise), and so on.
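That Mach number is a ratio, and that the same speed gives a different Mach number at a different temperature, is easy to show in Python. The sketch assumes the usual ideal-gas expression for the speed of sound, Sqrt(gamma x R x T), with standard values for dry air; the 240 m/s airspeed and the function names are hypothetical choices for illustration.

```python
import math

GAMMA = 1.4      # ratio of specific heats for air
R_AIR = 287.05   # specific gas constant for dry air, J/(kg K)


def speed_of_sound(temp_k):
    """Speed of sound (m/s) in air at absolute temperature temp_k."""
    return math.sqrt(GAMMA * R_AIR * temp_k)


def mach_number(tas_m_s, temp_k):
    """Dimensionless ratio of true airspeed to the local speed of sound."""
    return tas_m_s / speed_of_sound(temp_k)


TAS = 240.0              # hypothetical true airspeed, m/s
T_SEA_LEVEL = 288.15     # standard sea-level temperature, K
T_COLD = 216.65          # standard-atmosphere temperature at cruise altitudes, K

print(f"speed of sound at {T_SEA_LEVEL} K: {speed_of_sound(T_SEA_LEVEL):.1f} m/s")
print(f"Mach at {T_SEA_LEVEL} K: {mach_number(TAS, T_SEA_LEVEL):.3f}")
print(f"Mach at {T_COLD} K: {mach_number(TAS, T_COLD):.3f}")
```

The units cancel in `mach_number`, which is why the answer to "how fast?" in Mach says nothing about metres per second until a temperature is supplied.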

Other vocabulary hang-ups occurred in the thread when talking about “approach to stall recovery” and “stall recovery”, and these I feel are insidious. Some correspondents (including the thread originator) insist they have been practicing “stall recovery” in an airplane with a stick pusher, despite the obvious point that if the airplane has a stick pusher and you respect the pusher, it is not stalled. Indeed, many “stall recovery procedures” are more accurately described as stall avoidance procedures, or approach-to-stall recovery. Surely such confusion would be resolved through a little aerodynamical knowledge and some common sense about safety-system design?

One correspondent, when asked repeatedly whether he thinks that test pilots have been going up and stalling Airbus airplanes, in order to rewrite the “stall recovery” FCOM procedures (actually “Stall Warning” in those for the A330 referred to above) and to calibrate simulators, wisely declined to answer. As a veteran pilot, with the handle 411A, said, Has anyone here actually stalled a large swept-wing airliner? I[f] so, what were the results?. Another, Airclues, replied In the early 80’s I was co-pilot on several C[ertificate] of A[irworthiness] air tests on the Boeing 747 when a full stall was completed (I believe that the UKCAA was the only authority that required this) and described his experiences. In other words, actually high-alpha-stalling large commercial aircraft, even for certification, is ancient history. I very much doubt it was done just to rewrite stall avoidance procedures and calibrate simulators.

A useful discussion indeed, but I suspect it will take more than a pilots’ forum thread to sort these issues out.



Screwy Reasoning and Its Study

7 09 2009

Those of us interested in commercial aviation accidents have to deal with a lot of what I shall call screwy reasoning.

Last week, I read a September 2 article in The Times on the crash of AF447 and its aftermath which I felt was somewhat screwy. It suggested that Air France’s attempt to introduce specialised training for the scenario of loss of airspeed data in cruise on A330/340 machines had “provoked anger”. The angered party was unspecified, but the initial paragraph had a “pilot’s union” “accusing” AF of trying to cover up the cause of the AF447 crash.

At the end of the article, the author, Charles Bremner, says that simulator training for “speed problems” at cruising altitude had not previously been offered by Air France, but this is now being done “at the request of all airlines’ unions.”

Assuming the pilots’ union was the party to have been angered, there is an obvious contradiction between a union requesting training, and being provoked to anger by (apparently) the same. So, if not the pilots’ union, it would be good to know who it was who is being angered by Air France’s specialised training.

Not that Air France has much choice in the matter. European Aviation Safety Agency (EASA) Safety Information Bulletin 2009-17, issued on 9 June 2009, which may be found by searching EASA’s Safety Information Bulletin list, requires inter alia that “familiarisation of flight crews with unreliable airspeed indication procedures should be provided through adequate training.” It is hard to imagine how anger can be provoked through airlines following a regulatory safety directive.

The Times article began by suggesting that the Air France pilots’ union had accused “accident investigators”, by which I take it is meant the BEA-convened AF 447 investigation committee, of “covering up the cause” of the crash of AF447. I doubt this is correct, whatever the union is annoyed about. One can only cover up a cause if one knows what the cause is, and I doubt that any professional thinks the BEA or AF or anyone else knows.

Bremner has written quite a lot on AF447. He is a pilot (a private pilot I take it) and he does have useful information on the negotiations between stakeholders (pilots’ unions, airline, government, manufacturer, investigating agency, maybe even passengers….) to share. However, he seems to do so in a selective way which I often find unhelpful, as in this case, and his phraseology is sometimes unfortunate, also as here.

I wrote some comments on the article and the discussion, which appear in the Commentary section under the article. Colleagues who look at the aviation discussion forums have been amazed at how often the 1988 Habsheim accident comes up whenever anything Airbusy surfaces, and so it does in Commentary to this article. But, in these contributions, it is no longer Habsheim. It is the “Paris Air Show”. One wonders what, if anything, is going on in these people’s heads.

Well, there are partial answers. Sociologists have been studying people’s actual reasoning behavior in detail, but I was always missing a link into the literature. Now I have one, courtesy of the New York Times’s TierneyLab blog, which discusses the issue of who was first to the North Pole. One possible answer: Roald Amundsen in an airship in 1926. On the ground – well, ice – it might well have been Ralph Plaisted in 1968 on snowmobile, and Wally Herbert a year later by dogsled.

Tierney points to research on belief formation and retention, by Steve G. Hoffmann and colleagues, who used the Saddam-9/11 non-link to conduct a survey and then interview some responders, and wrote an article about it in the journal Sociological Inquiry, published in 2009. A number of belief formation and retention mechanisms have been proposed in the literature, and Hoffmann et al consider them. I find the paper very readable and it contains a number of references to related work. So this is a way in to studies of reasoning phenomena, for those of us who might be curious about it.



Thoughts on Engineering Communication (with a bit on Ice Particle Icing and AF447)

21 08 2009

I have been thinking recently about professional engineering communication.

I was reminded once again of the lack of consensus by Nancy Leveson’s comment that “[t]he type of limited interaction that is possible by email is just not conducive to communication” as well as her regret at being “… pulled into one of these web debates because it takes so much time and produces so little”, on the University of York safety-critical systems list http://www.cs.york.ac.uk/hise/safety-critical-archive/2009/0369.html. I don’t agree with this view on email. I am a heavy user of email, both for longer essay-style pieces (although I am now moving more towards blogging) and for short exchanges. I consider e-mail lists such as that run by York to be an appropriate and helpful form of professional communication. I might agree partially with her view on WWW forums, because I find some forms problematic for professional purposes, but then again I think some of them work well (for example, the York list archive is a WWW forum).

I think no one medium available to us satisfies all the communicative needs of engineers in a developing field. I propose that prowess in engineering communication, traditionally required for evaluation of academic personnel, be based on more than traditional journal- and conference-paper publishing.

Advance in engineering depends on communication somehow. If one person in the world finds out how to solve engineering problem X, then unless he or she spreads the word, or word gets around via his or her customers, that technique remains hidden and others will not use it to solve problem X.

For the solution of specific engineering problems, or for the communication of engineering problems themselves (such as the “hot topic” of ice particle icing), it seems to me that traditional journal and conference publication works quite well, even though there are all sorts of problems with peer review procedures.

However, for discussion of current practice, or historical practice, and for discussion in general, declamatory articles such as those which appear in journals or conference proceedings don’t work that well. Neither do the magazines (because articles are by their nature declamatory). Journal or magazine letters pages also don’t work that well – witness the recent interchange between Keith Miller and myself on the Gotterbarn/Miller paper in the June 2009 IEEE Computer, which proceeded much more rapidly and fruitfully, but also privately, by e-mail than it did through the letters/reply section in the magazine (IEEE Computer, August 2009). See previous blog posts here for the public exchange.

I hold discussion to be very important in the engineering profession. Witness, if you will, again, the Ladkin/Miller exchange. Had this not occurred, Messrs Gotterbarn and Miller would be on record as holding that the recent A330 incidents were an instance of SE ethical problems of a certain sort, whereas they now agree that the issues are more subtle, if not other, than they originally proposed. A change of view arrived at through discussion.

Consider another example: how does one best handle issues of best practice, such as formal-language specifications versus natural-language specifications? Such issues need discussion: some think “natural language specs are best”; others think “formal language specs are best”, and there are different communities of practice built around these views. If you work in safety-critical electronics in the European railway industry, you must use natural-language requirements specifications because the standard says so, even though you might think this is a load of junk. Whereas if you work in one of the more prestigious sectors of avionics, you would likely do formal-language specs, even if you were a nat-lang-spec enthusiast.

Some people think the standardisation processes suffice for communication of best practice. Others think, as I do, that neither the standardisation process nor the emission of standards suffices to communicate best practice. Indeed, I would go further. I also do not think the emission of standards necessarily embodies best practice, as my contributions over the years on the functional-safety standard IEC 61508 on the York list may indicate.

So what does embody best practice and how does one tell? Well, one thing to observe about the engineering profession is that there is no one way to skin a cat. There are many, and the best engineers will be intimately familiar with all of them, or at least with as many as they can be. One engineer may prefer one way, another engineer another way. What could they suggest to a third engineer, also attempting to skin a cat?

Engineer A: “Do it my way.” Engineer B: “Don’t do it his way; do it my way.”

or

Engineer A: “Do it my way.” Engineer B: “Yes, do it his way; don’t do it my way.”

or

Engineer A: “I do it this way, but any other way will work. However, I can help you best with my way.”

All these answers are possible from responsible engineers, who would have taken into account their interrogator’s environment and that of his or her task.

Engineers must interact this way. It is an important part of what they do. It is communication, it is necessary, and the question I wish to address is how, using what form, it may best be accomplished.

Let’s make it more concrete, with a concocted example whose content appears regularly on the York list.

Question: “I am building such and such a safety-critical system and we have to use the programming language C because that is what we have a compiler for, for the chosen hardware. Is this OK or should I veto the project?”

Answer 1: “Your source code, if it is written in C, will have no well-defined unique meaning. C compilers have odd quirks such as producing different object-code behavior depending on which order one writes the arguments to a test, and ………. So you will not be able to tell exactly what your object code does and thereby not be able to assure the behavior of your system to the required degree. To get the highest degree of assurance attainable by any practice to date, use, say, SPARK and an Ada compiler to avoid the problems with C detailed above, and to take advantage of the documented quality of SPARK code development. This may necessitate changing the underlying hardware if there is no Ada compiler targeted to your hardware. If you can’t change the hardware then recommend SPARK for the above reasons and at the same time veto the project.”

Answer 2: “There exists enough experience with C and C subsets such as the MISRA subset and C static analysis tools that you can be fairly assured of a more-or-less unique meaning for what your object code does, providing you pay a lot of attention to the known weaknesses of C constructs as listed in [a ten-year-old book] and you are careful about your choice of compiler and carefully research the known problems with the compiler and avoid them. The available analysis tools aren’t perfect but they are pretty good for most purposes. And, besides, Engineer Y has shown one can [read: he can] do this in a significant project. And, besides, everybody does it. And, besides, if you are stuck with this hardware, as you say you are, you have no real choice.”

Riposte from Answerer 1: “Sure, Y is one of five people, or fifty people, or one hundred people in the world that have a track record of doing this. Hire him. Or one of the other 5/50/100. Then you might be OK. Else, do it the way I said.”

Now, imagine you are buying the car in which this equipment is installed, one of a few thousand, or a few hundred thousand, or a few million built, for your family. Wouldn’t you rather that such a discussion had taken place in a highly prestigious forum, which as many eminent engineers as possible read and to which they can contribute their views, as required? And that some sort of consensus had developed as to what the questioner should do, and that some sort of assurance was available that he or she had done it?

So what would that forum be? The York mailing list? Not really: not all professionals read that list, and some of them think that “[t]he type of limited interaction that is possible by email is just not conducive to communication.” Leading journals that everybody reads? Well, it doesn’t happen. Or, better said, in my experience the journals in which such things appear are not much worth reading. Why is that?

It is that way, I propose, because this kind of discussion is not accorded the prestige which, say, journal publication of research accrues. In my view, a way should be found to value participation in insightful and fruitful discussion as prestigiously as journal publication, because such discussion is equally vital to engineering, as I hope I have just shown.

Well, a gainsayer might say, Engineer A can publish his or her view in a journal. Then Engineer B can reply. And then Engineer C, and so on.

I don’t think that will work in general. Consider the following recent example.

On June 1, an Air France A330 crashed into the South Atlantic in an area of unstable weather, having sent a series of cryptic maintenance messages from the Central Maintenance Computer as its last communications. Bits of the aircraft have been found, but not the bits most important to knowing why it went down.

Somebody found and published a report from another airline of a flight which had suffered similar phenomena at a similar altitude. And then other reports surfaced. People who had access to these reports had their own professional interests which would induce them to certain behavior, such as keeping them quiet or broadcasting them. Broadcasting is the only stable state: you cannot keep something under wraps once it has been broadcast. One of the major players is an anonymous broadcaster, a WWW site, called eurocockpit.com. The advantage of broadcast in this instance is that all the various pieces of data, available only to some people and not to others, have been brought together into the public domain.

The result of this communication activity has been that, probably within a month and certainly within two months, almost all pilots are aware of and wary of a phenomenon which on May 31, 2009 was not known to exist: high-altitude ice-particle icing of air data sensors. There were individual incidents, indeed many, but nobody knew about them all, and if you just know about one or two perplexing incidents there are many possible causes of it or them. But when you have a dozen, or a couple of dozen, and another one occurs as you are wondering, then it concentrates the mind wonderfully. The result is that EASA has published a proposal for an Airworthiness Directive aimed at replacing all those sensors thought to be more susceptible to ice particle icing than others.

The odd thing about this example is that the airplane in question has been in service for well over a decade, indeed much nearer two, and these incidents have apparently only occurred since March 2008. Explain that one! (Anyone who says “global warming” must go stand in the corner for an hour :-) )

My view is that you cannot explain it at the moment, but that the communication behavior around whatever symptoms of whatever phenomenon we are talking about here (likely ice particle icing) could have been different from what it was up to the loss of Air France flight 447 on June 1, 2009, which apparently suffered these symptoms. And maybe it could have been different in such a way as to have led to measures which could have averted the loss of an airplane and its occupants? A fine article on this history, which raises this question, has recently been written by Jens Flottau and appeared in Aviation Week and Space Technology on August 10, 2009: Response to Airbus Pitot Tube Incidents Under Scrutiny.

To be clear: I am talking here about forms of communication which we use, and not at all about any specific individual or organisational behavior. I am not suggesting that any individual, group or organisation did less than the very best they could about the evolving issue. Indeed, this remark serves to strengthen my suggestion that the communication forms themselves can give us a level of control over engineering developments, such as experiencing, recognising and then handling ice particle icing of air data sensors, which we do not currently possess.

It is not just ice particle icing of air data sensors. Ice particle icing caused engine problems to one type of engine on the BAe 146 airplane. It was not known to occur to others, but some Boeing and Honeywell engineers looked at incidents of surge, flameout and other anomalous events at altitude on other airplanes and came to the conclusion that they were due to icing phenomena at high altitude, sometimes in cloud which was so thin that it barely hindered visibility. This stuff has appeared in the journal literature: see The Ice Particle Threat to Engines in Flight by Mason, Strapp and Chow, 2006, which refers to Cloud Particle Measurements in Thunderstorm Anvils and Possible Threat to Aviation by Lawson, Angus and Heymsfield, 1998. And in 2006 there came NTSB Recommendations to the FAA. But there were still 20,000-hour long-haul pilots (for all I know, still are), a group of people to whom this phenomenon would surely be of great interest, who apparently do not know of this work. One said even as late as a month ago that he does not accept that ice forms below -40°C: http://www.pprune.org/tech-log/381558-ice-crystals-2.html#post5070024, and http://www.pprune.org/tech-log/381558-ice-crystals-3.html#post5074951.

It is through the communication of incidents, each of which was previously known only to a few people, many of those people being different people, that a dangerous phenomenon, ice particle icing of air data sensors at high-altitude and cold temperatures, has been identified. This is a significant engineering achievement. How did it happen? WWW. E-Mail Lists. WWW Forums. And, also, traditional methods of communication amongst appointed representatives of involved organisations. But by no means solely the latter.

So, given that discussion and communication are vital to engineers, and the traditional form of journal publication does not suffice, how should the contribution of, say, a research engineer be assessed? (For purposes, for example, of awarding a prize, or awarding tenure, or of getting an academic job in the first place.) I propose that such assessment also look at participation in these other essential communicative activities and not just traditional publications. I agree there is a problem of parameters and quality control. Just getting hits on your blog isn’t necessarily a good measure; but getting the most hits on your blog of anybody working in your area just might be.

To finish up: what forms of communication work, and how?

1. Obviously peer-reviewed journals and conference papers work.

2. Obviously WWW sites with journal-style papers work.

3. I would contend that moderated, selective forums such as the Risks Forum work.

4. I think some sorts of blogs work. I am sceptical of the frequently-written 200-word anecdotal variety of the sort the IEEE is promoting, but I do like the weekly-essay variety employed to such notable effect by people such as Nobel laureate Gary Becker and Judge Richard Posner at the University of Chicago in their blog. It is by following such blogs for a while that I believe I have come to understand what they are good for, and have started trying to emulate.

5. For specific purposes, such as the wider collection and dissemination of controlled information, carefully-moderated anonymous forums such as eurocockpit.com work.

These are all declamatory forms, with only limited possibility, asymmetrical, for discussion. What works for the kind of essential discussion I illustrate above?

6. Not anonymous WWW forums. I don’t yet know a forum which can be successfully followed unless one has lots of free time and a huge tolerance for purposeless commentary or for poseurs. For example, I have made two unsuccessful attempts to develop a presence on PPRuNe, the professional aviation people’s forum, and PPRuNe seems to me to be head and shoulders above anything else in which one can discuss aviation accidents. The main issue seems to be that moderation attempts are often overwhelmed by the task on high-interest topics, and no one seems to have a good solution to this phenomenon.

7. Yes, non-anonymous controlled-access WWW forums. Such as the York mailing list. (Note that its archive makes this list into such a forum.) A colleague to whom I once mentioned that I had been contacted to write a textbook on safety suggested that all I had to do was collect what I had written on the York list over the years and organise it. (Yes, well, the organisation part. It was simpler to start writing from scratch :-) )

8. Something that does not exist, but well might. Peer-reviewed or moderated (same thing, maybe?) non-anonymous forums for publication of essays and for discussion. There is a fundamental tension between encouraging comment, insight and debate, and insisting on quality. Quality means taking time over composition, which in turn discourages people from contributing. There are such forums at present, for example the functional safety area on the IEC WWW site, but they are not hives of intellectual activity.

9. Jan Sanders suggested using video. A forum in which engineering questions could be put, and engineers give their answers verbally in a video, and videoconferencing could be used to resolve, or at least further to discuss, discrepancies. Like written forums, this would be moderated to ensure quality. The advantage of videos would be that it takes many people less time both to record their views and to receive the views of others through speech than it does through writing, and speech is most effective when one sees the speaker speaking.

I am a fan of debating, I like mailing lists and, newly for me, blogs, and I wish there were some way of professionally assessing contributions to these forms of communication.



AF447: Issues Clarified by the BEA Report

4 08 2009

There are some significant issues which are clarified by the BEA’s preliminary factual report, issued at the beginning of July: specifically the uncertainties and certainties in the meaning and partial interpretation of the maintenance messages received by ACARS; the question of structural integrity; the attitude and flight path of the aircraft on impact with the ocean surface; and the weather phenomena in the vicinity of the flight at the time it was presumed to be lost. The ACARS messages indicate strongly that there was a situation with unreliable airspeed indication. Since the accident more incidents of unreliable airspeed indications at high altitude have come to light. I comment on these continuing developments in a separate post. I comment here on structural integrity and what it tells us about how the airplane may have behaved; weather and position; contacts with ATC; and the interpretation of the ACARS maintenance messages received.

The vertical tail of the aircraft was the major piece of structure found during the search. It had separated, taking some parts of the fuselage, including box-section pieces, with it at one attachment. The question arose whether it could have separated in flight. Our collaborator, the aerodynamicist Clive Leyman, showed in work in June that it would not be possible above, say, FL 170 to FL 200 to generate enough dynamic pressure on the vertical stabiliser of the A330 to cause it to fail. And even at that general altitude, overspeed would be a necessary contributing factor to any failure. His conclusion was based on dynamic-pressure calculations, using the datum that the A330 vertical stabiliser failed during destructive testing at 2.0 times design load. The aircraft was cleared at FL 350. So we knew from Clive’s work that an upset would have been a necessary precursor to loss of structural integrity. The main question thus is: what would have caused an upset?
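To give a feel for why altitude matters so much in a dynamic-pressure argument, here is a rough sketch. It is not Clive Leyman's calculation and says nothing about the A330's actual design loads. It assumes the International Standard Atmosphere (the real atmosphere that night will have differed) and uses the textbook relation q = (γ/2)·p·M² for dynamic pressure at Mach number M and static pressure p. The function names and the Mach number 0.82 are my own illustrative choices.

```python
import math

# International Standard Atmosphere constants (an assumption: the real
# atmosphere on the night differed from the standard one).
P0 = 101325.0           # sea-level static pressure, Pa
T0 = 288.15             # sea-level temperature, K
LAPSE = 0.0065          # tropospheric temperature lapse rate, K/m
G = 9.80665             # gravitational acceleration, m/s^2
R = 287.053             # specific gas constant of dry air, J/(kg K)
GAMMA = 1.4             # ratio of specific heats of air
TROPOPAUSE_M = 11000.0  # height of the ISA tropopause, m


def isa_pressure(alt_m):
    """Static pressure in Pa at a geopotential altitude in metres (ISA)."""
    if alt_m <= TROPOPAUSE_M:
        return P0 * (1.0 - LAPSE * alt_m / T0) ** (G / (R * LAPSE))
    # Above the tropopause the ISA temperature is constant.
    t11 = T0 - LAPSE * TROPOPAUSE_M
    p11 = P0 * (t11 / T0) ** (G / (R * LAPSE))
    return p11 * math.exp(-G * (alt_m - TROPOPAUSE_M) / (R * t11))


def dynamic_pressure(flight_level, mach):
    """Dynamic pressure in Pa at a flight level (hundreds of feet) and Mach number."""
    alt_m = flight_level * 100 * 0.3048
    return 0.5 * GAMMA * isa_pressure(alt_m) * mach ** 2


# Illustrative Mach number only; not taken from any report.
for fl in (170, 200, 350):
    print(f"FL{fl}: q = {dynamic_pressure(fl, 0.82) / 1000:.1f} kPa")
```

Under these assumptions the dynamic pressure available at a given Mach number is roughly twice as high at FL 170 to FL 200 as it is at FL 350, which shows how strongly the load that can be put on a surface depends on altitude.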

Indeed, the BEA determined from inspection of the retrieved wreckage – over 600 individual pieces – that the aircraft hit the water intact, in more or less level attitude with a high vertical rate of descent. This does not conform with the flight path of an aircraft under full control. It suggests, indeed, that the aircraft was aerodynamically stalled when it hit the ocean surface.

The BEA determined, from the loading of the aircraft on takeoff and the estimated fuel burn over the flight profile, that the aircraft had an estimated weight of about 205 tonnes and CG between 37.3% and 37.8% MAC at around the time of disappearance. The half-percentage variation in CG estimate comes from the fact that fuel is pumped around between fuel tanks at cruise, to optimise the lift-to-drag ratio of the aircraft, and there is a limit of 0.5% MAC on the CG shift allowed to occur through pumping. There has been some speculation on the Internet about the margins between stall speed and limiting Mach number at FL350 and weight of 205 tonnes. The margin is some 80-100 kts; this is large enough to allow the pilots considerable leeway in dealing with any in-flight abnormalities, such as having to fly the airplane on “pitch and power” when airspeed indications are unreliable. However, severe or extreme turbulence could make dealing with abnormalities such as unreliable airspeed a very tricky control situation indeed, at any moderately high flight level. It is plausible that an upset could thereby have occurred. The BEA report is factual and does not speculate on this.

There was considerable convective activity in the ITCZ at the time of AF447’s passage. The weather, though, was pretty typical for the time of year, and had no unusual features from the point of view of meteorologists. There was a convective mass extending about 400km E-W, which the route of flight of AF447 crossed. This convective mass had formed at about 0130Z by the fusion of four powerful storm masses, deriving from convective columns (“towers” in French), which had reached their limit and spread out horizontally as their tops reached the tropopause. The strongest of these had attained its most powerful stage many hours before. At 0200Z, the cumulonimbus clouds forming the mass had for the most part attained their mature stage. Although there may have been new columns forming between the mature columns underneath the top of the spread-out mass, there is no evidence for that in the form of a later “overshoot” into the stratosphere, which happens in the case of the most powerful columns. The temperature at the tops of the mass was by and large similar to that of the tropopause, around -80°C, as recorded by satellite 7 minutes before and after the presumed passage of AF447. The tropopause was estimated by the climate model ARPEGE to be at around FL520 at the date and time of the disappearance of the aircraft. Another aircraft participating in real-time weather data collection via AMDAR passed along the route half an hour later at FL325, then climbing to FL350, and did not record anything unusual, confirming largely what one may infer from the satellite images.

The BEA says it is “very likely” that some of the cloud mass contained significant turbulence at FL 350. Electrical activity was also “possible” at this FL. But, crucially for those wondering whether the pitots iced up because the aircraft may have flown into heavy supercooled-rain clouds, the presence of supercooled water was said to be “not very likely” and would necessarily have been limited to very small quantities. I consider the developments with possible pitot icing in a separate article.

The last known position of AF447 was transmitted automatically over ACARS at 0210Z. This position was N2°58.800′W30°35.400′, or N2.98°W30.59° in decimal degrees. The position transmitted was that contained in the “Flight Management” data, which is partly based on the inertial reference system. It could be, said the BEA, that the GPS position differed slightly from this.
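The conversion between the two notations is simple arithmetic: decimal degrees are the whole degrees plus the minutes divided by 60. A minimal sketch follows; the function name is hypothetical, and the convention of giving west and south a negative sign is my assumption, not something the report uses.

```python
def dm_to_decimal(degrees, minutes, hemisphere):
    """Convert degrees and decimal minutes to signed decimal degrees.

    Sign convention (an assumption): north and east positive,
    south and west negative.
    """
    value = degrees + minutes / 60.0
    return -value if hemisphere in ("S", "W") else value


# Last position transmitted over ACARS at 0210Z, as quoted above.
lat = dm_to_decimal(2, 58.800, "N")
lon = dm_to_decimal(30, 35.400, "W")
print(f"{lat:.2f}, {lon:.2f}")  # prints "2.98, -30.59"
```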

This position puts the flight in or close to the column of what had been the most powerful of the fused storms, whose column had attained its most powerful stage many hours before and was at the time in its mature stage. The position is between ORARO and TASIL waypoints and looks to be slightly off the airway.

The last verbal contact with AF447 was by the controller of FIR ATLANTICO, in Brazil, at 0135:43Z. The controller then asked AF447 four times for his estimate at TASIL, without response. There were apparently three attempts at an ADS-C connection with DAKAR, at 0133Z, 0135Z and 0201Z. These failed with code FAK4, indicating either the absence of a flight plan, or a significant discrepancy between flight number, reported position, and planned position. Section 1.9.2 says that at 0146 the DAKAR controller asked for information about AF447 because there was no flight plan. ATLANTICO gave type (A332), airport of origin, destination airport, and SELCAL sign. DAKAR created and activated a flight plan, but there was no connection with the aircraft either on voice or ADS-C. So the first two ADS-C attempts were rejected because of, we may presume, lack of a flight plan with DAKAR at those times. The report does not determine whether the flight plan at DAKAR was activated before or after the last ADS-C connection attempt at 0201Z. Although the transcript of the exchange between ATLANTICO and DAKAR at 0135Z is included in the appendices, the later exchange is not.

As I mentioned in my note of 11 June, the order of the ACARS messages received does not necessarily reflect their order of occurrence. The reasons why are largely the reasons I gave there, with one addition. Fault messages received by the CMC are cached but not sent for a minute, to accumulate and summarise in one ACARS transmission other messages associated with that fault from other avionics devices. These associated messages are indicated by including the reporting device in the fault message compiled by the CMC (using a * for associated messages of type 2, which are not reported to crew because they have no “operational consequences”). There is prioritisation within the CMC, as well as possible race conditions from various BITE devices to the CMC, as well as prioritisation of transmission: the report explains how ACARS messages are prioritised by class. And, of course, possible delays in the transmission and processing of messages through the ACARS transmission system itself.
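The effect of the one-minute caching can be sketched in a few lines. This is a toy model under loud assumptions, not the CMC's actual logic: it treats every fault arriving inside an open window as associated with the fault that opened it (the real CMC associates only related messages), it ignores prioritisation and transmission delays, and the device names, times and function names are invented for illustration.

```python
from dataclasses import dataclass, field

WINDOW_S = 60  # the one-minute caching period described above


@dataclass
class Report:
    """One summarised fault report (hypothetical structure)."""
    opened_at: int   # time in seconds at which the first fault arrived
    source: str      # device whose fault opened the window
    associated: list = field(default_factory=list)  # devices reporting within the window

    @property
    def sent_at(self):
        # The report is held until the window closes.
        return self.opened_at + WINDOW_S


def correlate(faults):
    """Group (time_s, source) fault messages into one report per window."""
    reports = []
    for time_s, source in sorted(faults):
        if reports and time_s < reports[-1].sent_at:
            # Window still open: attach to the report being accumulated.
            reports[-1].associated.append(source)
        else:
            # No open window: this fault opens a new one.
            reports.append(Report(time_s, source))
    return reports


# Invented example data: four faults from four hypothetical devices.
faults = [(0, "DEVICE_A"), (20, "DEVICE_B"), (45, "DEVICE_C"), (90, "DEVICE_D")]
reports = correlate(faults)
for r in reports:
    print(r.opened_at, r.sent_at, r.source, r.associated)
```

In this toy model the faults at 20 and 45 seconds lose their own timing: they appear only as associated devices in a report that goes out at 60 seconds. A receiver looking only at transmission times thus cannot recover when, or in what order, the individual devices complained.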

The interpretation of the messages is, as the BEA says, “delicate”. This is not just because of the indeterminacy of order, but also because, while a fault may be recorded, a subsequent return to normal is not reported; certain alarms such as overspeed are not registered; and although all faults (type 1) are accompanied by a cockpit effect (type 2), not all faults have their cockpit effect registered, and not all cockpit effects have the associated fault registered.

Of the type 2 effects, the BEA says it has not succeeded in explaining the meaning of the cockpit effect NAV TCAS FAULT (cockpit effect is a flag on the PFD and ND) but has explained the significance of the others.

There are five type 1 fault messages, of which the significance of two is unexplained:

the ADIRU2 fault (IR2), associated with messages from EFCS1, IR1 and IR3. The involvement of EFCS1 is a type 2 message, and it is suggested that the correlation window may have been opened by this message;
the FMGEC1 message that was the last received before the cabin pressure warning.

The BEA concludes that the type 1 and 2 messages taken together show that there had been unreliable airspeed measurements and their consequences.

That is it. Not a whole lot more than we knew in mid-June, but some of it more firmly established, especially the interpretation of the weather and the integrity of the airframe.