Simulators and Veridicality in Airline Training and Pilot Currency Checks

9 09 2010

In his note in RISKS-26.15, Peter Wayner refers to the article Simulator training flaws tied to airline crashes in USA Today, 31 August 2010 (WWW version), which claims to have shown that «Flaws in flight simulator training helped trigger some of the worst airline accidents in the past decade» and that «More than half of the 522 fatalities in U.S. airline accidents since 2000 have been linked to problems with simulators».

I like to think I keep well up to date with commercial aircraft accidents, their analyses and causes, and am aware of simulator strengths and weaknesses. This suggestion struck me as somewhat thin. But if one reads the sentences literally, with their main verbs “helped trigger” and “have been linked to”, they do not speak of causes or causal factors. I can “help trigger” an accident if some USA Today journalist is so enraged by reading this note on his/her Blackberry that he/she runs a red light. And I can link USA Today with whom I wish simply by mentioning them in the same sentence in a Risks note. I am sure the newspaper intends stronger links than this, but it would be good to know what and how, and the article gives no clue. The NTSB uses the words “probable cause” and “contributing factors” in their conclusions and these terms have more precise meanings.

The article mentions three accidents: the November 12, 2001 American Airlines Airbus A300-600 loss of control on climb-out from New York; the December 20, 2008 Continental Airlines Boeing 737-500 takeoff loss of directional control at Denver; and the February 12, 2009 Colgan Air Bombardier Q400 loss of control on approach to landing at Buffalo. The abstracts and links to the full reports are to be found on the NTSB WWW site as, respectively, DCA02MA001, NTSB Abstract AAR-10/04 and NTSB Abstract AAR-10/01. I invite readers to take a quick look at these very short synopses. These three accidents total 315 deaths and the USA Today article does not say which other accidents it counts.

Only the Denver accident causes and factors specifically mention simulators. The pilot flying lost directional control of the aircraft on the runway during takeoff, because of very high gusting crosswinds. The gust “exceeded the captain’s training and experience”, and according to the NTSB he failed effectively to use rudder to control the aircraft in the gust. The first contributing factor allows us to conclude that the crew did not receive timely and accurate information on the actual wind strength and direction. The second contributing factor is “inadequate crosswind training in the airline industry due to deficient simulator wind gust modeling”.

It is widely accepted in the industry that the most recurrent feature of most large-airplane commercial air accidents worldwide in the last few years has been loss of control. It used to be controlled flight into terrain, but it is now widely accepted that the Ground Proximity Warning System (GPWS) and its version Enhanced by terrain mapping using GPS and terrain maps (EGPWS) have reduced the incidence of such accidents considerably (although they still occur, as to an Airblue Airbus A321 on approach to Islamabad on 28 July, 2010 – see the Aviation Safety Net brief report).

The 2001 American Airlines accident was loss of control because of structural failure: the vertical fin separated from the aircraft. The NTSB found that the pilot flying had caused that separation by overstressing it through “rudder reversal” control inputs; contributing were the rudder control system design of Airbus, and American Airlines Advanced Aircraft Maneuvering [sic] Program AAMP. The NTSB heard both that AAMP discussed use of rudder to help recover from upsets, and that the FAA, Airbus and Boeing had expressed concern about this in a letter to American Airlines four years before. The pilot flying had been observed on a previous flight using rudder to control unwanted aircraft movement from environmental disturbance, and the captain on that flight, who gave evidence to the inquiry, had discussed it with him then. I refer Risks readers interested in more to the report, as well as to my paper The Crash of AA587: A Guide. The AAMP does involve simulator work, but a simulator cannot be known accurately to represent what would happen during unusual piloting rudder-reversal behavior because, well, until the accident nobody knew at what point airframe structure would fail (it turned out to be some one-third stronger than required by certification regulations)!

The pilot flying the Colgan Air accident aircraft reacted inappropriately to a stall warning, by pulling on the stick, and holding it back against the attempts of the automatic “stick pusher” system to push it forward. This resulted in the aircraft stalling at low altitude. Pushing the stick forward is the appropriate response. There was considerable discussion of the pilot’s aptitude, his level of awareness (relating to possible fatigue), and his overall Q400 training at Colgan Air. The NTSB remarked on features of that airline’s training program, which of course involves simulator work. But I don’t think it would be appropriate to conclude that there is anything much wrong with the simulators themselves.

Simulators do not necessarily accurately represent the behavior of aircraft close to the “edge” of their “flight envelope”, and they cannot be taken to do so for flight outside the envelope. Aerodynamicists study these “out of envelope” characteristics by use of wind tunnel models, but actual aircraft are not flown in flight test “out of envelope” except for certain restricted manoeuvres prescribed in the certification regulations (such as flying at “maximum operating airspeed” and initiating a 7.5° nose-down dive for 20 seconds, to mimic an overspeed excursion from cruise). For most “out of envelope” flight, aerodynamicists can make very well-educated guesses (from their wind-tunnel modelling) as to what might happen, but they are the first people to say that they are not at all certain. Nobody goes out to flight-test Boeing 747 aircraft in partially-inverted almost-vertical semi-spins, such as what happened to a China Air Lines Boeing 747 over the Pacific near San Francisco in 1985 (see the digitised version of the NTSB accident report in the entry in our Compendium. Incidentally, the human factors chair on this investigation tells me this was a watershed event for the investigation of human biorhythms and possible fatigue as potential contributors to accidents).

So there are limits to what simulators can achieve, and it is a matter for research how much “out of envelope” behavior can be usefully and veridically simulated. Since loss of control is now prominent amongst probable causal factors of accidents, it seems to me obviously worthwhile to perform this research. Where it will lead is anybody’s guess, as with most research. However, the NTSB’s concern in the Denver report is with situations that could be veridically modelled in flight simulators but currently are not. That could be, and probably should be, fixed.



Fully-Automatic Execution of Critical Manoeuvres in Airline Flying

3 09 2010

David Learmount’s semi-annual review of commercial air accidents has just appeared in Flight International (3-9 August, p34). There were three accidents to high-performance large commercial passenger jets: (1) an Ethiopian Airlines Boeing 737-800 took off from Beirut over the sea at night and ended up in the ocean (25 January); (2) an Afriqiyah Airways Airbus A330-200 impacted the ground violently on approach to Tripoli’s RWY 9 (12 May); (3) an Air India Express Boeing 737-800 overran the runway at Mangalore (22 May). Recently, not included in David’s survey, (4) an Airblue Airbus A321 impacted high terrain while on approach to Islamabad (28 July); (5) an AIRES Colombia Boeing 737 landed and broke up on RWY 6 of San Andres Island (16 August); (6) an Embraer 190 of Henan Airlines impacted short of the runway and broke up on approach to Yichun (28 August).

Taking a random six months of accidents is not a sample conducive to pointing to trends using statistical methods; it is well-known amongst students of commercial air accidents that there are “fashions”, common features which cluster at a certain time, but which then reduce, without anybody necessarily doing anything much different. However, let us start here with the question that is the theme of this note:

Which of these accidents would likely have been avoided had the aircraft been fully automatically controlled?

Unmanned aircraft such as the military Global Hawk reconnaissance aircraft routinely fly complete missions under automatic control, from full stop to full stop. Other unmanned aircraft, such as the Predator «drones» used by the US Military in Afghanistan, and for US southern-border patrol, are remotely piloted, but have had control problems with the remote-piloting regime, as for example in this analysis of a US southern-border accident by Johnson and Shea. I want to emphasise here that we are indeed in the era in which fully automated long-distance flights are routinely flown (if only at present by the US Military, and, soon, other NATO allies with the Euro Hawk).

(1) Ethiopian had taken off into a «black hole» over the ocean at night, in other words into an environment in which there were no outside visual references whatsoever. The aircraft was performing a climbing turn, when it started to descend and disappeared from radar. There were electrical storms in the vicinity. The causes are not yet known, but certain factors have been proposed as hypotheses. The accident is almost certainly loss of control (LOC): no one presumes that the pilots committed suicide/murder. First, spatial disorientation of the pilots. This is a historical factor in the records of accidents in night takeoffs and landings in «black holes», such as over oceans. Second, a weather-related upset, say windshear of some kind causing LOC. Such phenomena are also known historical factors. It is understood that no technical defects have yet been identified, but I also understand that the investigation is not yet complete.

If spatial disorientation of the pilots had been a causal factor, this would have been avoided by full automatic control of the takeoff and after-takeoff manoeuvring.

(2) Afriqiyah was approaching RWY 9 at Tripoli, in clear weather but with reported «low, hazy visibility» (Learmount, op. cit.). «Information from the FDR and CVR indicates that there were no technical faults on the aircraft and fuel starvation was not an issue» (Learmount, op. cit.). Aviation Herald confirms this in its report, see in particular the update from the investigator’s information on 14 August. It impacted the ground heavily (even violently), some vertical distance below the approach path, indicating a high rate of descent. The impact was about 900m from the runway, according to Aviation Safety Net’s report. The ground in the area of the airport is more or less flat. Although the VOR was NOTAMed unreliable, there is an NDB approach to RWY 9. The aircraft is capable through GPS equipment and NDB reference of constructing a «Continuous Descent Approach» (CDA) path, which gives a more-or-less constant rate or angle of descent to the point of touchdown, constructed by the Flight Management Systems using the exterior navigation aids, and it would have been able to do that at this airport at this time, as far as is now known. If the aircraft had been on a CDA, it would have been at about 200 feet altitude at this point (the arithmetic: assuming 3° approach path, about one-in-twenty, and a touchdown point 300m from the runway threshold, the aircraft impacted about 1200m from touchdown point, at which point it should have been at 60m above touchdown zone elevation (TDZE)).
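The arithmetic in the parenthesis above can be checked in a few lines of Python. This is only a sketch: the function name, the use of an exact 3° angle alongside the one-in-twenty rule of thumb, and the metre-to-foot conversion are my choices for illustration; the distances are those quoted in the text.

```python
import math

FEET_PER_METRE = 3.28084

def height_on_path_m(distance_to_touchdown_m, path_angle_deg=3.0):
    """Height above touchdown zone elevation on a straight descent path."""
    return distance_to_touchdown_m * math.tan(math.radians(path_angle_deg))

# Distances quoted in the text: impact about 900 m before the runway,
# touchdown point assumed 300 m beyond the runway threshold.
distance_m = 900 + 300

exact_m = height_on_path_m(distance_m)   # using tan(3 degrees)
rule_of_thumb_m = distance_m / 20        # the "one-in-twenty" approximation

print(f"tan(3 deg): {exact_m:.0f} m = {exact_m * FEET_PER_METRE:.0f} ft")
print(f"1-in-20:    {rule_of_thumb_m:.0f} m = {rule_of_thumb_m * FEET_PER_METRE:.0f} ft")
```

Both versions put the aircraft at roughly 200 ft above TDZE at the point of impact, so the one-in-twenty approximation is adequate for the argument here.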

Automatics are capable of controlling the airplane to within tens of feet of a given path, and routinely do so (indeed, they must do so in certain flight phases, such as cruise in European RVSM airspace). Given that there were no technical issues identified with the aircraft by the investigation, and violent weather was not a factor, a fully-automated CDA would have landed the aircraft on the runway, or at least ensured it was not some 200 ft below where it should have been assuming a normal 3° continuous-descent approach path.

(3) Apparently the Air India Express Boeing 737 «landed on RWY 24 just beyond the touchdown zone, in fair weather with no rain. It overran the runway end and plunged into a ravine» (Learmount, op. cit.). According to the report by Aviation Herald, the runway has an ILS, required landing distance was 7500 ft and the runway length was 8100 ft. There is no word yet, to my knowledge, on possible causal factors.

This seems to have been a routine landing, with no compromising weather. Such landings are routinely accomplished fully automatically by the Global Hawk UAVs.

(4) The Airblue A321 had completed an ILS approach to RWY 30 at Islamabad, had turned right at low altitude and then left, to fly parallel to the runway. The crew is supposed at this point by many (with whom I currently concur, given the information available) to have been attempting a circle-to-land (CTL) manoeuvre, likely to land on RWY 12 (the reciprocal of the approach runway). CTL is a routine instrument flight rules manoeuvre, permitted from the ILS approach to RWY 30 as shown in this snippet from an approach plate, posted by «aterpster» in the PPRuNe discussion forum. In a CTL manoeuvre, the pilot, upon «obtaining a visual with» (i.e., seeing) essential parts of the runway or its environment, manoeuvres to land the airplane, provided the visual contact is continually maintained. If visual contact is lost, a routine «missed approach» manoeuvre must be immediately initiated. During the manoeuvre, the airplane must be flown within a given radius, just over 5 nautical miles, of a specified point on the airport. A diagram of this circling radius, overlaid on a plan of the airport and environment, appears in this post by «aterpster» in the PPRuNe discussion forum. A first approximation to the crash site, overlaid on a map with some of the navigation detail, including the CTL radius from a post by «aterpster», may be seen in this post by «PJ2», who updated his estimate of the approximate crash location some time later in this post. The crash site is reported by Aviation Herald to be about 10 nautical miles away, and in this early article in FlightGlobal, the WWW site of Flight International, to be 9.66 nm. The print version of the article (Flight International, 3-9 August 2010, p7) says 9.7 nm. There were reported to have been «no technical problems» in a later article in FlightGlobal. So the impact site was at about twice the allowable CTL radius.
The CTL radius encloses only flat land; the aircraft impacted «rising terrain», in other words a hill/mountain range nearby, but not so nearby as to constitute any danger to normal IFR operations.

There is a question, currently unanswered, as to why the EGPWS terrain-warning equipment did not enable the crew to manoeuvre to avoid the terrain.

Unlike a (presumed-)straightforward approach as at Tripoli, current commercial-aircraft automatics do not assist CTL manoeuvring in any reliable manner; the procedure should be hand-flown. However, it is a straightforward manoeuvre well within the capabilities of automatic control systems such as those on the Global Hawk to follow an ILS, and circle to land on the reciprocal runway, within the given limits. Automatics could have accomplished this manoeuvre without going outside the given CTL radius and therefore without a danger of impacting high terrain.

Furthermore, systems currently in test for the USAF, and shortly to become operational, perform automatic terrain-avoidance manoeuvres, even – especially – during the kind of low-level manoeuvring performed by military pilots. The system is called Auto-GCAS and was extensively reported and flight-tested recently by Aviation Week (August 2, 2010, pp50-57). Here is a short blog on it by Stephen Trimble of FlightGlobal from last year.

Some proponents of EGPWS have suggested that avoidance manoeuvres in commercial air operations be automatically initiated and flown. This is well within current capability, as shown by Auto-GCAS.

(I have mentioned anonymous writers above. Here is what I know of them. “PJ2” is someone I know, and with whom I have discussed accidents for a decade. He is a recently-retired captain for a major airline, where he was deeply involved in setting up the airline’s FOQA program. He is expert in aviation safety matters and I value his advice considerably. I do not know “aterpster”, but have read many public contributions by him. He self-identifies as a former airline pilot who has been officially involved in accident investigations as a designated representative of pilots’ organisations.)

(5) Initial reports of the AIRES accident suggest that the aircraft landed short, for example this report in Aviation Herald. Weather is reported by FlightGlobal to have included thunderstorms in the vicinity. Some commentators on the PPRuNe thread have suggested that the main gear was torn off upon reaching the runway hard surface, which is elevated slightly above the surrounding terrain (one imagines the wheels sinking into soft ground before the runway, and then impacting the hard runway construction).

It is not possible at this point to estimate the causal influence of the weather – one notes in the above references that the aircraft was reported to have sustained a lightning strike on final approach. But a landing of this sort to the TDZ is routine, even in stormy weather, for digital flight control systems. Providing, of course, they are sufficiently well insulated from the effects of a lightning strike.

(6) The Henan incident was also a landing-short, in reportedly benign weather – see for example the report in Aviation Herald – on a non-precision approach (NPA). The weather was reported as «foggy», but of course fog is incompatible with the kinds of atmospheric disturbances which might lead to control problems, and is not an issue for automatic control. A fully automatic landing was possible in these conditions, but not necessarily in the E190 accident airplane.

At this point, there is no public information about any technical problems with the flight. NPAs have been known for decades to be more accident-prone than precision approaches (ILS), but modern automation such as on the Embraer 190 can routinely perform CDAs, as discussed above with respect to the Afriqiyah accident.

None of the final reports is out, or expected yet, for any of these accidents. As things stand at present, the Ethiopian and AIRES accidents could have had the causal involvement of atmospheric disturbance; we don’t know. But other potential causal factors would have been mitigated if the manoeuvres had been performed fully automatically. In the case of the other four accidents, it seems quite reasonable to assert that they would likely have been avoided had the manoeuvres been performed fully automatically, which is outside the current capabilities of commercial-aircraft avionics but certainly within the routine capabilities demonstrated by Global Hawk UAVs and the USAF’s Auto-GCAS.

There are of course substantial safety issues with fully-automatic flight in civil airspace. It is correct to say that at this point it is not operationally feasible. For a recent review of some issues, see the forthcoming paper Computational Concerns About Integration….. by Johnson, to be read in two weeks at the SAFECOMP conference in Vienna.

So no one is yet suggesting, even for the medium term, pervasive fully-automatic commercial air transportation. But in light of the observations above concerning the six 2010 fatal accidents to large commercial jet aircraft, it does look as if it would be worthwhile to research whether standard approach and landing manoeuvres could be transitioned to routine fully-automatic execution.



The Internet as an Educational Tool

1 09 2010

Time was, we thought that people, students, who wanted answers to questions, could come to our office hours, ask, and be answered.

Then we thought that these people could pose these questions to bulletin boards and forums on the Internet, and get answers from all sorts of people, answers which were at least as good as, and maybe even better than, what they could get from us in our office hours.

How wrong we were! For an example of what happens when someone like me attempts to answer a question as if it were posed as a technical question to me during my office hours, see this thread on PPRuNe.

For background, “BOAC” is an experienced, wise, and mostly thoughtful pilot who flew Lightnings for the RAF (a wonderful and singular machine, indeed the only aircraft which demonstrated it could outperform Concorde) and most recently Boeing 737 machines. “Pugilistic Animus” is someone who in my estimation has at least a graduate’s grasp of aerodynamics, and likely more – hard for me to tell (but he could, if he chose).

In traditional educational circles such as I have experienced since the 1970s, the questioner would have posed the question, it would be answered as per my reply, and everyone would have gone back relatively satisfied to whatever they were doing. Handling the question via the forum, and parrying the denigrations so that the questioner, if he/she was still reading, could be more or less satisfied that the original answer was trustworthy, seems to have taken me at least four times as long, and who knows what the questioner makes of the interactions.

So, what conclusion do you, the reader, draw for the future of education via open Internet discussion forums? Please let me know, for I would dearly like it to work somehow, but this example does not give me hope.

PBL



Malware and the August 2008 Madrid Spanair Take-Off Accident

27 08 2010

On 20 August 2008, an MD-82 aircraft of the airline Spanair crashed on takeoff (TO) from Madrid-Barajas airport. The high-lift devices on the wing had not been properly configured to give the necessary lift on takeoff, and the aircraft was unable properly to lift off as planned. See Aviation Safety Net’s report of this accident for more details.

There had been a maintenance issue during a previous attempt at departure, and maintenance personnel had addressed this issue. In effecting the repair, however, the takeoff configuration warning horn, which aurally warns the crew that the high-lift devices are not appropriately configured for takeoff, had also been disabled. The crew is required, in the pre-take-off check list which they have to perform, to check that the aircraft is appropriately configured for takeoff, and it seems that they did not do so at the second departure: they performed some of the items, but not the full list.

Spanair uses a ground-based computer to process aircraft logs for maintenance issues. The fault which caused the accident aircraft to return to the gate had apparently occurred more than once the previous day, and been logged. But the press has recently reported that malware in this computer delayed the processing of reports, and so maintenance was not aware of the problem the previous day, when they would have been able to correct it, before the fated flight. The press reports have thereby connected this malware with the accident. See, for example, a summary in English of the reports by Daniel Johnson on the University of York Safety Critical Systems Mailing List.

Brian Reynolds commented on these reports that “This is totally bogus” and clarified that he meant that it is “totally bogus” “[t]hat a virus or Trojan in a ground maintenance computer is casually related to this incident.”

Reynolds seems to be denying the claim that malware in a ground-based maintenance computer is causally related to the accident. But he omitted to say what his criterion for causal-relatedness is.

I have one: the concept of necessary causal factor, proposed in 1973 by the philosophical logician David Lewis, who credits the concept to David Hume (his “second definition” of cause). I took over Lewis’s semantics 15 years ago for use in failure analysis.

According to this semi-formal, objective notion of causal factor, there is demonstrably a chain of causal factors leading from the presence of the malware to the accident. According to this concept, Reynolds is provably wrong.

So now let me show this.

Here is the Counterfactual Test:

Let A and B be events or states.

A is a necessary causal factor in the occurrence of B just in case:

If A had not occurred, then B would not have occurred.

This last sentence is called a counterfactual (or contrary-to-fact) conditional. “Conditional” comes from the “if…then…” form; “Counterfactual” from the fact that A and B did as a matter of fact happen, and one is supposing what the world would then have been like had A not occurred. In order to determine this, I adapt the Lewis semantics: suppose A had not occurred, but the world stayed otherwise as similar as possible to the actual state of affairs that pertained. Did B occur in this possible state of affairs? Most often, we cannot answer with absolute certitude “yes” or “no”, but it turns out that we can most often answer “most likely, yes”, or “most likely, no”. The Counterfactual Test is to ask this question I just posed. If the answer is “most likely, no” (without A, B would not have occurred), the Counterfactual Test is “passed” and A is a necessary causal factor of B. If the answer is “most likely, yes” (B would have occurred anyway), then A is not a necessary causal factor of B. We have found the Counterfactual Test to be very useful in complex engineering failure analyses.

To show a causal connection between the presence of malware on the maintenance computer and the accident, here are five instances to check with the Counterfactual Test:

1. Had the malware not been present, the fault causing the phenomenon would have been noted by maintenance personnel in a timely manner (let us say: at latest, end of the previous day).
2. Had the fault causing the phenomenon been noted by maintenance personnel in a timely manner, it would have been appropriately repaired before the accident flight.
3. Had the fault been appropriately repaired before the accident flight, the TO-configuration warning would have sounded on the accident flight.
4. Had the TO-config warning sounded during TO on the accident flight, the TO would have been aborted when the warning sounded and the aircraft properly configured before subsequent TO.
5. Had the TO been aborted when the warning sounded, the aircraft would not have crashed as it did.

I consider all of these counterfactuals to be true according to the Lewis semantics. It follows:

1a. The presence of the malware was a necessary causal factor in the lack of timely awareness of the fault.
2a. The lack of timely awareness of the fault is a necessary causal factor in lack of timely repair.
3a. The lack of timely repair is a necessary causal factor in the TO-config warning inhibition.
4a. The TO-config warning inhibition is a necessary causal factor in continuing TO to loss-of-control.
5a. Continuing TO to loss-of-control is a necessary causal factor in the accident.

So, there is a chain of six causal factors, chain-length five, connecting the presence of malware to the accident. QED.
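For readers who like to see such a chain mechanically, here is a minimal sketch in Python. It is not how the SERAS tools work; the factor labels are my paraphrases of 1a to 5a, and the function name `causal_chain` is hypothetical. Each pair records one counterfactual that I judged true above, and the search simply follows those links from the malware to the accident.

```python
# Each pair (cause, effect) records one necessary-causal-factor link (1a-5a).
# The labels are paraphrases chosen for illustration.
NECESSARY_CAUSAL_FACTORS = [
    ("malware present", "no timely awareness of fault"),
    ("no timely awareness of fault", "no timely repair"),
    ("no timely repair", "TO-config warning inhibited"),
    ("TO-config warning inhibited", "TO continued to loss of control"),
    ("TO continued to loss of control", "accident"),
]

def causal_chain(links, start, goal):
    """Return a list of factors leading from start to goal, or None.

    Depth-first search over the directed graph defined by `links`.
    """
    successors = {}
    for cause, effect in links:
        successors.setdefault(cause, []).append(effect)

    def search(node, visited):
        if node == goal:
            return [node]
        visited.add(node)
        for nxt in successors.get(node, []):
            if nxt not in visited:
                rest = search(nxt, visited)
                if rest is not None:
                    return [node] + rest
        return None

    return search(start, set())

chain = causal_chain(NECESSARY_CAUSAL_FACTORS, "malware present", "accident")
print(" -> ".join(chain))
print(len(chain), "factors,", len(chain) - 1, "links")
```

The search finds six factors joined by five links, and it finds no chain in the reverse direction, from the accident back to the malware. Whether each individual link holds is of course a matter for the Counterfactual Test, not for the program.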

I emphasise, just to avoid misunderstanding, that these are by no means the only causal factors relevant to the accident: that the crew failed adequately to perform the pre-takeoff check list on the accident flight is most certainly a necessary causal factor in the loss of control. The reader is invited to try out the Counterfactual Test to assure him/herself of this.

Applying the Counterfactual Test rigorously throughout the list of potentially-relevant factors, to see which ones are indeed causally relevant and which not, is the core of our analysis method Why-Because Analysis (WBA). For those interested in seeing relatively quickly how we perform WBAs nowadays, there is available a case study on how to perform a WBA using the SERAS Reporter and SERAS Analyst tools. Here is some general info concerning our experience with Why-Because Analyses. Typically, depending on the level of detail provided by the investigation, a detailed causal analysis (which we represent in graphical form as a Why-Because Graph, WBG) ends up showing a hundred to a couple of hundred individual factors, of which a quarter to a third are “root-causal factors”, that is, causal factors which are not regarded as themselves having pertinent causes. So WBA also includes a fair amount of bookkeeping, or “complexity control”, or whatever one wants to call it. For example, given a WBG with a couple of hundred items, one would assemble these causal factors into a small number of subgroups, and give these subgroups appropriate titles, to provide an “executive summary” of the analysis. The SERAS Reporter and SERAS Analyst software is available as freeware from Causalis Limited.
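The notion of “root-causal factor” is easy to state operationally: it is a factor that appears in the graph as a cause but never as an effect. The following Python sketch illustrates this on a toy fragment of the Spanair factors mentioned in this note. The labels and the function name `root_causal_factors` are mine, chosen for illustration; this is not the SERAS software, and a real WBG would pursue the causes of these “roots” further.

```python
# Toy Why-Because Graph fragment: (cause, effect) pairs.
# Labels are illustrative paraphrases of factors discussed in the text.
WBG_LINKS = [
    ("malware present", "no timely awareness of fault"),
    ("no timely awareness of fault", "no timely repair"),
    ("no timely repair", "TO-config warning inhibited"),
    ("TO-config warning inhibited", "TO continued to loss of control"),
    ("pre-takeoff check list not completed", "TO continued to loss of control"),
    ("TO continued to loss of control", "accident"),
]

def root_causal_factors(links):
    """Factors that occur as a cause but never as an effect."""
    causes = {cause for cause, _ in links}
    effects = {effect for _, effect in links}
    return sorted(causes - effects)

roots = root_causal_factors(WBG_LINKS)
for factor in roots:
    print("root-causal factor:", factor)
```

In this fragment the malware and the incomplete check list come out as roots only because the fragment stops there; in a full analysis, what counts as a root depends on where the investigation stops asking “why?”.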

We can well expect a full WBA of the Spanair accident to contain between a hundred and a couple of hundred factors.



Understanding Aerodynamics of Stalls

28 07 2010

Recently, most commercial transport airplane manufacturers have been revisiting their FCOM procedures for “stall recovery” (actually, procedures for preventing an approach to stall from turning into a stall). This may be related to the spate of recent accidents in which commercial airplanes have been stalled: Colgan Air in Buffalo, Turkish Airlines in Amsterdam, XL Airways in Perpignan. Such a spate of loss-of-control (LOC) accidents is a sudden new development in aviation accident statistics. People are concerned it might signal a trend and are looking for possible causes of this trend, if it is one.

A discussion on such matters started on the Professional Pilot’s Forum PPRuNe on a thread with the ungrammatical title of New (2010) Stall Recovery’s @ high altitudes. I agree with the moderator, who goes by the handle of John_Tullamarine, that the discussion has been stimulating, although I had my doubts as it started, which readers of the thread may observe.

The discussion has been enlightening in a number of respects. One aspect which startled me is the degree of understanding of stall – what it is, when it happens, and its functional relationship with other aerodynamic parameters. The stall is one of the most important, if not the most important, phenomena with which pilots must cope (preferably by avoidance). As is buffet. I conclude that such understanding amongst line pilots could be easily improved. There are graphs used by aerodynamicists; they are all more or less the same shape no matter what the airplane. You can find them in any intro-aero book, say John D. Anderson Jr.’s Introduction to Flight, or Richard Shevell’s Fundamentals of Flight, without any numbers on them, as well as in many FCOMs with numbers on them. Why are they not studied in type training and the knowledge tested?

It could be – has been in the thread – argued that “pilots don’t need to know” such things. As off and on a professional educator for the last few decades, I have participated in enough discussions about what technical practitioners need, respectively don’t need, to know. I have seen what happens, at enough places. We as a profession are now – have been for a decade or two – giving computer science degrees to people who can’t really program very well, at least not according to the standards we used to have. Do I think this is – ever – a good idea? No, I don’t. Do I think people should be professionally flying complex airplanes without understanding the aerodynamics presented in the FCOM? No, I don’t. Although I imagine not everyone will agree.

My practical philosophy of education is as follows. My default answer to what people need to know is: everything. That said, there are practical limitations (of ability, of time) which entail prioritising knowledge and intellectual skills in using that knowledge (which, while related, are not the same thing).

Can we reduce such knowledge to algorithms, to operational instructions, as has been suggested in the thread: “if this happens, do Y”? I am sceptical. Choosing the correct action as a pilot requires appropriate situational awareness.

There was, for example, considerable debate about tail-plane stalls and training syllabuses following the Colgan Air upset in Buffalo. Colgan Air used a NASA video about stalls in icing as a training aid, and this video emphasised so-called tailplane stalls due to icing, for which the remedy is apparently inconsistent with the action required for a main-wing stall. The Q400 in the Buffalo crash is not at all susceptible to tailplane stalls, and is equipped with a stick pusher to prevent main-wing stalls. However, the pilot flying pulled the stick back, overpowering the stick pusher repeatedly, which is exactly the wrong thing to do if the main wing is on the point of stalling (it is, after all, the purpose of the stick pusher to do the right thing in this situation) but might be appropriate for tailplane stalls. It was therefore questioned whether the pilot had appropriate awareness of the aerodynamic situation the aircraft was in, and it has been concluded that he did not.

Having appropriate situational awareness requires understanding the phenomena, as well as understanding the limits of understanding. One needs, for example, to distinguish between what some people call “stalled” (namely, at or just beyond the maximum value C_L_Max of the coefficient of lift, C_L, when large parts of your wing may still be flying) and “fully stalled” (when none of your wing is flying). The question arose in the thread whether one may use ailerons to lift a dropped wing at stall. The obvious answer is that you can if that part of the wing is still flying, but you most definitely should not if it is not. How do you tell which situation you are in? Trying and seeing is not a wise option.

Consider, for example, FCOM procedures concerning “stall warning” on a popular large airplane after lift-off (A330, 3.04.27 P 5a):

THRUST LEVERS ….. TOGA
At the same time:
PITCH ATTITUDE …..REDUCE
BANK ANGLE………..ROLL WINGS LEVEL
SPEED BRAKES……..CHECK RETRACTED

This assumes that you are at high angle of attack (AoA) but not yet stalled, and that the ailerons are still flying. The stall warning may also go off at high altitude, in which case: “relax the back pressure on the sidestick and reduce bank angle, if necessary”. In both cases, it is assumed that the wing is flying, but that bits of it are telling you they might not for much longer, and you need to back away from that point. These procedures obviously won’t help much at all if your nose gets to be way up in the air at 45°-60° of pitch, as happened at Perpignan with a related airplane.

The answer to telling which situation you are in is probably found in a good intuitive understanding of the aerodynamics in the FCOM, and for that one needs a good basic understanding of aerodynamics in general. One illustration of this is the suggestion that was made in the thread on a potential means of discriminating stall buffet from Mach buffet: the feeling of the frequency of the buffet.

This also illustrates the limitations of simulation, a topic on which it seems not all thread contributors are clear. Many people still seem to think that flight simulators, including the expensive moving kind used for airline pilot training and recurrency, are veridical around upset scenarios. How on earth do these people suppose simulators can reproduce veridical buffet? That some aerodynamicist has sampled the frequencies of buffets in the wind tunnel, and given that to some simulator programmer to reproduce, as well as some engineer to make sure that none of it coincides with the resonant frequencies of the simulator? And most of those wind tunnel models that generate the data fly without horizontal tail pieces; what is the effect of the tail? Mostly, one doesn’t actually know, but extrapolates from one’s experience as an aerodynamicist. I feel that a basic understanding of aerodynamics would cure many illusions about the veridicality of flight-simulator behavior outside the normal flight envelope.

Whatever one thinks about what pilots should know or not know, it seems to me a good idea to clean up the vocabulary, suggested through the following examples.

“Stall” is a term of art: for example, sometimes it means the same as “at C_L_Max”, and sometimes it means “the point at which buffet is severe enough to discourage further increase in AoA” (cf. the definitions used in the airworthiness certification document, CS 25). Does being at or over the stall mean you have no lift? No, actually you may have more lift than in most other regimes of flight (just over C_L_Max) even though you might be shaking severely, or you may have much less (AoA way over that for C_L_Max).

Another terminological inexactitude resides in the terms “low-speed stall” and “high-speed stall”. The first refers to the situation in which the AoA is too high for the speed; the latter often refers to a transonic overspeed situation, in which lift is reduced because of the formation of shock waves over certain parts of the wing. Because these waves form at or near the leading edge, they reduce lift forward and thereby move the center of aerodynamic lift rearwards, leading to a nose-down moment about the center of gravity of the aircraft, which gives nose-down pitch or “Mach tuck”. Use of this terminology leads one to the anomalous-sounding phenomenon of the “low-speed” stall at “high-speed”. Maybe it would be preferable to say “high-alpha stall” or “high-AoA stall” rather than “low-speed stall”, and to use the word “transonic” rather than “high-speed” to indicate effects of shock waves on lift?

Another vocabulary hang-up occurred in the discussion on the thread of V_s1g, or stall speed at 1g. Is it a constant speed or not? If not, with which aerodynamic parameters does it vary?

V_s1g would occur when lift at C_L_Max is equal to weight (W). Lift = q x S x C_L, where q is dynamic pressure and S is an area term usually taken to be the area of the wing planform. So at V_s1g, W = q x S x C_L_Max. Given that q = ½ x density x V^2, we can solve for V: V_s1g = Sqrt( (2 x W)/(density x S x C_L_Max) ). S is obviously constant for a given airplane. What about C_L_Max? If you can ignore compressibility effects (i.e., below about 0.3 Mach for most wings) then C_L_Max is effectively constant, as is the AoA at which C_L_Max is achieved.

Now consider density. Air density obviously varies with altitude, indeed with the properties of the atmosphere on the day and at the place. So if one wants V_s1g to represent the true airspeed (TAS), then this obviously varies, but with a bunch of parameters not measurable with equipment on board most commercial aircraft. However, aerodynamicists like to talk Equivalent Air Speed (EAS), in which inter alia density is defined as sea-level standard-atmospheric density, 1.225 kg per m^3 (kg.m^(-3)).

So V_s1g, as EAS, varies only with (the square root of) weight. Weight obviously varies (with load, fuel burn and so on) but it is not an aerodynamic parameter, and is usually considered constant when talking aerodynamics. It follows that V_s1g, expressed as EAS, is constant.
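To make the formula concrete, here is a minimal sketch of the calculation in Python. The airplane numbers (mass, wing area, C_L_Max) and the density ratio are purely illustrative assumptions, not taken from any FCOM, and compressibility is ignored as in the derivation above. The function names are mine.

```python
import math

RHO_0 = 1.225   # sea-level standard-atmospheric density, kg/m^3 (as in the text)
G = 9.80665     # standard gravity, m/s^2

def v_s1g_eas(weight_n, wing_area_m2, cl_max):
    """1g stall speed as EAS, in m/s: solve W = 0.5 * RHO_0 * V^2 * S * C_L_Max for V."""
    return math.sqrt(2.0 * weight_n / (RHO_0 * wing_area_m2 * cl_max))

def eas_to_tas(v_eas, density):
    """True airspeed corresponding to a given EAS at the given air density
    (incompressible approximation)."""
    return v_eas * math.sqrt(RHO_0 / density)

# Hypothetical airplane: all three values are assumptions for illustration only.
mass_kg = 200_000.0
wing_area = 360.0   # m^2
cl_max = 1.5

v_eas = v_s1g_eas(mass_kg * G, wing_area, cl_max)
v_tas = eas_to_tas(v_eas, 0.8 * RHO_0)   # assumed density: 80% of sea-level value

print(f"V_s1g as EAS: {v_eas:.1f} m/s")
print(f"V_s1g as TAS at 80% sea-level density: {v_tas:.1f} m/s")
```

The EAS figure depends only on weight (for a fixed airplane), while the TAS figure moves with the density of the day, which is the distinction drawn above.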

However, V_s1g, as indicated, say, in the A330 FCOM (3.01.20 P7) is expressed in Calibrated airspeed (CAS), which is the pitot-static-measured airspeed corrected (usually digitally) for the effects of how the sensors are positioned in the air stream, and expressed in CAS there is a correction for pressure altitude, starting at about 20,000 ft for lighter weights, and going down to about 5,000 ft for heavier weights.

So, as a “practical” matter, is V_s1g (at fixed weight) constant or not? As an aerodynamicist, liking EAS, one would say yes; as a pilot, preferring CAS because that is what one sees on the airspeed indicator, one would say no. That could be a source of genuine confusion at times.

A more obvious but less insidious vocabulary hang-up is Mach number. Is it a speed? Strictly speaking, no. It has no units (it is a ratio of speeds: airspeed to the speed of sound, which varies with air temperature); whereas speeds have units of length per time unit (m.s^(-1) or ft.s^(-1)). However, in response to a question “how fast were you going?” one might well respond “at 0.8 Mach”, and indeed Mach is used in preference to airspeed to adjust for many situations at high altitude. For example, limiting dive is expressed as both speed and Mach number, as is turbulent-air penetration, maximum operating (max. cruise), and so on.
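A small sketch shows why the same Mach number is not the same speed. It uses the standard ideal-gas relation for the speed of sound, a = Sqrt(gamma x R x T); the two temperatures are the standard-atmosphere values for sea level and the lower stratosphere, chosen by me for illustration, and the function names are my own.

```python
import math

GAMMA = 1.4      # ratio of specific heats for air
R_AIR = 287.05   # specific gas constant for dry air, J/(kg.K)

def speed_of_sound(temp_k):
    """Speed of sound in m/s at absolute temperature temp_k (ideal-gas relation)."""
    return math.sqrt(GAMMA * R_AIR * temp_k)

def mach_to_tas(mach, temp_k):
    """True airspeed in m/s for a given Mach number and air temperature."""
    return mach * speed_of_sound(temp_k)

# Illustrative temperatures: standard sea level and standard lower stratosphere.
for label, temp_k in (("288.15 K", 288.15), ("216.65 K", 216.65)):
    print(f"0.8 Mach at {label}: {mach_to_tas(0.8, temp_k):.0f} m/s")
```

The same “0.8 Mach” corresponds to noticeably different true airspeeds at the two temperatures, which is the sense in which Mach is a ratio and not a speed.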

Other vocabulary hang-ups occurred in the thread when talking about “approach to stall recovery” and “stall recovery”, and these I feel are insidious. Some correspondents (including the thread originator) insist they have been practicing “stall recovery” in an airplane with a stick pusher, despite the obvious point that if the airplane has a stick pusher and you respect the pusher, it is not stalled. Indeed, many “stall recovery procedures” are more accurately described as stall avoidance procedures, or approach-to-stall recovery. Surely such confusion would be resolved through a little aerodynamical knowledge and some common sense about safety-system design?

One correspondent, when asked repeatedly whether he thinks that test pilots have been going up and stalling Airbus airplanes, in order to rewrite the “stall recovery” FCOM procedures (actually “Stall Warning” in those for the A330 referred to above) and to calibrate simulators, wisely declined to answer. As a veteran pilot, with the handle 411A, said, “Has anyone here actually stalled a large swept-wing airliner? I[f] so, what were the results?” Another, Airclues, replied “In the early 80’s I was co-pilot on several C[ertificate] of A[irworthiness] air tests on the Boeing 747 when a full stall was completed (I believe that the UKCAA was the only authority that required this)” and described his experiences. In other words, actually high-alpha-stalling large commercial aircraft, even for certification, is ancient history. I very much doubt it was done just to rewrite stall avoidance procedures and calibrate simulators.

A useful discussion indeed, but I suspect it will take more than a pilots’ forum thread to sort these issues out.



Risk Assessment of Volcanic Ash to Commercial Aviation

28 05 2010

Paul Marks of the New Scientist has a couple of good recent articles on the volcanic-ash problem for commercial aviation, one from today and one from last week.

I talked about a simple calculation of this risk in my Risk course this morning, since it is topical, it shows practical issues well, and it fits in about an hour’s lecturing (with anecdotes). It seems that few people want to or can perform an elementary risk calculation about flying in the volcanic ash from Eyjafjallajökull. Here goes. It’s very crude, but still leads to some insight.

Let us classify first the outcome categories per flight. I choose four:

1. No damage
2. Engine needs thorough inspection and cleaning
3. Engine needs major overhaul
4. Engines stop in flight.

All of these have happened: 1 to the majority of recent airline flights; 2 to a couple of Ryanair planes, and to the Finnish F-18s that had an encounter on April 15, the day before the first ban, reported here previously; 3 to the (in)famous NASA DC-8 (at a cost of $3.2m, so one reads); 4 to Eric Moody on the famous BA 747 in 1982.

One can almost directly read off the severity from these. Let us consider units to be equivalently pounds or euros or dollars. The sign “^” means “to the”, the exponential. So, e.g., 10^4 = 10,000, 10^6 = 1,000,000.

Severity of events (event classes) 1-4
1. 0
2. 10^4 to 10^5
3. 10^6 to 10^7
4. If a catastrophe is caused (i.e. the airplane does not succeed in making a dead-stick landing on an airport) then 10^8-10^9

It is curious that these four categories fit so crudely but neatly into powers of 10, covering the range.

So the risk is (the old De Moivre definition from 1711):

probability(1).severity(1) + probability(2).severity(2) + probability(3).severity(3) + probability(4).severity(4)

In fact, this is only a crude estimate of severity, since if some engine is found to be damaged, then all engines on all airplanes flying into or from those airports that engine flew into and around those routes that engine took will have to be inspected as well, and that might run into the hundreds. This calculation does not take account of these knock-on effects.

Using severity(1) = 0, the risk per flight then lies between

10^4 x prob(2) + 10^6 x prob(3) + 10^8 x prob(4)

and
10^5 x prob(2) + 10^7 x prob(3) + 10^9 x prob(4)

(using the factors of ten associated with the severity ranges)

Consider your average intra-European flight, say Air Berlin flying Paderborn-London Stansted. Boeing 737NG, let’s say 150 people on board (this is an overestimate), paying €100 per seat (actually, it’s lower, and much of that is airport tax). Your revenue for the flight is at most €15,000 (and a lot less if you take out airport tax). So your expected value of loss, the risk, above, must be less than this if you hope to do better than by not flying. So your decision criterion is

10^4 x prob(2) + 10^6 x prob(3) + 10^8 x prob(4) < 15,000

if you take the lower estimate of risk, and

10^5 x prob(2) + 10^7 x prob(3) + 10^9 x prob(4) < 15,000, that is

10^4 x prob(2) + 10^6 x prob(3) + 10^8 x prob(4) < 1,500

if you take the higher.

Let us take the lower estimate. You can handle a cleaning event without much trouble, but you had better be sure, to break even, that you have at most one chance in just over 60 flights of an overhaul event, and only one chance in just over 6,000 flights of an engine-out event.
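For anyone who wants to play with the numbers, the calculation can be sketched in a few lines of Python. The severities and the revenue are the crude figures used above; the function names, and the sample probabilities in the last lines, are my own illustrative assumptions and not estimates of the real likelihoods.

```python
# Severities per event class (class 1 has severity 0 and drops out).
SEVERITY_LOW = {2: 1e4, 3: 1e6, 4: 1e8}
SEVERITY_HIGH = {2: 1e5, 3: 1e7, 4: 1e9}
REVENUE = 15_000  # upper bound on revenue per flight, from the text

def risk(probabilities, severities):
    """Expected loss per flight: sum of probability x severity over event classes."""
    return sum(probabilities[k] * severities[k] for k in severities)

def break_even_flights(severity, revenue=REVENUE):
    """Number of flights per event at which that event class alone eats the revenue."""
    return severity / revenue

print(f"overhaul (lower estimate): one in {break_even_flights(1e6):.0f} flights")
print(f"engines stop (lower estimate): one in {break_even_flights(1e8):.0f} flights")

# Hypothetical probabilities, purely to show the decision criterion at work.
example = {2: 0.01, 3: 0.001, 4: 0.00001}
print("fly?", risk(example, SEVERITY_LOW) < REVENUE)
```

With the lower severities, the break-even points come out at roughly one overhaul event in 67 flights and one engine-out event in 6,700; with the higher severities they are ten times less forgiving.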

Given what was known on April 16th about outcomes (for example, that the Finnish engines might be trashed), I wonder how much of what we heard from airline chiefs complaining about not being able to fly was political manoeuvring for government handouts to “compensate” them for being forced to do what a risk analysis would have told them to do anyway?

PBL



Oxford Up There Again

17 05 2010

The Times has written a blog-article on the proportion of the new UK government who went to Oxford (in fairness, I must point out that some proportion went to the Other Place, which is also rumored to be quite good). A perennial topic. I enjoyed reading the comments. But then I wondered whether the question could be seriously answered, and decided to have a go. (People may see the beginnings of an answer right there.)

My first degree: Oxford; my others: UC Berkeley (also a top tenner). I taught very briefly at Stanford, have worked at unis in Switzerland, France, Scotland and, for 15 years now, in Germany. I think I have a basis for comparison.

I felt like an outsider in Oxford, oppressed by the pressure of trying to achieve, and feeling that I wasn’t up to it, a feeling that it took me another decade to learn to ignore. But 40 years later, most of my matriculating class, including 5 of 7 maths people, turn up every few years for the reunions (one has died, and the Wykehamist disappeared during the course – is that what they learn there? Good prep for a career in offshore finance, I would think :-) ). Two of us turn up for the Maths Institute Garden Party every so often. Contact with pals at UC Berkeley lasted longer, for I was there more than twice as long; but I make only occasional email contact with one or the other.

Last year, there was a reunion to celebrate 40 years of my Oxford degree course, in Maths and Philosophy, and lots of people turned up, including all those still alive – two of them octogenarians – who were responsible for setting it up, as well as all the holders over the decades of the associated Chair. There were more intense discussions over those two days than I have experienced at most conferences. It was my most delightful intellectual experience of the last decade (here, my heartfelt thanks to Hilary Priestley, Dan Isaacson, and Jochen Königsmann for organising it!).

Compare. My Bielefeld colleague, Ipke Wachsmuth, a delightful man whom everybody likes, just celebrated his 60th with a symposium and fun party at which he played lots of blues harmonica (about which I learnt that the hard part is picking the kit that fits the tune). Lots of people there, but just four of us Informatics faculty, out of fourteen. And the oldest of us is 60 (him). It ain’t the same as in Oxford.

So what are the factors? First, some commentators said “contacts”. It is more than that. The college system somehow fosters bonds of shared experience, which may well directly benefit those who go into banking, law, or politics, which are all about trust. (Unlike Maths, which is about being faster than the next guy or gal, and Philosophy, which is about calling other people idiots as politely as possible, a skill only half of which I learnt.)

In the case of those in my small degree program, it is also a matter of shared intellectual value – value fostered by the founders, John Lucas, Sir Michael Dummett, the late Robin Gandy; the first holders of the chair, Dana Scott, Angus Macintyre; and Robin’s successor Alex Wilkie. People at the very top of their field, world-wide, with whom undergrads like me could sit down to tea and discussion a few times a week. That doesn’t happen elsewhere to anywhere near the same extent. But maybe it is invalid to generalise from my degree program to all those at Oxford.

Second, the tutorial system of teaching is unique and structurally supports “thinking outside the box”, if that’s what you can and want to do. The work is much less routine than, say, handing in the weekly homework exercises for a Stanford course, and it is always demanding because tutors tailor it to you – they have to do so, to keep their own interest up. Ah, yes, those tutors who spend more time teaching fewer students than any academics anywhere else in the world. Thank you, people, for your devotion! In my case, especially Ian Macdonald, who encouraged my interest in logic and encouraged me to switch to the “right” degree course, the eccentric Mark Broido, who set the hardest problems but pointed out that the only person who really cared if I solved them was me, an important lesson for a 19-year-old expecting to be told what to do, and Ralph C.S. Walker, who talked me down for two hours after I had royally blown the first paper in Finals, thereby enabling me to do passably on the rest.

Finally, a more diffuse factor. Do I care that the Nobel-Memorial Econ went to UCB last year? A little bit, yes, as with all those other Nobel Prize/Fields Medal/Turing Award winners there. Do I care that 4 out of 24 current UK cabinet members, plus the Attorney General, went to my Oxford college, Magdalen? Yes, most definitely. I am quite proud, even though I know none of them. So, third, the system seems to foster pride in one’s notionally shared common experience. That is a main bonding mechanism in successful governments, isn’t it? And how many experiences in life foster that? Like it or loathe it, it could be a factor. It seems to happen in France, too.

BTW, I am far more Whig than Tory and I guess the cabinet’s now both – does one say Whory, or is that too rude?
BTW, II, the Other Place is organised more or less the same way, so similar observations hold, but of course just not as well…..
BTW, III, someone pointed out that “the Other Place” should be capitalised. Maybe; I’ve done so. Sorry.

PBL



The Political Economy of Volcanic Ash

28 04 2010

The Economist has of course a Briefing on the Effects of the Ash Cloud from Eyjafjallajökull on the political economy of flight, which informs its lead commentary in the April 24th 2010 edition, about this incident, entitled Earthly Powers.

Both articles recount that the “safe level” of ash, as determined by the CAA (in Britain, but in fact the measure was coordinated across the continent), started out at zero, when the flight restrictions were first imposed on Thursday April 15th. It was then changed on Wednesday April 21st to 2,000 micrograms per cubic meter. The Economist regards it as “suspicious” that the level was changed “in the face of an affluent cadre of displaced people, airlines feeling the pinch, a looming threat to some supply chains, and (in Britain) an election.” I don’t regard it as “suspicious” – I think, given the evolution of knowledge and experience, the sequence of administrative events was both coherent and justified, with the following caveat. The newspaper suggests, correctly, that how the new level was determined “is not clear”. The CAA apparently says it was set on the basis of data from equipment manufacturers, but no public data has been made available, and I agree with The Economist here that “Regulations without a clear and open argument behind them are worrisome”.

The state of knowledge about the safety of commercial airline operations as the situation evolved is well summarised by David Learmount in his blog entry of Monday, April 19th. I agree with much of what David says, and I think it serves to allay “suspicions” of administrative mismanagement of the event, such as hinted at by The Economist. The amount of uncertainty at that point on Monday of the risks involved, both likelihood and severity, was enormous. [Added 29.04.2010: I find David's article in Flight International, 27 April - 3 May 2010, pp8-9, largely identical with his 24 April article in Flightglobal on the subject, a careful recounting of the safety aspects of the event.]

By Tuesday, 20th April, the ash had confined itself to lower flight levels; upper airspace was freed for flight, and by Wednesday 21st April new guidance had been issued and implemented. I still think that shows an exemplary reaction to the situation.

Now let’s look in a little more detail at the political economy involved. I had suggested in a note to the York Safety-Critical Mailing List, probably somewhat arrogantly, that people didn’t seem to be “conversant with probability or decision theory”. A respondent, Chris Hills, eminently confirmed my suggestion with his line of argument.

The Finnish Air Force went on a training sortie on Thursday 15 April and suffered apparent damage to some engines. FlightGlobal doesn’t say how long they were up for, but one might guess it was on the order of an hour. Recall from Learmount’s blog note that, on Monday 19th, it was not yet known what the severity of damage was to the Finnish engines – Learmount suggested they “may never power an aeroplane again”.

Suppose you are the CEO of an airline that wants to fly in closed airspace. Air Berlin, for example, takes in about €90 per passenger per flight from Paderborn to London Stansted if you book shortly before flying, a flight time of about an hour, and they use standard workhorses, which for trips inside Europe are the twin-engine Airbus A320 series and Boeing 737 series, with seats for between 150 and 200 passengers. The engines put out, I think, about three times as much thrust each as the military engines, but they are higher by-pass (meaning cold air which is propelled around and not through the core of the jet engine). Simple arithmetic shows us that the airline is taking in less than €20,000 for the Paderborn-Stansted flight. The cost of an engine rebuild or new engine (and, when one, then both!) lies well in the seven-figure range (I don’t know quite how much it might cost). That is, two orders of magnitude higher than the five-figure sum you are taking in. And until Monday 19th, after the research flights, no one really knew at what flight levels the ash was to be found. So, at a first guess, just to break even in monetary outlays only one flight in a hundred can have such problems. Or, to put it another way, if just one plane on that route has problems, then you have to have another 24 days of problem-free flying that route (two flights a day in each direction) to break even.
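The break-even arithmetic can be sketched as follows. The revenue figure is the upper bound from the paragraph above; the engine cost of €2,000,000 is my assumption, picked only to land in the “seven-figure range” and give the two-orders-of-magnitude ratio, and the function names are mine.

```python
import math

REVENUE_PER_FLIGHT = 20_000   # euros, upper bound from the text
ENGINE_COST = 2_000_000       # euros, ASSUMED seven-figure cost for illustration
FLIGHTS_PER_DAY = 4           # two flights a day in each direction

def flights_to_recoup(cost, revenue_per_flight):
    """Number of flights whose entire revenue is needed to cover one loss."""
    return math.ceil(cost / revenue_per_flight)

def days_to_recoup(cost, revenue_per_flight, flights_per_day):
    """Days of flying the route needed to cover one loss."""
    return flights_to_recoup(cost, revenue_per_flight) / flights_per_day

print(flights_to_recoup(ENGINE_COST, REVENUE_PER_FLIGHT), "flights")
print(days_to_recoup(ENGINE_COST, REVENUE_PER_FLIGHT, FLIGHTS_PER_DAY), "days")
```

Under these assumptions one problem flight needs the takings of about a hundred flights, or roughly 25 days on the route, in line with the rough figures in the text. This counts revenue only, with no operating costs, so it flatters the airline.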

And, of course, this doesn’t take into account that, if one airplane has problems, you may well have to mandate the minute inspection of the engines of any other of your planes that flew part of that route around that time frame. And since airlines use a hub system, that means any planes which flew into or out of the hub that the problem aircraft used.

That doesn’t look hugely promising for deciding to fly, does it?

Here is a further way you might then think. Somebody else, associated with government, is telling you you can’t fly. So, whatever your actual evaluation of the risk, you can play grumpy, and argue that the decision-maker is proxy for the government, so the government should be sharing with you the enormous cost of your – forcibly, you say – not being able to do business. Even if you might not have wanted to have tried doing business in those conditions anyway.

So expect discussions about bail-outs.

And, if you are a CEO who read my last post on this topic, you will realise that the uncertainty inevitably led to even a good a priori decision about the risk being more cautious than it is likely that the actual situation warranted. So you could wait for the actual data to accumulate, knowing that you will, in all likelihood, be able to argue “see, it was less dangerous than you said; we told you so”. And you would be right, albeit disingenuously.

So expect to see that argument as the basis of discussions about bail-outs.

Now, about that 2,000 micrograms per cubic meter – we would really like to know where that came from, wouldn’t we?

BTW, it turns out the Finns’ engine problems were not terminal. Flight Global reports that the Finnish engines were healthier than they looked at first – on Friday 23rd April, a week after the ash encounter occurred and after Europe had returned to commercial flying.



Flying in Volcanic Ash, Part 2

22 04 2010

The ash cloud over Europe seems to have abated somewhat, and commercial air traffic is returning to the air. The German DLR organisation (equivalent to the US NASA) sent up test flights of a Falcon 20E on Monday and Tuesday 19-20 April, to measure what was up there. The report, in English, makes interesting reading (Here is a local copy, for those having trouble accessing the original URL). There are pictures in which you can see the ash layers below the aircraft.

It has rained, very briefly, say spottily for 5 minutes, on Tuesday and Wednesday here. My windows are now covered with a fine yellowish film of what I take to be ash (I have some skylight-type windows as well as vertical ones). The temperatures in Bielefeld, Germany, where I am (about 100km west of Hannover) have also been unusually low for this time of year, say 10° during the day in the sunshine (though with significant wind chill) and getting near zero at night. Indeed, it even snowed briefly in some places near here yesterday (Wednesday). The light is unusually white in the sunshine, an effect particularly pronounced in the evening. People used to smoggy atmospheres (Los Angeles, San Francisco Bay Area) will be familiar with this phenomenon.

The debates now seem to be concentrating on whether governments (rather, their regulatory agencies) were too cautious, not cautious enough, or just right. The consensus appears to be that the reaction, essentially to close the airspace where the highest concentrations were known to be until Wednesday, may have been more cautious than the facts warranted, as the UK Minister for Transport, Andrew Adonis, said in this report on Wednesday. The political fallout has started, as in this report from The Times.

For the record, I think the reaction to this environmental phenomenon has been exemplary. First, the dangers of flying gas turbines through volcanic ash can be catastrophic, as I noted (with reference) in my first post on this topic. (David Crocker pointed out to me an article in Boeing Aero magazine from before the current phenomenon, which gives the necessary background information for those still searching for it.) Second, this phenomenon, that a major part of the world for commercial air traffic at all altitudes was affected, was unprecedented. Third, over the course of a few days, test flights taking measurements were organised and flown by the only organisations capable of producing believable results. Fourth, everyone was involved: manufacturers, regulators, and government. Fifth, the outcome so far has been as good as it could be for safety: no commercial air passengers have been killed or severely injured; there have been no train accidents injuring people who would have flown but were forced to take the train; ditto for ships.

And, sixth, the main point of this note: if everything is done “right” (whatever “right” may mean), and safety is prioritised, it follows with high likelihood that, in hindsight, when more is known, it will be seen that we have erred noticeably on the side of caution. This note is a qualitative argument using probability theory (but no math!) that this is so.

When the facts come in, hindsight is a wonderful thing. Safety is paramount to the regulators, by their charter, and also to the manufacturers of the equipment because of liability. The national governments chose to prioritise safety. The result could not have been better for safety. There was, last week, virtually perfect uncertainty as to the potential effects of this particular cloud. Standard industry practice, for many years if not decades, is to avoid all volcanic ash. So, at the beginning, this practice, evolved over decades of experience, was followed, in the face of considerable uncertainty. Within a very few days, various organisations had determined that it was likely safe to fly, say, research aircraft. Data were gathered, uncertainty was reduced, we are back to flying.

What could have been done differently? Safety was prioritised in the face of uncertainty. Should we not have prioritised safety? My answer is that prioritising safety was exactly the right move.

So what does prioritising safety involve? Risk is generally construed as a combination of likelihood and severity of untoward events. What was the risk involved in flying? Likelihood of a volcanic ash encounter over most airspace in Western Europe was certain (the various meteorological offices knew it was there), so there is no uncertainty there. The uncertainty with this risk resides, then, exclusively with the severity of the phenomenon (the effects of the ash cloud). Previous experience shows that the “worst case” is catastrophic, both for the people involved and (as it would be) for the government and agencies that would be said to have “allowed” an accident to happen. (Although severe accidents have not happened directly, losing all of one’s engines is defined to be “catastrophic” in aircraft-certification terms, because after a loss of all engines only environmental circumstances can affect whether one lands on-airport or off-airport, and the least favorable plausible environmental circumstances, here an off-airport forced landing and its likely deadly consequences, are taken to define the severity.) Since experience had shown that severity (defined as worst-case) over the sample (all volcanic-ash-encounter incidents) is catastrophic, one can attempt to define the sample more narrowly, to reduce the uncertainty if you like. What is the range of possible effects? Let us say, from mildly increased maintenance costs on gas turbine engines, to heavily increased maintenance costs, to flame-outs and the ensuing necessary tear-down of all engines of that type on all aircraft, up to the consequences of any accident resulting from near-simultaneous flame-outs of all engines on an airframe. We could presume on general physical principles that these effects are some function of the type of ash (known, and variable, in the current eruption), its density, and the length of exposure. But we don’t know what function. Furthermore, for all flights, there is going to be a range of densities encountered as well as a variety of lengths of exposure.

Now comes a little qualitative reasoning about likelihoods. This is the bit that people who haven’t studied the basics of probability and statistics don’t necessarily grasp, despite the best efforts of us professional educators over the decades. I am going to talk about a “bell curve”, and having just searched the WWW for “bell curve” it seems to me that we professional educators are somewhat to blame for this state of affairs, because the typical WWW explanations are technical enough to alienate anyone who doesn’t have a degree in higher mathematics, as we shall see in the reference immediately below! I will be avoiding any math here, but I do want to talk about “bump curves”.

A “bell curve” associates a range of possible values for a parameter (along the horizontal axis) with the frequency with which those values occur (on the vertical axis). The term itself is taken by technical people to refer specifically to the so-called Gaussian or Normal Distribution, in tech-speak. But actually I want to be more general than this. Take a look at the first graphic in that Wiki article, of “probability density function”, and you see four examples, in green, blue, red and yellow, of graphs I want to talk about. They are small at the ends and have a bump somewhere in the middle. Most uncertain phenomena look like this when you show values (horizontal) against frequency (vertical). When I say “like this”, I want now to allow that the “bump” can be pushed to one side, kinked, in all sorts of ways. Imagine that you had a Plasticine “bump” sitting on the floor, and you let your one-year-old stick hisher thumbs into it, push it around and so on, then you cut it in the middle with a knife and trace the outline of the cut on a piece of paper. It is going to be thicker nearer the middle and thinner near the edges. Let me call all these things “bump curves” for the sake of this note.

The particular “bump curve” I want to talk about is the “distribution” of severities of ash-cloud encounters. So on the “right hand side” we have all-engine flameouts (“catastrophic”); going to the left of that we have one-engine flameouts and consequent flight bans and tear-downs of all engines of that type; going further to the left we have highly increased maintenance (involving large costs and effort); moving further left we have mildly-increased maintenance; moving further we have insignificantly increased maintenance. Remember, we don’t know quite what this “bump curve” looks like, even whether it has “one bump or two”, and where the “bump” or “bumps” are. But let me assume it has, for all intents, one “bump”, to make it easier to follow my reasoning.

First, I want to make the “bump curve” more like a bell curve. I can do this as follows. Imagine I have drawn the bump curve on a rubber sheet. I have a metal frame, consisting of a horizontal track into which are inserted a succession of vertical rods. I can’t bend the rods or take them out of the track, but I can fix them anywhere I want on the track, as well as slide them left and right and then fix them in their new position. I glue my rubber sheet with the bump curve onto this frame of rods. Now, I slide the rods left and right, to stretch the sheet sideways more or less, to make it look more like the bell curve. So, for example, if the “bump” is to the right of center, then I stretch the sheet on the right of center until the curve on the right looks more like the curve on the left of the bump.

Now I have something that looks like the bell curve, but the scale on the horizontal is all distorted, because I have moved the rods around.
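The rubber-sheet manoeuvre amounts to a monotone re-scaling of the horizontal axis. Here is a minimal sketch in Python of one way to do that, assuming a made-up, right-skewed sample of severities. The lognormal shape, the sample size and the function name `stretch_to_bell` are illustrative assumptions, not data about ash encounters. Each value is moved to the position its rank would occupy under a standard bell curve, so the order of the values is kept and only the spacing between them changes.

```python
import random
from statistics import NormalDist


def stretch_to_bell(samples):
    """Place each sample on a standard-normal axis according to its rank.

    The mapping is monotone: no two values swap places, only the gaps
    between them change (the rods slide but never cross).
    """
    n = len(samples)
    order = sorted(range(n), key=lambda i: samples[i])
    std_normal = NormalDist()
    stretched = [0.0] * n
    for rank, i in enumerate(order):
        # (rank + 0.5) / n keeps the quantile strictly between 0 and 1
        stretched[i] = std_normal.inv_cdf((rank + 0.5) / n)
    return stretched


random.seed(1)
# Hypothetical skewed "bump curve" of severities (illustration only)
severities = [random.lognormvariate(0.0, 1.0) for _ in range(1000)]
stretched = stretch_to_bell(severities)
```

The stretched values are symmetric about zero by construction, but the original scale is lost: equal steps on the new axis no longer mean equal steps in severity, which is the distortion described above.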

And now I draw a vertical line on the rubber sheet, at the point which divides the consequences which are not deleterious to safety (on the left) from those which are (people killed or injured, on the right).

Suppose you are blindfolded, and some supernatural agent performs this manoeuvre I just described. You can’t see the curve, but you know it is more or less a “bell curve” because that is what the agent made it look like. You can feel the edges of the white board, so you know where the left side and right side of the curve lie (left edge: “insignificant”; right edge: “catastrophic”), and you can find the middle. But you don’t know how the rubber has been stretched, so you don’t actually know where the vertical “safety boundary” line is; whether it is to the left or to the right of middle.

Now you are given the following task. Put a mark on the board, as far to the right as possible, but to the left of the safety boundary line. Remember you don’t know where this line lies, because the agent has pulled the rubber in a way you didn’t and can’t observe. So you give it your best guess.

And behind you in line are another ninety-nine people who will try to perform the same task. All of you are perfect “rational agents”. In other words, you all think straight, think deep, are perfect at statistical and probabilistic reasoning, and do as well as you can at the given task. You are all trying to put your point as close to, but left of, the safety boundary line as you can guess. In other words, you are basically trying to guess where the line is.

I predict the outcome: almost all of you are going to place your point well left of center. If you don’t believe me, try it out with your “perfectly rational” group of friends!

Let us see what this means. Remember, we don’t actually know how the agent has stretched the curve, because we don’t know how the curve looked to start with. Suppose we now ask for the likelihood distribution of the position of the vertical “safety boundary” line. What is it going to look like? On general principles, it is going to look like some sort of “bell curve”. The bell curve is symmetric about its middle. But you and all your pals put your best guess as to where this line is on the left. That means that most of the area under the curve (which represents likelihood) is going to lie to the right of where you all put your points. That means that, when you don’t know where it is, it is most likely that the safety boundary line lies to the right of where you all put your points.

That means that your conjoint best guess as to where the safety boundary lies most likely errs noticeably towards the cautious (left) side. When somebody removes your blindfolds and you can see the curve (translated into our problem terms: somebody does the research so we know more about concentrations of ash in the atmosphere as well as what such concentrations might do to engines) you would expect to see that your choices are well to the left of the safety boundary line.
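One way to make the blindfold argument concrete is the following sketch. Everything numerical in it is an assumption chosen for illustration. The belief about where the boundary lies is taken to be a standard bell curve centred on the middle of the board. A mark to the right of the boundary incurs a fixed hypothetical penalty (`CATASTROPHE_COST`), and a mark to the left costs only the distance by which it falls short of the boundary. The code then searches for the mark with the smallest expected loss.

```python
from statistics import NormalDist

# Assumed symmetric belief about the boundary position (0 = middle of the board)
belief = NormalDist(mu=0.0, sigma=1.0)
# Hypothetical penalty for placing the mark right of the boundary
CATASTROPHE_COST = 50.0


def expected_loss(mark):
    # Likelihood that the boundary lies left of the mark (mark is unsafe)
    p_unsafe = belief.cdf(mark)
    # E[max(boundary - mark, 0)] for a standard normal belief:
    # the expected amount of safe territory left unused
    forgone = belief.pdf(mark) - mark * (1.0 - belief.cdf(mark))
    return CATASTROPHE_COST * p_unsafe + forgone


# Candidate marks from far left (-4) to far right (+4) in steps of 0.01
grid = [i / 100.0 for i in range(-400, 401)]
best_mark = min(grid, key=expected_loss)
# Likelihood that the boundary lies right of the chosen mark
p_boundary_right = 1.0 - belief.cdf(best_mark)
```

Under these assumptions the best mark lands roughly two and a half units left of centre, and nearly all of the likelihood for the boundary lies to its right. The exact position depends on the assumed penalty, but the mark stays left of centre as long as the penalty is large compared with the cost of excess caution.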

The moral of this story: if everybody were perfectly rational and used an appropriate risk-based approach with safety paramount, Lord Adonis’s statement is to be expected: the authorities should expect that they have guessed well left of the safety boundary line.

I hope to have shown you the following. Erring definitively on the side of caution is an expected outcome of a rational approach, in a situation of great uncertainty, to a risk of which the value ranges from insignificant to catastrophic.



Flying in Volcanic Ash

20 04 2010

The biggest political problem of the week seems to be that airlines have stopped flying in Europe, because of the ash cloud from the volcano Eyjafjallajökull. I must say that in Bielefeld it is wonderful to see the sky without the usual 15 or so condensation trails and the ensuing cirrus, but my wine/tea/coffee merchant and his son are stuck in Namibia at the end of a hunting holiday and desperately need to get back to work, so I understand well the economic side of this also.

Those who don’t understand what volcanic ash can do to gas turbine engines might want to check out this 2003 NASA report concerning damage to the engines of an aircraft which flew through an ash cloud on its way to Europe some years ago. The cloud was not visible to the pilots, and visual inspection of the engines on landing revealed no damage. But the engines were severely damaged. Many thanks to Robert Dorsett for finding this reference.

I have been reading a lot of half-thought-out commentary, but little that enumerates the issues. So here goes.

1. Volcanic ash contains a high proportion of silica. This particular eruption sequence has shown concentrations from just under one-half to about two-thirds, depending on the type of eruption (an eruption sequence is not necessarily uniform in type or composition), if some unnamed geologist cited by an anonymous poster on a forum is to be believed. (For those who wish to troll through the 90 pages of chatter on this on PPRuNe, I recommend in particular the contributions of the gentleman or lady name of “Sunfish”, who appears to be an Australian engineer, for example this one.)

2. The ash is very fine stuff.

3. The silica melts in some parts of the turbine, and gives other parts a nice glass coating as a consequence.

4. There are almost no data points for the behavior of engines under exposure to volcanic ash. There are just the occasional damage reports, as above. It is known that higher concentrations will cause flame out and seizing, but I doubt that the effect on engines of lower concentrations has been determined by anything much in the way of testing. For example, behavior on exposure to volcanic ash is not part of the certification requirements for engines. It looks like if you fly through it for a couple of hours then everything is OK on a visual inspection (thank you BA), but I doubt anyone knows what might happen if you fly through it for a week (an order-of-magnitude increase in exposure).

5. Suppose some engine, somewhere, has a problem. Then standard safety regulatory action would be to take the engine type out of service until it has been determined what the problem is. In this case, that means until one can rule out that flying numbers of hours through an ash cloud was a causal factor. If it was a causal factor, then the fleet is grounded until all the engines can be rebuilt. That could take rather a long time – months, not weeks. And if the engine happens to be an intercontinental one, flying under ETOPS, then what do you do about ETOPS approval for that type, for those engines exposed to ash? ETOPS is predicated on independent failures, not on common-cause failures such as flying through ash.

6. Airlines dependent on transatlantic traffic to generate revenue, such as BA, are going to be hurting. But it would hurt a lot more to have ETOPS rescinded on the airline’s entire 777 fleet pending rebuild/overhaul of the engines.

7. The likelihood that one engine, somewhere on one wing, in Europe, will have a problem in the next couple of weeks, is, just on general experience, not small. For the consequences of that, see point 5 above.

It is a hard problem. The problem arises from (a) the environment – the fact that the ash cloud is there; (b) long-established procedures for regulating aviation safety, which require that a fleet be grounded upon evidence of a problem; (c) the unknown but tangible likelihood that some problem will occur; (d) the severe consequences of such a problem, given the established procedures for regulating aviation safety; (e) the severe economic consequences of closing down airline travel in such a busy part of the world.

I have no solutions. And I very much doubt that anyone else has any, either. As a safety person, I favor keeping aircraft out of this stuff until it goes away.

Postscript.

1. Thomas Netter pointed out to me a broadcast on France Culture today by Olivier Duhamel (available today, Tuesday 20 April, from the France Culture daily programming site, see time 07:55, and I take it later from the archives), who, Thomas said, pointed out that risks were evaluated with respect to aircraft, rather than taking a systems approach to aircraft travel and evaluating the general social cost of grounding. So let’s do it, superficially. Let the general cost of grounding for everyone be X per week. We have so far suffered X. If one engine shows up with ash damage, that will cost 2-4X, right there, since regs will require the fleets be torn down and inspected, and I doubt that can be done in less than, say, a month. If we then ignore the regs, and have an aircraft lose both engines mid-Atlantic, that’s €300m – €1 billion out of insurers’ pockets (for which all air travellers have to pay, even though they might think it is only one airline). Not to speak of the political consequences for those who decide to let aircraft fly, when one is then lost. So those are the severities (some of them). Unless you can evaluate the likelihood of (a) discovering damage to one engine somewhere, and (b) having an ETOPS aircraft lose two, sometime in the future, due to ash damage, you cannot evaluate the social risk (usually taken as the multiplication of likelihood with severity for all hazards). I don’t hold much truck with saying that something isn’t being done, when no one can do it.

2. John Rushby just pointed out a thread in PPRuNe TechLog, which contains this interesting comment on what happens to gas turbines in ash clouds, by MFgeo.

3. The International Herald Tribune aka New York Times has this story today dealing inter alia with the politics. Apparently, [begin quote]The region is grappling with a new blow to its ability to act decisively during an emergency. ……… Most noisily, the head of the International Air Transport Association said before the announcement to partially lift the aviation ban that “the decision Europe has made is with no risk assessment, no consultation, no coordination, no leadership.” The industry group’s director general and chief executive, Giovanni Bisignani, went farther, saying that the crisis is a “European embarrassment” and “a European mess.”[end quote]

I think, in contrast to these suggestions, that the individual countries in the EU, which have legal responsibility for their airspace, have acted decisively, with “risk assessment” and “leadership” and what have you: the airspace is more or less closed; some flights with minimal possible exposure are taking place. You can’t get much more decisive than that. People who disagree with these measures could make their divergent risk assessments public. How about it, IATA?
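The superficial accounting in item 1 of the postscript can be written out as a short sketch. Costs are in units of X, the weekly cost of grounding, and the 2X and 4X figures are the tear-down costs suggested there. The function names and the idea of a weekly break-even likelihood are assumptions added for illustration, and the likelihoods themselves are exactly the quantities nobody can yet supply.

```python
def expected_cost(hazards):
    """Sum likelihood * severity over (likelihood, severity) pairs."""
    return sum(p * s for p, s in hazards)


GROUNDING_COST = 1.0  # X per week, used as the unit of cost


def flying_is_cheaper(p_damage, damage_cost, p_loss=0.0, loss_cost=0.0):
    """Compare one week of flying with one week of grounding.

    All four arguments are placeholders: the likelihoods are unknown,
    and loss_cost would first have to be converted into units of X.
    """
    flying = expected_cost([(p_damage, damage_cost), (p_loss, loss_cost)])
    return flying < GROUNDING_COST


# Weekly likelihood of finding engine damage at which flying and grounding
# cost the same, ignoring the loss of an aircraft altogether
break_even = {cost: GROUNDING_COST / cost for cost in (2.0, 4.0)}
```

Even on this crude reckoning, flying only comes out ahead if the weekly likelihood of discovering ash damage is below one half (for a 2X tear-down) or one quarter (for 4X), and any allowance for losing an aircraft lowers those thresholds further.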