The Political Economy of Volcanic Ash

28 04 2010

The Economist has of course a Briefing on the Effects of the Ash Cloud from Eyjafjallajökull on the political economy of flight, which informs its lead commentary in the April 24th 2010 edition, about this incident, entitled Earthly Powers.

Both articles recount that the “safe level” of ash determined by the CAA (in Britain, but in fact the measure was coordinated across the continent) started out at zero when the flight restrictions were first imposed on Thursday April 15th, and was then changed on Wednesday April 21st to 2,000 micrograms per cubic meter. The Economist regards it as “suspicious” that the level was changed “in the face of an affluent cadre of displaced people, airlines feeling the pinch, a looming threat to some supply chains, and (in Britain) an election.” I don’t regard it as “suspicious” – I think, given the evolution of knowledge and experience, the sequence of administrative events was both coherent and justified, with the following caveat. The newspaper suggests, correctly, that how the new level was determined “is not clear”. The CAA apparently says it was set on the basis of data from equipment manufacturers, but no public data has been made available, and I agree with The Economist here that “Regulations without a clear and open argument behind them are worrisome”.

The state of knowledge about the safety of commercial airline operations as the situation evolved is well summarised by David Learmount in his blog entry of Monday, April 19th. I agree with much of what David says, and I think it serves to allay “suspicions” of administrative mismanagement of the event, such as hinted at by The Economist. The amount of uncertainty at that point on Monday of the risks involved, both likelihood and severity, was enormous. [Added 29.04.2010: I find David's article in Flight International, 27 April - 3 May 2010, pp8-9, largely identical with his 24 April article in Flightglobal on the subject, a careful recounting of the safety aspects of the event.]

By Tuesday 20th April, the ash had confined itself to lower flight levels; upper airspace was freed for flight, and by Wednesday 21st April new guidance had been issued and implemented. I still think that shows an exemplary reaction to the situation.

Now let’s look in a little more detail at the political economy involved. I had suggested in a note to the York Safety-Critical Mailing List, probably somewhat arrogantly, that people didn’t seem to be “conversant with probability or decision theory”. A respondent, Chris Hills, eminently confirmed my suggestion with his line of argument.

The Finnish Air Force went on a training sortie on Thursday 15 April and suffered apparent damage to some engines. FlightGlobal doesn’t say how long they were up for, but one might guess it was on the order of an hour. Recall from Learmount’s blog note that, on Monday 19th, it was not yet known what the severity of damage was to the Finnish engines – Learmount suggested they “may never power an aeroplane again”.

Suppose you are the CEO of an airline that wants to fly in closed airspace. Air Berlin, for example, takes in about €90 per passenger per flight from Paderborn to London Stansted if you book shortly before flying, a flight time of about an hour, and they use standard workhorses, which for trips inside Europe are the twin-engine Airbus A320 series and Boeing 737 series, with seats for between 150 and 200 passengers. The engines put out, I think, about three times as much thrust each as the military engines, but they are higher by-pass (meaning that more cold air is propelled around, and not through, the core of the jet engine). Simple arithmetic shows us that the airline is taking in less than €20,000 for the Paderborn-Stansted flight. The cost of an engine rebuild or new engine (and, when one, then both!) lies well in the seven-figure range (I don’t know quite how much it might cost). That is, two orders of magnitude higher than the five-figure sum you are taking in. And until Monday 19th, after the research flights, no one really knew at what flight levels the ash was to be found. So, at a first guess, just to break even in monetary outlays, only one flight in a hundred can have such problems. Or, to put it another way, if just one plane on that route has problems, then you have to have another 24 days of problem-free flying on that route (two flights a day in each direction) to break even.
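The back-of-envelope arithmetic above can be sketched in a few lines of Python. The fare, seat range and flights per day come from the paragraph; the engine cost is only described as "well in the seven-figure range", so the €2,000,000 used here is an illustrative assumption, and the function names are made up for this note. Gross ticket revenue is used throughout, which flatters the airline because it ignores operating costs.

```python
import math

# Figures taken from the paragraph above.
FARE_EUR = 90                 # approximate late-booking fare per passenger
SEATS_RANGE = (150, 200)      # A320 / 737 family seating
FLIGHTS_PER_DAY = 4           # two flights a day in each direction

# ASSUMPTION: the text only says "well in the seven-figure range".
ENGINE_EVENT_COST_EUR = 2_000_000


def revenue_per_flight(seats: int, fare: float = FARE_EUR) -> float:
    """Gross ticket revenue for one full flight."""
    return seats * fare


def flights_to_cover(cost: float, revenue: float) -> int:
    """Number of full flights whose gross revenue covers one engine event."""
    return math.ceil(cost / revenue)


if __name__ == "__main__":
    for seats in SEATS_RANGE:
        rev = revenue_per_flight(seats)
        n = flights_to_cover(ENGINE_EVENT_COST_EUR, rev)
        days = n / FLIGHTS_PER_DAY
        print(f"{seats} seats: {rev:,.0f} EUR per flight, "
              f"{n} flights (about {days:.0f} days) to cover one engine event")
```

Changing the assumed engine cost moves the numbers around, but for any seven-figure cost the answer stays at roughly a hundred flights or more, which is the point of the paragraph.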

And, of course, this doesn’t take into account that, if one airplane has problems, you may well have to mandate the minute inspection of the engines of any other of your planes that flew part of that route around that time frame. And since airlines use a hub system, that means any planes which flew into or out of the hub that the problem aircraft used.

That doesn’t look hugely promising for deciding to fly, does it?

Here is a further way you might then think. Somebody else, associated with government, is telling you you can’t fly. So, whatever your actual evaluation of the risk, you can play grumpy, and argue that the decision-maker is proxy for the government, so the government should be sharing with you the enormous cost of your – forcibly, you say – not being able to do business. Even if you might not have wanted to have tried doing business in those conditions anyway.

So expect discussions about bail-outs.

And, if you are a CEO who read my last post on this topic, you will realise that the uncertainty inevitably led to even a good a priori decision about the risk being more cautious than it is likely that the actual situation warranted. So you could wait for the actual data to accumulate, knowing that you will, in all likelihood, be able to argue “see, it was less dangerous than you said; we told you so”. And you would be right, albeit disingenuously.

So expect to see that argument as the basis of discussions about bail-outs.

Now, about that 2,000 micrograms per cubic meter – we would really like to know where that came from, wouldn’t we?

BTW, it turns out the Finns’ engine problems were not terminal. Flight Global reported on Friday 23rd April, a week after the ash encounter occurred and after Europe had returned to commercial flying, that the Finnish engines were healthier than they looked at first.



Flying in Volcanic Ash, Part 2

22 04 2010

The ash cloud over Europe seems to have abated somewhat, and commercial air traffic is returning to the air. The German DLR organisation (equivalent to the US NASA) sent up test flights of a Falcon 20E on Monday and Tuesday 19-20 April, to measure what was up there. The report, in English, makes interesting reading (Here is a local copy, for those having trouble accessing the original URL). There are pictures in which you can see the ash layers below the aircraft.

It has rained, very briefly, say spottily for 5 minutes, on Tuesday and Wednesday here. My windows are now covered with a fine yellowish film of what I take to be ash (I have some skylight-type windows as well as vertical ones). The temperatures in Bielefeld, Germany, where I am (about 100km west of Hannover) have also been unusually low for this time of year, say 10° during the day in the sunshine (though with significant wind chill) and getting near zero at night. Indeed, it even snowed briefly in some places near here yesterday (Wednesday). The light is unusually white in the sunshine, an effect particularly pronounced in the evening. People used to smoggy atmospheres (Los Angeles, San Francisco Bay Area) will be familiar with this phenomenon.

The debates now seem to be concentrating on whether governments (rather, their regulatory agencies) were too cautious, not cautious enough, or just right. The consensus appears to be that the reaction, essentially to close the airspace where the highest concentrations were known to be until Wednesday, may have been more cautious than the facts warranted, as the UK Minister for Transport, Andrew Adonis, said in this report on Wednesday. The political fallout has started, as in this report from The Times.

For the record, I think the reaction to this environmental phenomenon has been exemplary. First, the dangers of flying gas turbines through volcanic ash can be catastrophic, as I noted (with reference) in my first post on this topic. (David Crocker pointed out to me an article in Boeing Aero magazine from before the current phenomenon, which gives the necessary background information for those still searching for it.) Second, this phenomenon, that a major part of the world for commercial air traffic at all altitudes was affected, was unprecedented. Third, over the course of a few days, test flights taking measurements were organised and flown by the only organisations capable of producing believable results. Fourth, everyone was involved: manufacturers, regulators, and government. Fifth, the outcome so far has been as good as it could be for safety: no commercial air passengers have been killed or severely injured; there have been no train accidents injuring people who would have flown but were forced to take the train; ditto for ships.

And, sixth, the main point of this note: if everything is done “right” (whatever “right” may mean), and safety is prioritised, it follows with high likelihood that, in hindsight, when more is known, it will be seen that we have erred noticeably on the side of caution. This note is a qualitative argument using probability theory (but no math!) that this is so.

When the facts come in, hindsight is a wonderful thing. Safety is paramount to the regulators, by their charter, and also to the manufacturers of the equipment because of liability. The national governments chose to prioritise safety. The result could not have been better for safety. There was, last week, virtually perfect uncertainty as to the potential effects of this particular cloud. Standard industry practice, for many years if not decades, is to avoid all volcanic ash. So, at the beginning, this practice, evolved over decades of experience, was followed, in the face of considerable uncertainty. Within a very few days, various organisations had determined that it was likely safe to fly, say, research aircraft. Data were gathered, uncertainty was reduced, we are back to flying.

What could have been done differently? Safety was prioritised in the face of uncertainty. Should we not have prioritised safety? My answer is that prioritising safety was exactly the right move.

So what does prioritising safety involve? Risk is generally construed as a combination of likelihood and severity of untoward events. What was the risk involved in flying? Likelihood of a volcanic ash encounter over most airspace in Western Europe was certain (the various meteorological offices knew it was there), so there is no uncertainty there. The uncertainty with this risk resides, then, exclusively with the severity of the phenomenon (the effects of the ash cloud). Previous experience shows that the “worst case” is catastrophic, both for the people involved and (as it would be) for the government and agencies that would be said to have “allowed” an accident to happen. (Although severe accidents have not happened directly, losing all of one’s engines is defined to be “catastrophic” in aircraft-certification terms, because after a loss of all engines only environmental circumstances can affect whether one lands on-airport or off-airport, and the least favorable plausible environmental circumstances, here an off-airport forced landing and its likely deadly consequences, are taken to define the severity.) Since experience had shown that severity (defined as worst-case) over the sample (all volcanic-ash-encounter incidents) is catastrophic, one can attempt to define the sample more narrowly, to reduce the uncertainty if you like. What is the range of possible effects? Let us say, from mildly increased maintenance costs on gas turbine engines, to heavily increased maintenance costs, to flame-outs and the ensuing necessary tear-down of all engines of that type on all aircraft, up to the consequences of any accident resulting from near-simultaneous flame-outs of all engines on an airframe. We could presume on general physical principles that these effects are some function of the type of ash (known, and variable, in the current eruption), its density, and the length of exposure. But we don’t know what function. Furthermore, for all flights, there is going to be a range of densities encountered as well as a variety of lengths of exposure.

Now comes a little qualitative reasoning about likelihoods. This is the bit that people who haven’t studied the basics of probability and statistics don’t necessarily grasp, despite the best efforts of us professional educators over the decades. I am going to talk about a “bell curve”, and having just searched the WWW for “bell curve” it seems to me that we professional educators are somewhat to blame for this state of affairs, because the typical WWW explanations are technical enough to alienate anyone who doesn’t have a degree in higher mathematics, as we shall see in the reference immediately below! I will be avoiding any math here, but I do want to talk about “bump curves”.

A “bell curve” associates a range of possible values for a parameter (along the horizontal axis) with the frequency with which those values occur (on the vertical axis). The term itself is taken by technical people to refer specifically to the so-called Gaussian or Normal Distribution, in tech-speak. But actually I want to be more general than this. Take a look at the first graphic in that Wiki article, of “probability density function”, and you see four examples, in green, blue, red and yellow, of graphs I want to talk about. They are small at the ends and have a bump somewhere in the middle. Most uncertain phenomena look like this when you show values (horizontal) against frequency (vertical). When I say “like this”, I want now to allow that the “bump” can be pushed to one side, kinked, in all sorts of ways. Imagine that you had a Plasticine “bump” sitting on the floor, and you let your one-year-old stick his or her thumbs into it, push it around and so on, then you cut it in the middle with a knife and trace the outline of the cut on a piece of paper. It is going to be thicker nearer the middle and thinner near the edges. Let me call all these things “bump curves” for the sake of this note.

The particular “bump curve” I want to talk about is the “distribution” of severities of ash-cloud encounters. So on the “right hand side” we have all-engine flameouts (“catastrophic”); going to the left of that we have one-engine flameouts and consequent flight bans and tear-downs of all engines of that type; going further to the left we have highly increased maintenance (involving large costs and effort); moving further left we have mildly-increased maintenance; moving further we have insignificantly increased maintenance. Remember, we don’t know quite what this “bump curve” looks like, even whether it has “one bump or two”, and where the “bump” or “bumps” are. But let me assume it has, for all intents, one “bump”, to make it easier to follow my reasoning.

First, I want to make the “bump curve” more like a bell curve. I can do this as follows. Imagine I have drawn the bump curve on a rubber sheet. I have a metal frame, consisting of a horizontal track into which are inserted a succession of vertical rods. I can’t bend the rods or take them out of the track, but I can fix them anywhere I want on the track, as well as slide them left and right and then fix them in their new position. I glue my rubber sheet with the bump curve onto this frame of rods. Now, I slide the rods left and right, to stretch the sheet sideways more or less, to make it look more like the bell curve. So, for example, if the “bump” is to the right of center, then I stretch the sheet on the right of center until the curve on the right looks more like the curve on the left of the bump.

Now I have something that looks like the bell curve, but the scale on the horizontal is all distorted, because I have moved the rods around.

And now I draw a vertical line on the rubber sheet, at the point which divides the consequences which are not deleterious to safety (on the left) from those consequences which are deleterious (people killed or injured).

Suppose you are blindfolded, and some supernatural agent performs the manoeuvre I just described. You can’t see the curve, but you know it is more or less a “bell curve” because that is what the agent made it look like. You can feel the edges of the white board, so you know where the left side and right side of the curve lie (left edge: “insignificant”; right edge: “catastrophic”), and you can find the middle. But you don’t know how the rubber has been stretched, so you don’t actually know where the vertical “safety boundary” line is; whether it is to the left or to the right of middle.

Now you are given the following task. Put a mark on the board, as far to the right as possible, but to the left of the safety boundary line. Remember you don’t know where this line lies, because the agent has pulled the rubber in a way you didn’t and can’t observe. So you give it your best guess.

And behind you in line are another ninety-nine people who will try to perform the same task. All of you are perfect “rational agents”. In other words, you all think straight, think deep, are perfect at statistical and probabilistic reasoning, and do as well as you can at the given task. You are all trying to put your point as close to, but left of, the safety boundary line as you can guess. In other words, you are basically trying to guess where the line is.

I predict the outcome: almost all of you are going to place your point well left of center. If you don’t believe me, try it out with your “perfectly rational” group of friends!

Let us see what this means. Remember, we don’t actually know how the agent has stretched the curve, because we don’t know how the curve looked to start with. Suppose we now ask for the likelihood distribution of the position of the vertical “safety boundary” line. What is it going to look like? On general principles, it is going to look like some sort of “bell curve”. The bell curve is symmetric about its middle. But you and all your pals put your best guess as to where this line is on the left. That means that most of the area under the curve (which represents likelihood) is going to lie to the right of where you all put your points. That means that, when you don’t know where it is, it is most likely that the safety boundary line lies to the right of where you all put your points.

That means that your conjoint best guess as to where the safety boundary lies most likely errs noticeably towards the cautious (left) side. When somebody removes your blindfolds and you can see the curve (translated into our problem terms: somebody does the research so we know more about concentrations of ash in the atmosphere as well as what such concentrations might do to engines) you would expect to see that your choices are well to the left of the safety boundary line.

The moral of this story: if everybody were perfectly rational and used an appropriate risk-based approach with safety paramount, Lord Adonis’s statement is to be expected: the authorities should expect that they have guessed well left of the safety boundary line.

I hope to have shown you the following. Erring definitively on the side of caution is an expected outcome of a rational approach, in a situation of great uncertainty, to a risk of which the value ranges from insignificant to catastrophic.
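For readers who would like to see the blindfold argument with numbers after all, here is a minimal Python sketch. It is one possible formalisation, not the only one, and every modelling choice in it is an assumption made for illustration: the belief about where the safety boundary lies is a symmetric bump (a Beta(4, 4) distribution on a 0 to 1 axis), overshooting the boundary is charged 99 times as heavily as undershooting it, and the function names are invented here. Under such a lopsided linear penalty, the guess with the lowest expected loss is a low quantile of the belief, so a rational guesser lands well left of centre.

```python
import random


def pinball_loss(guess, boundary, cost_unsafe, cost_cautious):
    """Loss for one guess: overshooting the boundary is the unsafe side."""
    if guess > boundary:
        return cost_unsafe * (guess - boundary)
    return cost_cautious * (boundary - guess)


def mean_loss(guess, samples, cost_unsafe, cost_cautious):
    """Average loss of a guess over all sampled boundary positions."""
    total = sum(pinball_loss(guess, b, cost_unsafe, cost_cautious)
                for b in samples)
    return total / len(samples)


def best_guess(samples, cost_unsafe, cost_cautious):
    """Guess minimising expected loss: the matching quantile of the belief."""
    q = cost_cautious / (cost_cautious + cost_unsafe)
    ordered = sorted(samples)
    return ordered[int(q * (len(ordered) - 1))]


def share_right_of(guess, samples):
    """Fraction of plausible boundary positions lying right of the guess."""
    return sum(b > guess for b in samples) / len(samples)


if __name__ == "__main__":
    rng = random.Random(2010)
    # ASSUMPTION: a symmetric bump-shaped belief about the boundary position,
    # on an axis from 0 (insignificant) to 1 (catastrophic).
    belief = [rng.betavariate(4, 4) for _ in range(100_000)]
    guess = best_guess(belief, cost_unsafe=99.0, cost_cautious=1.0)
    print(f"guess = {guess:.2f}")
    print(f"boundary lies to the right in "
          f"{share_right_of(guess, belief):.0%} of sampled cases")
    print(f"mean loss at guess:  {mean_loss(guess, belief, 99.0, 1.0):.2f}")
    print(f"mean loss at centre: {mean_loss(0.5, belief, 99.0, 1.0):.2f}")
```

The 99-to-1 penalty ratio is arbitrary. Any ratio that treats the unsafe side as much worse than the cautious side pushes the best guess to the left, and the removal of the blindfold will then usually show the boundary to be further right than the guess.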



Flying in Volcanic Ash

20 04 2010

The biggest political problem of the week seems to be that airlines have stopped flying in Europe, because of the ash cloud from the volcano Eyjafjallajökull. I must say that in Bielefeld it is wonderful to see the sky without the usual 15 or so condensation trails and the ensuing cirrus, but my wine/tea/coffee merchant and his son are stuck in Namibia at the end of a hunting holiday and desperately need to get back to work, so I understand well the economic side of this also.

Those who don’t understand what volcanic ash can do to gas turbine engines might want to check out this 2003 NASA report concerning damage to the engines of an aircraft which flew through an ash cloud on its way to Europe some years ago. The cloud was not visible to the pilots, and visual inspection of the engines on landing revealed no damage. But the engines were severely damaged. Many thanks to Robert Dorsett for finding this reference.

I have been reading a lot of half-thought-out commentary, but little that enumerates the issues. So here goes.

1. Volcanic ash contains a high proportion of silica. This particular eruption sequence has shown concentrations from just under one-half to about two-thirds, depending on the type of eruption (an eruption sequence is not necessarily uniform in type or composition), if some unnamed geologist cited by an anonymous poster on a forum is to be believed. (For those who wish to trawl through the 90 pages of chatter on this on PPRuNe, I recommend in particular the contributions of the gentleman or lady by the name of “Sunfish”, who appears to be an Australian engineer, for example this one.)

2. The ash is very fine stuff.

3. The silica melts in some parts of the turbine, and gives other parts a nice glass coating as a consequence.

4. There are almost no data points for the behavior of engines under exposure to volcanic ash. There are just the occasional damage reports, as above. It is known that higher concentrations will cause flame out and seizing, but I doubt that the effect on engines of lower concentrations has been determined by anything much in the way of testing. For example, behavior on exposure to volcanic ash is not part of the certification requirements for engines. It looks like if you fly through it for a couple of hours then everything is OK on a visual inspection (thank you BA), but I doubt anyone knows what might happen if you fly through it for a week (an order-of-magnitude increase in exposure).

5. Suppose some engine, somewhere, has a problem. Then standard safety regulatory action would be to take the engine type out of service until it has been determined what the problem is. In this case, until one can rule out that flying for a number of hours through an ash cloud was a causal factor. If it was a causal factor, then the fleet is grounded until all the engines can be rebuilt. That could take rather a long time – months, not weeks. And if the engine happens to be an intercontinental one, flying under ETOPS, then what do you do about ETOPS approval for that type, for those engines exposed to ash? ETOPS is predicated on independent failures, not on common-cause failures such as flying through ash.

6. Airlines dependent on transatlantic traffic to generate revenue, such as BA, are going to be hurting. But it would hurt a lot more to have ETOPS rescinded on the airline’s entire 777 fleet pending rebuild/overhaul of the engines.

7. The likelihood that one engine, somewhere on one wing, in Europe, will have a problem in the next couple of weeks, is, just on general experience, not small. For the consequences of that, see point 5 above.

It is a hard problem. The problem arises from (a) the environment – the fact that the ash cloud is there; (b) long established procedures for regulating aviation safety, which require that a fleet be grounded upon evidence of a problem; (c) the unknown but tangible likelihood that some problem will occur; (d) the severe consequences of such a problem, given the established procedures for regulating aviation safety; (e) the severe economic consequences of closing down airline travel in such a busy part of the world.

I have no solutions. And I very much doubt that anyone else has any, either. As a safety person, I favor keeping aircraft out of this stuff until it goes away.

Postscript.

1. Thomas Netter pointed out to me a broadcast on France Culture today by Olivier Duhamel (available today, Tuesday 20 April, from the France Culture daily programming site, see time 07:55, and I take it later from the archives), who, Thomas said, pointed out that risks were evaluated with respect to aircraft, rather than taking a systems approach to aircraft travel and evaluating the general social cost of grounding. So let’s do it, superficially. Let the general cost of grounding for everyone be X per week. We have so far suffered X. If one engine shows up with ash damage, that will cost 2-4X, right there, since regs will require the fleets be torn down and inspected, and I doubt that can be done in less than, say, a month. If we then ignore the regs, and have an aircraft lose both engines mid-Atlantic, that’s €300m – €1 billion out of insurers’ pockets (for which all air travellers have to pay, even though they might think it is only one airline). Not to speak of the political consequences for those who decide to let aircraft fly, when one is then lost. So those are the severities (some of them). Unless you can evaluate the likelihood of (a) discovering damage to one engine somewhere, and (b) having an ETOPS aircraft lose two, sometime in the future, due to ash damage, you cannot evaluate the social risk (usually taken as the multiplication of likelihood with severity for all hazards). I don’t hold much truck with saying that something isn’t being done, when no one can do it.

2. John Rushby just pointed out a thread in PPRuNe TechLog, which contains this interesting comment on what happens to gas turbines in ash clouds, by MFgeo.

3. The International Herald Tribune aka New York Times has this story today dealing inter alia with the politics. Apparently, [begin quote]The region is grappling with a new blow to its ability to act decisively during an emergency. ……… Most noisily, the head of the International Air Transport Association said before the announcement to partially lift the aviation ban that “the decision Europe has made is with no risk assessment, no consultation, no coordination, no leadership.” The industry group’s director general and chief executive, Giovanni Bisignani, went farther, saying that the crisis is a “European embarrassment” and “a European mess.”[end quote]

I think, in contrast to these suggestions, that the individual countries in the EU, which have legal responsibility for their airspace, have acted decisively, with “risk assessment” and “leadership” and what have you: the airspace is more or less closed; some flights with minimal possible exposure are taking place. You can’t get much more decisive than that. People who disagree with these measures could make their divergent risk assessments public. How about it, IATA?
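The point of item 1 of the postscript, that social risk is the sum over hazards of likelihood times severity and so cannot be computed while the likelihoods are unknown, can be sketched as follows. This is an illustration only: the `Hazard` class and `social_risk` function are invented for this note, severities are expressed in units of X (one week of grounding), the 3.0 is simply the midpoint of the 2-4X range given above, and the 0.5 likelihood in the last lines is a made-up number used to show the mechanics.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hazard:
    name: str
    severity_in_x: float          # cost in units of X (one week of grounding)
    likelihood: Optional[float]   # None means that nobody knows


def social_risk(hazards: List[Hazard]) -> Optional[float]:
    """Sum of likelihood * severity, or None if any likelihood is unknown."""
    total = 0.0
    for hazard in hazards:
        if hazard.likelihood is None:
            return None
        total += hazard.likelihood * hazard.severity_in_x
    return total


if __name__ == "__main__":
    # Severity from the postscript (midpoint of 2-4X); likelihood unknown.
    unknown = [Hazard("ash damage found on one engine", 3.0, None)]
    print(social_risk(unknown))   # no figure can be produced

    # HYPOTHETICAL likelihood, purely to show how the sum would work.
    guessed = [Hazard("ash damage found on one engine", 3.0, 0.5)]
    print(social_risk(guessed))
```

Returning `None` instead of a number is deliberate: it makes the missing likelihood visible, so that nobody substitutes a guess silently.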



Thoughts on Engineering Communication (with a bit on Ice Particle Icing and AF447)

21 08 2009

I have been thinking recently about professional engineering communication.

I was reminded once again of the lack of consensus by Nancy Leveson’s comment that “[t]he type of limited interaction that is possible by email is just not conducive to communication” as well as her regret at being “… pulled into one of these web debates because it takes so much time and produces so little”, on the University of York safety-critical systems list http://www.cs.york.ac.uk/hise/safety-critical-archive/2009/0369.html. I don’t agree with this view on email. I am a heavy user of email, both for longer essay-style pieces (although I am now moving more towards blogging) and for short exchanges. I consider e-mail lists such as that run by York to be an appropriate and helpful form of professional communication. I might agree partially with her view on WWW forums, because I find some forms problematic for professional purposes, but then again I think some of them work well (for example, the York list archive is a WWW forum).

I think no one medium available to us satisfies all the communicative needs of engineers in a developing field. I propose that prowess in engineering communication, traditionally required for evaluation of academic personnel, be based on more than traditional journal- and conference-paper publishing.

Advance in engineering depends on communication somehow. If one person in the world finds out how to solve engineering problem X, then unless he or she spreads the word, or word gets around via his or her customers, that technique remains hidden and others will not use it to solve problem X.

For the solution of specific engineering problems, or for the communication of engineering problems themselves (such as the “hot topic” of ice particle icing), it seems to me that traditional journal and conference publication works quite well, even though there are all sorts of problems with peer review procedures.

However, for discussion of current practice, or historical practice, and for discussion in general, declamatory articles such as those which appear in journals or conference proceedings don’t work that well. Neither do the magazines (because articles are by their nature declamatory). Journal or magazine letters pages also don’t work that well – witness the recent interchange between Keith Miller and myself on the Gotterbarn/Miller paper in the June 2009 IEEE Computer, which proceeded much more rapidly and fruitfully, but also privately, by e-mail than it did through the letters/reply section in the magazine (IEEE Computer, August 2009). See previous blog posts here for the public exchange.

I hold discussion to be very important in the engineering profession. Witness, if you will, again, the Ladkin/Miller exchange. Had this not occurred, Messrs Gotterbarn and Miller would be on record as holding that the recent A330 incidents were an instance of SE ethical problems of a certain sort, whereas they now agree that the issues are more subtle, if not other, than they originally proposed. A change of view arrived at through discussion.

Consider another example: how does one best handle issues of best practice, such as formal-language specifications versus natural-language specifications? Such issues need discussion: some think “natural language specs are best”; others think “formal language specs are best”, and there are different communities of practice built around these views. If you work in safety-critical electronics in the European railway industry, you must use natural-language requirements specifications because the standard says so, even though you might think this is a load of junk. Whereas if you work in one of the more prestigious sectors of avionics, you would likely do formal-language specs, even if you were a nat-lang-spec enthusiast.

Some people think the standardisation processes suffice for communication of best practice. Others think, as I do, that neither the standardisation process nor the emission of standards suffices to communicate best practice. Indeed, I would go further. I also do not think the emission of standards necessarily embodies best practice, as my contributions over the years on the functional-safety standard IEC 61508 on the York list may indicate.

So what does embody best practice and how does one tell? Well, one thing to observe about the engineering profession is that there is no one way to skin a cat. There are many, and the best engineers will be intimately familiar with all of them, or at least with as many as they can be. One engineer may prefer one way, another engineer another way. What could they suggest to a third engineer, also attempting to skin a cat?

Engineer A: “Do it my way.” Engineer B: “Don’t do it his way; do it my way”

or

Engineer A: “Do it my way.” Engineer B: “Yes, do it his way; don’t do it my way”

or

Engineer A: “I do it this way, but any other way will work. However, I can help you best with my way.”

All these answers are possible from responsible engineers, who would have taken into account their interrogator’s environment and that of his/her task.

Engineers must interact this way. It is an important part of what they do. It is communication, it is necessary, and the question I wish to address is how, using what form, it may best be accomplished.

Let’s make it more concrete, with a concocted example whose content appears regularly on the York list.

Question: “I am building such and such a safety-critical system and we have to use the programming language C because that is what we have a compiler for, for the chosen hardware. Is this OK or should I veto the project?”

Answer 1: “Your source code, if it is written in C, will have no well-defined unique meaning. C compilers have odd quirks such as producing different object-code behavior depending on which order one writes the arguments to a test, and ………. So you will not be able to tell exactly what your object code does and thereby not be able to assure the behavior of your system to the required degree. To get the highest degree of assurance attainable by any practice to date, use, say, SPARK and an Ada compiler to avoid the problems with C detailed above, and to take advantage of the documented quality of SPARK code development. This may necessitate changing the underlying hardware if there is no Ada compiler targeted to your hardware. If you can’t change the hardware then recommend SPARK for the above reasons and at the same time veto the project.”

Answer 2: “There exists enough experience with C and C subsets such as the MISRA subset and C static analysis tools that you can be fairly assured of a more-or-less unique meaning for what your object code does, providing you pay a lot of attention to the known weaknesses of C constructs as listed in [a ten-year-old book] and you are careful about your choice of compiler and carefully research the known problems with the compiler and avoid them. The available analysis tools aren’t perfect but they are pretty good for most purposes. And, besides, Engineer Y has shown one can [read: he can] do this in a significant project. And, besides, everybody does it. And, besides, if you are stuck with this hardware, as you say you are, you have no real choice.”

Riposte from Answerer 1: “Sure, Y is one of five people, or fifty people, or one hundred people in the world that have a track record of doing this. Hire him. Or one of the other 5/50/100. Then you might be OK. Else, do it the way I said.”

Now, imagine you are buying the car in which this equipment is installed, one of a few thousand, or a few hundred thousand, or a few million built, for your family. Wouldn’t you rather that such a discussion had taken place in a highly prestigious forum, which as many eminent engineers as possible read, and to which they can contribute their views as required? And that some sort of consensus had developed as to what the questioner should do, and that some sort of assurance was available that he/she had done it?

So what would that forum be? The York mailing list? Not really: not all professionals read that list, and some of them think of it that “[t]he type of limited interaction that is possible by email is just not conducive to communication.” Leading journals that everybody reads? Well, it doesn’t happen. Or, better said, in my experience the journals in which such things appear are not much worth reading. Why is that?

It is that way, I propose, because this kind of discussion is not accorded the prestige which, say, journal publication of research accrues. In my view, a way should be found to value participation in insightful and fruitful discussion as prestigiously as journal publication, because such discussion is equally vital to engineering, as I hope I have just shown.

Well, a gainsayer might say, Engineer A can publish his/her view in a journal. Then Engineer B can reply. And then Engineer C, and so on.

I don’t think that will work in general. Consider the following recent example.

On June 1, an Air France A330 crashed into the South Atlantic in an area of unstable weather, having sent a series of cryptic maintenance messages from the Central Maintenance Computer as its last communications. Bits of the aircraft have been found, but not the bits most important to knowing why it went down.

Somebody found and published a report from another airline of a flight which had suffered similar phenomena at a similar altitude. And then other reports surfaced. People who had access to these reports had their own professional interests which would induce them to certain behavior, such as keeping them quiet or broadcasting them. Broadcasting is the only stable state: you cannot keep something under wraps once it has been broadcast. One of the major players is an anonymous broadcaster, a WWW site, called eurocockpit.com. The advantage of broadcast in this instance is that all the various pieces of data, available only to some people and not to others, have been brought together into the public domain.

The result of this communication activity has been that, probably within a month and certainly within two months, almost all pilots are aware of and wary of a phenomenon which on May 31, 2009 was not known to exist: high-altitude ice-particle icing of air data sensors. There were individual incidents, indeed many, but nobody knew about them all, and if you just know about one or two perplexing incidents there are many possible causes of it or them. But when you have a dozen, or a couple of dozen, and another one occurs as you are wondering, then it concentrates the mind wonderfully. The result is that EASA has published a proposal for an Airworthiness Directive aimed at replacing all those sensors thought to be more susceptible to ice particle icing than others.

The odd thing about this example is that the airplane in question has been in service for well over a decade, indeed much nearer two, and these incidents have apparently only occurred since March 2008. Explain that one! (Anyone who says “global warming” must go stand in the corner for an hour :-) )

My view is that you cannot explain it at the moment, but that the communication behavior around whatever symptoms of whatever phenomenon we are talking about here (likely ice particle icing) could have been different from what it was up to the loss of Air France flight 447 on June 1, 2009, which apparently suffered these symptoms. And maybe it could have been different in such a way as to have led to measures which could have averted the loss of an airplane and its occupants? A fine article on this history, which raises this question, has recently been written by Jens Flottau and appeared in Aviation Week and Space Technology on August 10, 2009: Response to Airbus Pitot Tube Incidents Under Scrutiny.

To be clear: I am talking here about forms of communication which we use, and not at all about any specific individual or organisational behavior. I am not suggesting that any individual, group or organisation did less than the very best they could about the evolving issue. Indeed, this remark serves to strengthen my suggestion that the communication forms themselves can give us a level of control over engineering developments, such as experiencing, recognising and then handling ice particle icing of air data sensors, which we do not currently possess.

It is not just ice particle icing of air data sensors. Ice particle icing caused engine problems to one type of engine on the BAe 146 airplane. It was not known to occur to others, but some Boeing and Honeywell engineers looked at incidents of surge, flameout and other anomalous events at altitude on other airplanes and came to the conclusion that they were due to icing phenomena at high altitude, sometimes in cloud which was so thin that it barely hindered visibility. This stuff has appeared in the journal literature: see The Ice Particle Threat to Engines in Flight by Mason, Strapp and Chow, 2006, which refers to Cloud Particle Measurements in Thunderstorm Anvils and Possible Threat to Aviation by Lawson, Angus and Heymsfield, 1998. And in 2006 there came NTSB Recommendations to the FAA. But there were still 20,000-hour long-haul pilots (for all I know, still are), a group of people to whom this phenomenon would surely be of great interest, who apparently do not know of this work. One said even as late as a month ago that he does not accept that ice forms below -40°C: http://www.pprune.org/tech-log/381558-ice-crystals-2.html#post5070024, and http://www.pprune.org/tech-log/381558-ice-crystals-3.html#post5074951.

It is through the communication of incidents, each of which was previously known only to a few people, many of those people being different people, that a dangerous phenomenon, ice particle icing of air data sensors at high-altitude and cold temperatures, has been identified. This is a significant engineering achievement. How did it happen? WWW. E-Mail Lists. WWW Forums. And, also, traditional methods of communication amongst appointed representatives of involved organisations. But by no means solely the latter.

So, given that discussion and communication are vital to engineers, and the traditional form of journal publication does not suffice, how should the contribution of, say, a research engineer be assessed? (For purposes, for example, of awarding a prize, or awarding tenure, or of getting an academic job in the first place.) I propose that such assessment also look at participation in these other essential communicative activities and not just traditional publications. I agree there is a problem of parameters and quality control. Just getting hits on your blog isn’t necessarily a good measure; but getting the most hits on your blog of anybody working in your area just might be.

To finish up: what forms of communication work, and how?

1. Obviously peer-reviewed journals and conference papers work.

2. Obviously WWW sites with journal-style papers work.

3. I would contend that moderated, selective forums such as the Risks Forum work.

4. I think some sorts of blogs work. I am sceptical of the frequently-written 200-word anecdotal variety of the sort the IEEE is promoting, but I do like the weekly-essay variety employed to such notable effect by people such as Nobel laureate Gary Becker and Judge Richard Posner at the University of Chicago in their blog. It is by following such blogs for a while that I believe I have come to understand what they are good for, and have started trying to emulate.

5. For specific purposes, such as the wider collection and dissemination of controlled information, carefully-moderated anonymous forums such as eurocockpit.com work.

These are all declamatory forms, with only limited, asymmetrical possibility for discussion. What works for the kind of essential discussion I illustrate above?

6. Not anonymous WWW forums. I don’t yet know a forum which can be successfully followed unless one has lots of free time and a huge tolerance for purposeless commentary or for poseurs. For example, I have made two unsuccessful attempts to develop a presence on PPRuNe, the professional aviation people’s forum, and PPRuNe seems to me to be head and shoulders above anything else in which one can discuss aviation accidents. The main issue seems to be that moderation attempts are often overwhelmed by the task on high-interest topics, and no one seems to have a good solution to this phenomenon.

7. Yes, non-anonymous controlled-access WWW forums. Such as the York mailing list. (Note that its archive makes this list into such a forum.) A colleague to whom I once mentioned that I had been contacted to write a textbook on safety suggested that all I had to do was collect what I had written on the York list over the years and organise it. (Yes, well, the organisation part. It was simpler to start writing from scratch :-) )

8. Something that does not exist, but well might. Peer-reviewed or moderated (same thing, maybe?) non-anonymous forums for publication of essays and for discussion. There is a fundamental tension between encouraging comment, insight and debate, and insisting on quality. Quality means taking time over composition, which in turn discourages people from contributing. There are such forums at present, for example the functional safety area on the IEC WWW site, but they are not hives of intellectual activity.

9. Jan Sanders suggested using video. A forum in which engineering questions could be put, and engineers give their answers verbally in a video, and videoconferencing could be used to resolve, or at least further to discuss, discrepancies. Like written forums, this would be moderated to ensure quality. The advantage of videos would be that it takes many people less time both to record their views and to receive the views of others through speech than it does through writing, and speech is most effective when one sees the speaker speaking.

I am a fan of debating, I like mailing lists and, newly for me, blogs, and I wish there were some way of professionally assessing contributions to these forms of communication.



AF447: Issues Clarified by the BEA Report

4 08 2009

There are some significant issues which are clarified by the BEA’s preliminary factual report, issued at the beginning of July: specifically the uncertainties and certainties in the meaning and partial interpretation of the maintenance messages received by ACARS; the question of structural integrity; the attitude and flight path of the aircraft on impact with the ocean surface; and the weather phenomena in the vicinity of the flight at the time it was presumed to be lost. The ACARS messages indicate strongly that there was a situation with unreliable airspeed indication. Since the accident more incidents of unreliable airspeed indications at high altitude have come to light. I comment on these continuing developments in a separate post. I comment here on structural integrity and what it tells us about how the airplane may have behaved; weather and position; contacts with ATC; and the interpretation of the ACARS maintenance messages received.

The vertical tail of the aircraft was the major piece of structure found during the search. It had separated, taking some parts of the fuselage, including box-section pieces, with it at one attachment. The question arose whether it could have separated in flight. Our collaborator, the aerodynamicist Clive Leyman, showed in work in June that it would not be possible above, say, FL 170 to FL 200 to generate enough dynamic pressure on the vertical stabiliser of the A330 to cause it to fail. And even at that general altitude, overspeed would be a necessary contributing factor to any failure. His conclusion rested on dynamic-pressure calculations, using the datum that the A330 vertical stabiliser failed during destructive testing at 2.0 times design load. The aircraft was cleared at FL 350. So we knew here from Clive’s work that an upset would have been a necessary precursor to loss of structural integrity. The main question thus is: what would have caused an upset?
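The following is not Clive's calculation, which I do not reproduce here. It is only a minimal sketch of why altitude matters so much in this kind of argument. It assumes the International Standard Atmosphere and the textbook relation that dynamic pressure equals 0.5 × γ × static pressure × Mach², and it compares an arbitrarily chosen Mach number (0.82, my assumption) at two flight levels. The function names are invented for illustration.

```python
# ISA constants (troposphere, valid from sea level up to 11 km)
P0 = 101325.0      # sea-level static pressure, Pa
T0 = 288.15        # sea-level temperature, K
LAPSE = 0.0065     # temperature lapse rate, K/m
G = 9.80665        # gravitational acceleration, m/s^2
R = 287.053        # specific gas constant of dry air, J/(kg K)
GAMMA = 1.4        # ratio of specific heats for air


def flight_level_to_metres(flight_level: int) -> float:
    """A flight level is pressure altitude in hundreds of feet."""
    return flight_level * 100 * 0.3048


def isa_pressure(altitude_m: float) -> float:
    """ISA static pressure in Pa for an altitude in the troposphere."""
    if not 0.0 <= altitude_m <= 11000.0:
        raise ValueError("this sketch only covers 0 to 11 km")
    return P0 * (1.0 - LAPSE * altitude_m / T0) ** (G / (R * LAPSE))


def dynamic_pressure(mach: float, flight_level: int) -> float:
    """Dynamic pressure in Pa: q = 0.5 * gamma * p * M^2."""
    p = isa_pressure(flight_level_to_metres(flight_level))
    return 0.5 * GAMMA * p * mach ** 2


MACH = 0.82  # illustrative value only, not taken from any report
q_fl350 = dynamic_pressure(MACH, 350)
q_fl170 = dynamic_pressure(MACH, 170)
ratio = q_fl170 / q_fl350
print(f"q at FL350: {q_fl350:.0f} Pa, q at FL170: {q_fl170:.0f} Pa, ratio {ratio:.2f}")
```

At the same Mach number the dynamic pressure at FL 170 comes out at somewhat more than twice that at FL 350, simply because the air is that much denser. Whether a given structure fails depends on much more than this, of course; the sketch only shows the direction and rough size of the altitude effect.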

Indeed, the BEA determined from inspection of the retrieved wreckage – over 600 individual pieces – that the aircraft hit the water intact, in more or less level attitude with a high vertical rate of descent. This does not conform with the flight path of an aircraft under full control. It suggests, indeed, that the aircraft was aerodynamically stalled when it hit the ocean surface.

The BEA determined, from the loading of the aircraft on takeoff and the estimated fuel burn over the flight profile, that the aircraft had an estimated weight of about 205 tonnes and CG between 37.3% and 37.8% MAC at around the time of disappearance. The half-percentage variation in CG estimate comes from the fact that fuel is pumped around between fuel tanks at cruise, to optimise the lift-to-drag ratio of the aircraft, and there is a limit of 0.5% MAC on the CG shift allowed to occur through pumping. There has been some speculation on the Internet about the margins between stall speed and limiting Mach number at FL350 and weight of 205 tonnes. The margin is some 80-100 kts; this is large enough to allow the pilots considerable leeway in dealing with any in-flight abnormalities, such as having to fly the airplane on “pitch and power” when airspeed indications are unreliable. However, severe or extreme turbulence could make dealing with abnormalities such as unreliable airspeed a very tricky control situation indeed, at any moderately high flight level. It is plausible that an upset could thereby have occurred. The BEA report is factual and does not speculate on this.

There was considerable convective activity in the ITCZ at the time of AF447’s passage. The weather, though, was pretty typical for the time of year, and had no unusual features from the point of view of meteorologists. There was a convective mass extending about 400km E-W, which the route of flight of AF447 crossed. This convective mass had formed at about 0130Z by the fusion of four powerful storm masses, deriving from convective columns (“towers” in French), which had reached their limit and spread out horizontally as their tops reached the tropopause. The strongest of these had attained its most powerful stage many hours before. At 0200Z, the cumulonimbus clouds forming the mass had for the most part attained their mature stage. Although there may have been new columns forming between the mature columns underneath the top of the spread-out mass, there is no evidence for that in the form of a later “overshoot” into the stratosphere, which happens in the case of the most powerful columns. The temperature at the tops of the mass was by and large similar to that of the tropopause, around -80°C, as recorded by satellite 7 minutes before and after the presumed passage of AF447. The tropopause was estimated by the climate model ARPEGE to be at around FL520 at the date and time of the disappearance of the aircraft. Another aircraft participating in real-time weather data collection via AMDAR passed along the route half an hour later at FL325, then climbing to FL350, and did not record anything unusual, confirming largely what one may infer from the satellite images.

The BEA says it is “very likely” that some of the cloud mass contained significant turbulence at FL 350. Electrical activity was also “possible” at this FL. But, crucially for those wondering whether the pitots iced up because the aircraft may have flown into heavy supercooled-rain clouds, the presence of supercooled water was said to be “not very likely” and would necessarily have been limited to very small quantities. I consider the developments with possible pitot icing in a separate article.

The last known position of AF447 was transmitted automatically over ACARS at 0210Z. This position was N2°58.800′W30°35.400′, or N2.98°W30.59° in decimal degrees. The position transmitted was that contained in the “Flight Management” data, which is partly based on the inertial reference system. It could be, said the BEA, that the GPS position differed slightly from this.
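The conversion from degrees and decimal minutes to decimal degrees is simple arithmetic: divide the minutes by 60 and add them to the degrees. A minimal sketch, with a function name of my own invention, reproduces the figures just given:

```python
def degrees_minutes_to_decimal(degrees: int, minutes: float) -> float:
    """Convert degrees and decimal minutes to (unsigned) decimal degrees."""
    return degrees + minutes / 60.0


# Last reported position: N2°58.800' W30°35.400'
latitude_north = degrees_minutes_to_decimal(2, 58.800)
longitude_west = degrees_minutes_to_decimal(30, 35.400)

print(f"N{latitude_north:.2f}° W{longitude_west:.2f}°")  # N2.98° W30.59°
```

The hemisphere letters are kept separate here; a signed convention (west and south negative) would work equally well.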

This position puts the flight in or close to the column of what had been the most powerful of the fused storms, whose column had attained its most powerful stage some many hours before and was at the time in its mature stage. The position is between ORARO and TASIL waypoints and looks to be slightly off the airway.

The last verbal contact with AF447 was by the controller of FIR ATLANTICO, in Brazil, at 0135:43Z. The controller then asked AF447 four times for his estimate at TASIL, without response. There were apparently three attempts at an ADS-C connection with DAKAR, at 0133Z, 0135Z and 0201Z. These failed with code FAK4, indicating either the absence of a flight plan, or a significant discrepancy between flight number, reported position, and planned position. Section 1.9.2 says that at 0146 the DAKAR controller asked for information about AF447 because there was no flight plan. ATLANTICO gave type (A332), airport of origin, destination airport, and SELCAL sign. DAKAR created and activated a flight plan, but there was no connection with the aircraft either on voice or ADS-C. So the first two ADS-C attempts were rejected because of, we may presume, lack of a flight plan with DAKAR at those times. The report does not determine whether the flight plan at DAKAR was activated before or after the last ADS-C connection attempt at 0201Z. Although the transcript of the exchange between ATLANTICO and DAKAR at 0135Z is included in the appendices, the later exchange is not.

As I mentioned in my note of 11 June, the order of the ACARS messages received does not necessarily reflect their order of occurrence. The reasons why are largely the reasons I gave there, with one addition. Fault messages received by the CMC are cached but not sent for a minute, to accumulate and summarise in one ACARS transmission other messages associated with that fault from other avionics devices. These associated messages are indicated by including the reporting device in the fault message compiled by the CMC (using a * for associated messages of type 2, which are not reported to crew because they have no “operational consequences”). There is prioritisation within the CMC, as well as possible race conditions from various BITE devices to the CMC, as well as prioritisation of transmission: the report explains how ACARS messages are prioritised by class. And, of course, possible delays in the transmission and processing of messages through the ACARS transmission system itself.

The interpretation of the messages is, as the BEA says, “delicate”. This is not just because of the indeterminacy of order, but also because, while a fault may be recorded, a subsequent return to normal is not reported; certain alarms such as overspeed are not registered; and although all faults (type 1) are accompanied by a cockpit effect (type 2), not all faults have their cockpit effect registered, and not all cockpit effects have the associated fault registered.

Of the type 2 effects, the BEA says it has not succeeded in explaining the meaning of the cockpit effect NAV TCAS FAULT (cockpit effect is a flag on the PFD and ND) but has explained the significance of the others.
There are five type 1 fault messages, of which the significance of two is unexplained:

the ADIRU2 fault (IR2), associated with messages from EFCS1, IR1 and IR3. The involvement of EFCS1 is a type 2 message, and it is suggested that the correlation window may have been opened by this message;
the FMGEC1 message that was the last received before the cabin pressure warning.

The BEA concludes that the type 1 and 2 messages taken together show that there had been unreliable airspeed measurements and their consequences.

That is it. Not a whole lot more than we knew in mid-June, but some of it more firmly established, especially the interpretation of the weather and the integrity of the airframe.



Avoiding Disaster on Takeoff

24 07 2009

It happened again! On 13 December 2008, a Boeing 767-39H suffered a tailstrike on takeoff at Manchester Airport. A tailstrike can occur on takeoff when the pilots pitch the nose of the aircraft too high in the air before it has lifted off the ground. This can occur when the aircraft is “rotated”, that is, the nose pitched up, to fly off the ground – and doesn’t fly off, so the pilot pitches the nose higher in order to get it to do so. The tailstrike is the symptom of a very dangerous phenomenon, as follows.

Why wouldn’t the airplane fly off? Well, before flight, various computers and software calculate the speed at which rotation should occur, known in aviation-speak as Vr, from, amongst other things, the total weight of the aircraft at take-off (TOW). If the TOW value is too low, then the calculated Vr will be too low and the aircraft will not fly off at Vr. When the aircraft is rotated, the aerodynamic drag also increases, so it accelerates more slowly. Not only that, but the TOW is used in calculating the thrust setting of the engines for take-off, which will also be correspondingly lower, so the airplane will have accelerated more slowly to get to the too-low Vr in the first place. So it’s triply bad: you took too long and too much runway to get to your too-low Vr and the act of rotation hinders you even further from getting to the true Vr at which the aircraft will fly off the runway. It is a very dangerous situation and accidents have happened. The crucial observation is that the TOW value is calculated from data delivered and typed into computers by humans, and humans can make inadvertent mistakes.
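To get a feel for the size of the effect, here is a rough sketch. It is not how any real performance software works. It assumes only the textbook lift relation, under which the speed needed to generate lift equal to weight scales with the square root of the weight, with configuration, air density and lift coefficient held fixed. The weights and the "true" rotation speed are invented for illustration and do not come from any of the incidents discussed.

```python
import math


def speed_scaled_by_weight(reference_speed_kt: float,
                           reference_weight_t: float,
                           other_weight_t: float) -> float:
    """Scale a speed by sqrt(weight ratio), everything else held equal."""
    return reference_speed_kt * math.sqrt(other_weight_t / reference_weight_t)


# Hypothetical figures, chosen only to illustrate the scaling.
true_tow_t = 350.0      # actual take-off weight, tonnes
entered_tow_t = 250.0   # weight mistakenly entered, tonnes
true_vr_kt = 160.0      # rotation speed appropriate to the actual weight

computed_vr_kt = speed_scaled_by_weight(true_vr_kt, true_tow_t, entered_tow_t)
shortfall_kt = true_vr_kt - computed_vr_kt

print(f"Vr computed from the wrong weight: {computed_vr_kt:.1f} kt")
print(f"Shortfall against the true Vr:     {shortfall_kt:.1f} kt")
```

With these made-up numbers the computed Vr falls roughly 25 kt short. The sketch leaves out the second and third parts of the problem described above, namely the reduced thrust setting and the extra drag after rotation, both of which make matters worse.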

This incident was not the latest, merely the most recent I have found out about. In March 2009 a similar incident with an Emirates A340 in Melbourne, with almost three hundred people on board, came very close to being the worst accident Australia has seen. They got off the ground relatively safely, after the runway end, having taken some of the runway-end equipment and part of a shed with them, and returned to land safely. No one was injured.

Other notable occurrences: in June 2002 it happened with an Air Canada Boeing 767 in Frankfurt; in March 2003 with a Singapore Airlines B747 in Auckland; in October 2004 with an MK Airlines B747 freighter in Halifax, Nova Scotia, in which the aircraft crashed off the end of the runway and the few people on board died.

All because a too-low TOW was given to the devices which calculate Vr and take-off thrust. So shouldn’t the pilots carefully check the numbers used to calculate TOW before they enter them? Well, of course! Do they do so? Most certainly: all are aware of the dangers. And there are formal procedures, part of Standard Operating Procedures (SOPs), to help them do so.

So are the crews that do this sloppy? Badly trained? Incompetent? Should they be fired? Commentary on the professional pilots’ internet forum PPRuNe on the Melbourne accident has recently increased since the pilot in command (PIC) was interviewed by an Australian newspaper. One should always be careful when drawing conclusions from such forums because PPRuNe is anonymous, there are a lot of poseurs, a fair number of non-poseurs who are not pilots but interested, and I imagine even pilots who express views other than those which they really hold. Those caveats noted, the most frequently expressed opinion castigates the PIC for dereliction of duty.

Is this fair? After all, there are SOPs which if followed accurately are supposed to ensure correct TOW, and he knows the dangers. And isn’t he playing roulette with his passengers’ lives?

No, this is not fair. There are a number of reasons why not. First, as all major line pilots know, and NASA has recently documented in detail, the amount of distraction in an airline cockpit from outside sources during pre-flight preparations can be enormous. And distraction leads to error, SOPs or no. Second, as my colleague Bernd Sieker and I have recently found out through analysis, typical SOPs, considered as algorithms for getting the right V1, Vr and thrust setting are not very robust, considered by the standards applied to safety-critical computer programs. And an SOP is after all a sort of program, a human+computer program in this case. Third, this can happen to anyone. A pilot colleague who knows the Air Canada incident crew, and who was extensively involved in setting up his airline’s Flight Operations Quality Assurance (FOQA) program, to which things like flubbing TOW entry centrally belong, tells me the crew are “among [the] finest, most competent supervisory captains, highly respected leaders within the airline”. And finally, someone who accuses crew of playing roulette with their passengers’ lives forgets that they are equally putting their own lives at risk. Except for those very, very rare cases of murder-suicide (I think there have been only three in the quarter century in which I have been interested), this is always an off-base accusation. Everyone is sitting in the same fuselage.

Fair or unfair, though, is not the most pressing issue. The most pressing issue is how we stop this happening again, maybe with 300 people dead rather than just bent metal. Compare: we have had five incidents in seven years, with 7 fatalities and very nearly a few hundred more. What is going to happen in the next fifteen years if we carry on as we have been?

There are three solutions. One is: better training, more attention to SOPs. Second is: internal aircraft weighing systems, whereby the aircraft can assess its own weight on the ground. Third is: more robust data-entry procedures for TOW calculations.

Consider training. Remember, “it can happen to anyone”. OK, more attention might be paid to this specific task, but this suggestion does not solve the underlying problem of human reliability under distraction.

Consider internal weighing systems. They do exist, but airlines choose not to pay for them. Why not? First, they cost lots of money. Second, I understand and can well believe that reliability, and ensuing maintenance costs, is an issue. Most such systems measure the compression of the oleo struts on the landing gear. Theoretically, it’s a great idea, but in practice it does bring data-integrity problems with it; and there is also the question of what one does when components fail: it is unlikely that the aircraft would thereby be grounded, so one would fall back on the human calculation anyway. That then comes back to the third solution: ensuring the human procedures are robust.

Besides all that, how much would it cost to retrofit the entire fleet of commercial aircraft with internal weighing systems? How likely do we think it is that this will happen? Much of the current fleet is still going to be flying in fifteen years, and what do we think are the chances that someone will buy the farm in a big way, in this way, in that time?

Consider designing more robust procedures. Which I shall now do in a little more detail, because I propose that they are the most practical prophylactic measure. Bernd Sieker and I have written a paper on this, which we have submitted for publication in the technical literature.

Let me focus on getting the right TOW where it should be in the Flight Management Computer (FMC). We call this business an “engineered multi-agent cooperative” function, or EMC function. Let me not worry here about why we call it that. This function is executed by performing various human, automated, and human+automated subsidiary procedures, including data transmission, data exchange (through humans writing things down and typing them in, and through automatic means), calculation (both human and automated), and verification (checking that intermediate numbers have more or less correct values). Hence we can analyse it in exactly the same way we analyse multiprocessor computer programs, which also consist of combinations of sequences of small but precise actions.

First I observe that, writing down the sample SOPs in a form more similar to how one would write a program (dotting the “i”’s and crossing the “t”’s), the SOPs turned out to be rather more complicated than they appear at first sight. This immediately leads one to suspect that intuition (what Sieker and I call the Cognitive Model, CM) may not necessarily be a good guide to reality (the actual workings of the SOPs, which we call the Procedural Model, PM). There is a third model involved, namely the description of what the function does, “getting the right TOW where it should be in the FMC”. When we make this more precise, this is what we call the Requirement Model (RM).

The goal of demonstrating that the SOPs are adequate to achieve the function is expressed in technical terms by saying that the PM implements the RM; alternatively, that the PM refines the RM. It doesn’t do so as is, of course, without further assumptions, such as for example that the humans in the process aren’t deliberately trying to sabotage the function, or – one assumption which may particularly concern us when thinking of this sequence of accidents – that they don’t make random transcription errors and then read through those errors when cross-checking. Because humans are involved, and will stay involved, in the EMC function, and because the humans involved (the pilots) must supervise the process (amongst other things, they carry legal responsibility), we impose the additional constraints that the PM must implement the CM, and the CM must implement the RM.

There is a branch of computer science which arose about forty years ago which is concerned with checking whether programs achieve their goals. It is called formal verification. Techniques of formal verification can identify exactly where assumptions are needed in order to show that the PM implements the RM. Indeed, one can break down the PM into various subsidiary actions, like so-called procedures in a computer program. One can then consider the PM as composed of these subsidiary actions, and separately each of the subsidiary actions, thus breaking down the task of considering the whole into a number of distinct parts. One considers, for each subsidiary action, the preconditions for it to be started: how the world has to look when you start the action; and at the end of it the postcondition: what the action has accomplished when it finishes. Then you can chain them together by ensuring that the postcondition of one action ensures that the preconditions of the following action hold. This means of approaching the verification of programs is known as Floyd/Hoare logic after its discoverers forty years ago. Floyd/Hoare logic has a venerable and distinguished history in reliable computing. Like calculus, it is a standard technique which is unlikely ever to go out of favor.
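To make the chaining idea concrete, here is a minimal sketch in Python. It checks preconditions and postconditions at run time, which is a much weaker thing than proving them with Floyd/Hoare logic, but it shows the shape of the argument. The two actions, the state keys (zfw, fuel, tow, fmc_tow), the simplification that TOW is zero-fuel weight plus fuel, and the weights are all assumptions invented for illustration; they are not taken from the SOPs we analysed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, float]


@dataclass
class Action:
    """One subsidiary action with its precondition and postcondition."""
    name: str
    pre: Callable[[State], bool]
    run: Callable[[State], State]
    post: Callable[[State], bool]


def run_chain(actions: List[Action], state: State) -> State:
    """Run the actions in order, checking each contract as we go."""
    for action in actions:
        if not action.pre(state):
            raise AssertionError(f"precondition of {action.name} does not hold")
        state = action.run(dict(state))
        if not action.post(state):
            raise AssertionError(f"postcondition of {action.name} does not hold")
    return state


# Hypothetical two-step procedure; the weights (in tonnes) are made up.
compute_tow = Action(
    name="compute_tow",
    pre=lambda s: s.get("zfw", 0) > 0 and s.get("fuel", 0) > 0,
    run=lambda s: {**s, "tow": s["zfw"] + s["fuel"]},
    post=lambda s: s["tow"] == s["zfw"] + s["fuel"],
)
enter_in_fmc = Action(
    name="enter_in_fmc",
    pre=lambda s: "tow" in s,  # established by compute_tow's postcondition
    run=lambda s: {**s, "fmc_tow": s["tow"]},
    post=lambda s: s["fmc_tow"] == s["tow"],
)

final_state = run_chain([compute_tow, enter_in_fmc], {"zfw": 170.0, "fuel": 60.0})
print(final_state["fmc_tow"])  # 230.0
```

Running enter_in_fmc on its own, without compute_tow before it, fails on its precondition. That is the run-time analogue of a gap in the chain of postconditions and preconditions.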

Our paper points to certain obvious problems in the pre-flight TOW data entry. For example, we did some information-flow analysis. Information-flow analysis is a more recent technique, devised by Bernard Carré a quarter-century ago, and used by such ultrareliable-program development systems as Praxis’s SPARK (whose tool also uses Floyd/Hoare logic). We didn’t need to go too deep into our analysis to find some issues which need addressing.

The SOPs (and thereby the CM) from which we worked only identified two values of parameters used to calculate TOW: those in the preliminary load manifest and those in the final load manifest. So the CM has two sets of values: call them preliminary and final.

However, looking in detail at the SOPs and analysing what would happen if some human were to put in random values at various places, we could see that there were at least five quantities floating around the procedures which were all supposed to be the same as either preliminary or final manifest figures, but there was nothing in place to ensure (or “coerce” as computer scientists prefer to say) that these values actually were the target values. In jargon, there were five independent value sets in the PM which must be coerced into at most two sets in the CM. This is known as a “data integrity” issue, and techniques to solve it abound in the technical literature on fault-tolerance. One can address distractions during human tasks (by invoking techniques such as “rollback”) and “finger trouble” (by including independent “sanity checks” at strategic points). And the Floyd/Hoare logic would then tell one whether one has resolved the integrity issue or not.
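A small Python sketch of such a sanity check may help. Each working copy of a weight is tagged with the manifest it is supposed to equal, and the check reports every copy that has drifted from its source. The five copy names and all the figures (in kilograms) are hypothetical; I have invented them to mirror the "five quantities, two sources" situation, not copied them from any real SOP.

```python
from typing import Dict, List, Tuple


def integrity_violations(
    manifests: Dict[str, int],
    copies: Dict[str, Tuple[str, int]],
) -> List[str]:
    """Return the names of working copies that differ from their declared source."""
    return [
        name
        for name, (source, value) in copies.items()
        if value != manifests[source]
    ]


# Two authoritative sources (hypothetical weights in kg).
manifests = {"preliminary": 212300, "final": 215100}

# Five values floating around the procedure, each tagged with its intended source.
copies = {
    "dispatcher_readout": ("preliminary", 212300),
    "captain_note": ("final", 215100),
    "first_officer_note": ("final", 215100),
    "performance_tool_entry": ("final", 251100),  # digits transposed on entry
    "fmc_entry": ("final", 215100),
}

print(integrity_violations(manifests, copies))  # ['performance_tool_entry']
```

The point of the design is that every value in the procedure names the source it must agree with, so a transposed digit is caught by comparison with that source rather than by someone noticing that the number looks odd.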

Sounds simple, doesn’t it? We note that for modern complex computer programs with many thousands to millions of lines of code it is not as simple as it looks. But we believe that for EMC functions typical of SOPs it is relatively simple. Many SOPs exhibit a complexity only as great as the kinds of examples one finds in textbooks and tutorials on Floyd/Hoare techniques.

There are lots of organisations with expertise in these methods, often with their own highly-developed SW toolsets. SRI Computer Science Lab in California and Praxis High-Integrity Systems in the UK are two of the most well-known. And of course the tech-transfer firm Causalis Limited associated with my research group. My point here is not to advertise, but rather to persuade readers that we are advocating an approach to avoiding tailstrike problems that is, for the scale of the application, mature in engineering terms, while being the least expensive of the available options for addressing the issue, and a lot less expensive than the likely consequences of continuing with the present strategy!



Software Engineering Ethics – The Sequel

22 07 2009

Further to the Gotterbarn/Miller study of software engineering ethics in the June 2009 edition of IEEE Computer, and my letter to the editors which I published here on 27 June, Professors Gotterbarn and Miller have replied to my letter. Both letter and reply will appear in the August 2009 edition of IEEE Computer.

Professors Gotterbarn and Miller write:

[begin G&M citation]

Our description of the Qantas accident was overly simplistic. Dr. Ladkin and other experts we have consulted agree that a problem in the Flight Control Primary Computer (FCPC) seems to be involved, in conjunction with anomalous “spiking” in one of three Air Data Inertial Reference Units (ADIRUs). There were reports of ADIRUs spiking in different airplanes earlier, but without the diving behavior of the Qantas incident.

The issue of data integrity is complex in avionics systems. These systems include multiple techniques to deal with possible false data readings, including the possibility of human pilot overrides and algorithms that attempt to distinguish between anomalies from errant sensors and actual emergency situations. Through a complicated series of events, at least one of these algorithms yielded a “false positive” identification of a dangerous stall-inducing climb that was “corrected” when the FCPC ordered a steep dive. This occurred twice during the Qantas flight in question. Interested readers can read the interim report from the Australian Transport Safety Board. At this writing, the Board has not issued its final report.

We contend that complex system interactions like this create ethical as well as technical challenges for all involved. This case, no matter how badly Dr. Ladkin thinks we described it, deserves further study and public discussion. Even when bugs are obscure, life-critical software decisions are ethically charged for software engineers and for the people their software affects. We hope that larger theme is clear in the article.

[end G&M citation]

I find this a very reasonable response to the issues which came up. I agree that complex system interactions such as these pose ethical challenges, and am glad that my colleagues’ reply goes into more detail on the incident that caught my eye (and my pen). I also agree that this case, and not just this case, deserves further study and public discussion. But I think I disagree with what I take to be the implication that such “life-critical decisions” to which they refer were taken in this case by software engineers. I suspect, rather, that the decisions were taken by the avionics and aeronautical engineers who designed the kit and either did not anticipate the anomalies that manifested themselves or misjudged their significance. I would not expect that software engineers had had access to the relevant information to enable them to contribute much to those decisions.

More broadly, it is not clear to me what moral lesson we could draw from this lack of anticipation, for it is not clear that current hazard-analysis methods enable one to anticipate all such anomalies – indeed, it is rather clear that they don’t, which is amongst other things why I and my RVS and Causalis colleagues work in this area.



An Ethical Statement on Incidents

27 06 2009

Donald Gotterbarn and Keith W. Miller wrote on a Software Engineering Code of Ethics in the June 2009 edition of IEEE Computer magazine. They illustrate the application of their principles with some case studies, including Case Study 2: Who Is In Control?

They consider first the October 2008 Qantas accident, concerning which an interim factual report is available from the ATSB. Gotterbarn and Miller say

[begin quote]

“ The software on this Airbus 330-303 implemented a decision to give instant control to the plane’s flight control system when the autopilot shut off because of computer system failures. The resulting nosedive suggests that this decision was not in the best interest of the public, especially members of the public in or below this airplane.

There are good reasons to have the flight control system protect the jet from dangerous conditions. But this incident illustrates that the decision to turn over control to the flight control system should take into account the current state of inputs into that system. The flight control system should have been more sensitive to the quality of its inputs and to the possibility of disastrous consequences for instantly reacting to apparent conditions that were based on erroneous inputs.”

[end quote]

This is philosophy without understanding. I am not even sure the authors know what a flight control system is.

They then go on to consider the infamous incident in Russia, during which a pilot let his kids into the cockpit and gave them a hand in flying the airplane. There was an upset and the airplane crashed. Gotterbarn and Miller say

[begin quote]

“After such a disaster, we would expect the developers of subsequent Airbus autopilot software to be particularly sensitive to issues of control transfer between pilots and autopilots.

In the Aeroflot crash, much of the publicity focused on the judgement of the pilot in inviting his children into the cockpit. While that appears to have been a contributing factor in the tragedy, the autopilot design was at least as significant.”

[end quote]

What an extraordinary comment! I think no more needs to be said about that than what I say, below, in my letter to the Editor-in-Chief of IEEE Computer:

[begin letter]

Dear Professor Carver,

Professors Gotterbarn and Miller (The Public is the Priority, IEEE Computer, June 2009:66-73) omit one important ethical principle favored by those of us who analyse incidents: refrain from making imposing public statements on technical matters about which you know little.

The authors illustrate well the reasons for this principle through their Case Study 2. They introduce the 2008 Qantas accident and suggest that a

“…decision to give …. control to the … flight control system when the autopilot shut off because of … system failures….. was not in the best interest of the public.”

Autoflight systems have been doing exactly this since they were invented over half a century ago, and no pilot or engineer I know would have it otherwise.

The ATSB preliminary analysis hints rather at an obscure bug with the Flight Control Primary Computer, as well as a yet-undiagnosed fault in one of the air data subsystems. Let us hope that our colleagues at the companies concerned are able to discover what and how and devise remedies.

One can indeed hold moral views arising from this and other incidents, such as that critical software and interfaces need to be rigorously proven free from every possible source of error, but most software engineers would agree that best practice is still some way from that ideal, and back when flight control systems were cables and pulleys we were not close to it either.

Concerning the Aeroflot upset, I feel strongly that children should not be placed at the controls of commercial passenger jets in flight, and that it is silly to suggest that the system design should accommodate such an eventuality.

Sincerely,

Peter Bernard Ladkin

[end letter]



AF 447 ACARS: A Mistake with a Life of its Own

14 06 2009

Here is yet another indication of how things can get a life of their own:-

Soon after the France 2 program showing the ACARS transcript messages on 4 June, someone on the pilots’ forum PPRuNe typed them up, and posted them to imageshack. Now they have apparently made it onto eurocockpit.com. The New York Times’s Matt Wald, a reliable commentator, commented on the ACARS messages yesterday, June 13, and there is a graphic on the NYT WWW site explaining the messages.

Wald said “its authenticity has been confirmed by industry officials”.

Except there is a typographical error, since corrected by the original transcriber. The ISIS message was a 3422 message, according to the original transcript (DG and Ind), but it was shown as a 3412 (OAT and Ind./Sensor) code message on the original image, and it is so shown now on the NYT WWW site.

This can be clearly seen in the screen grabs from the TV program from Danny Fyne, the PPRuNe originator:
http://www.pprune.org/rumours-news/376433-af447-2.html#post4975127
and in the higher-resolution screen grabs from contributor Machaca
http://www.pprune.org/rumours-news/376433-af447-2.html#post4975217

The list was typed up from the screen grabs by contributor selfin:
http://www.pprune.org/rumours-news/376433-af447-3.html#post4975386
The original version contained the erroneous 3412 transcription for 3422, and has since been edited and corrected by selfin, as noted on the post itself. Here is the message, from contributor Captain-Crunch, in which the typo was first noted (with the *original* images to show it):
http://www.pprune.org/rumours-news/376433-af447-4.html#post4975726



AF 447 ACARS Messages: Reading Tea Leaves

11 06 2009

A list of the 24 ACARS messages listed by Air France that were sent from AF 447 between 0210Z and 0214Z on 1 June, 2009, the last information received from the aircraft, was shown on the France 2 TV channel on Thursday June 4. This list, in which incomplete information was shown, was typed up and distributed on the Internet (one must beware of typographic errors in the various versions which I have seen). Thus people started to interpret the messages and inquire about their significance.

I take it that people know what “reading tea leaves” means? Fortune tellers would look at the pattern of leaves left in the cup after the tea had been drunk, and wonder what it says about the future. Similarly, people (including myself, here) have been looking at the (partial) ACARS messages shown on the TV, and have been wondering what they say about the past. I adduce the comparison to propose a healthy dose of scepticism about what one can validly conclude from the currently publicly-available information.

The messages were listed in the following order (omitting messages which consist of maintenance warnings). The four-digit numbers are the Joint Aircraft System/Component (JASC) code, which I interpret from the FAA JASC Table and Definitions Document from February 11, 2002, which is on-line.

* at 3.5 hours before the main events, a 3831 event. Something concerning waste disposal (38 is water and waste, and 3830 is the waste disposal system)

* at 0210, a 2210 event: AP off (22 is Auto Flight and 2210 is the Autopilot system)

* at 0210, a 2262 event (22 is Auto Flight; I have no code 2260)

* at 0210, a 2791 event, flight control switch to alternate law (27 is flight controls; I have no code 2790 or 2791)

* at 0210, two 2283 events, flags raised on CAP and FO Primary Flight Displays (PFD) (22 is Auto Flight, I have no code 2283)

* at 0210, a 2230 event, autothrust off (2230 is the auto throttle system)

* at 0210, a 3443 event, a TCAS problem (34 is navigation; 3443 is the Doppler system. The Doppler system here is used to measure relative motion of another body, in this case another aircraft, for TCAS).

* at 0210, two more 2283 PFD flags

* at 0210, a 2723 rudder travel limiter fault (27 is flight controls, 2720 is the rudder control system). At higher airspeeds, the rudder travel is limited by the Rudder Travel Limiter; far less movement is allowed than at lower airspeeds.

* at 0210, a 3411 event with EFCS 2, reported by EFCS1 (3411 is the pitot/static system. I understand that on these airplanes, the system is divided into the pitot subsystem and the static subsystem).

* at 0210, a 2793 event involving EFCS 1. (27 is flight controls. I understand from colleagues that, on the A330, 2793 is the Flight Control Primary Computer, FCPC, also designated PRIM)

* at 0211, a couple more 2283 PFD flags

* at 0212, a 3410 event. A disagreement between the air data units, the AD part of the ADIRU (34 is navigation; 3410 is flight environment data). An “ADR disagree” can only occur when one of the three ADIRUs has already been designated as faulty by the FCPC, and the two remaining ADIRUs yield discrepant readings (this information from the Aircraft Operating Manual of the A330)

* at 0212, a 3422 event in the standby flight instruments (ISIS) (34 is navigation, 3422 is directional gyro and indicators)

* at 0212, a 3412 event involving IR2, the inertial reference part of ADIRU2 (34 is navigation; 3412 is the outside air temperature sensor and indicator). Reported by IR1 and IR3 and EFCS1.

* at 0213, two 2790 (EFCS) events, FCPC 1 and Secondary FCC (FCSC) 1 faults (27 is flight control; I don’t have the 2790 designator)

* at 0213, a 2283 event, reported by FMGKC1 (22 is autoflight, I understand from colleagues that 2283 is the Flight Management and Guidance Computer, FMGC)

* at 0214 a 2131 event (21 is the air conditioning, 2131 is the cabin pressure controller).
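For anyone who wants to repeat the decoding, the lookups I used above can be written as a small Python table. It contains only the chapter and system entries quoted in the list, not the full FAA JASC table, and the fallback from a code such as 2723 to its group code 2720 is my own reading convention, an assumption rather than anything the JASC document prescribes.

```python
from typing import Optional, Tuple

# Two-digit chapters, as quoted in the list above.
CHAPTERS = {
    "21": "air conditioning",
    "22": "auto flight",
    "27": "flight controls",
    "34": "navigation",
    "38": "water and waste",
}

# Four-digit entries, again only those quoted above.
SYSTEMS = {
    "2131": "cabin pressure controller",
    "2210": "autopilot system",
    "2230": "auto throttle system",
    "2720": "rudder control system",
    "3410": "flight environment data",
    "3411": "pitot/static system",
    "3412": "outside air temperature sensor and indicator",
    "3422": "directional gyro and indicators",
    "3443": "Doppler system",
    "3830": "waste disposal system",
}


def describe(code: str) -> Tuple[str, Optional[str]]:
    """Return (chapter, system); system is None if the table has no entry."""
    chapter = CHAPTERS.get(code[:2], "unknown chapter")
    # Try the exact code first, then the group code ending in 0 (an assumption).
    system = SYSTEMS.get(code) or SYSTEMS.get(code[:3] + "0")
    return chapter, system


print(describe("3422"))  # ('navigation', 'directional gyro and indicators')
print(describe("2723"))  # ('flight controls', 'rudder control system')
print(describe("2283"))  # ('auto flight', None)
```

The None results are the interesting ones: they mark exactly the codes (2262, 2283, 2790, 2791, 2793) for which I am relying on colleagues or on the message text rather than on the table.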

What about the ordering of these messages? First of all, they are time-stamped by the minute, so that orders them into five groups (the 0210 messages, respectively 0211, 0212, 0213, 0214). What about a finer ordering? That is going to be much harder. We don’t know whether this listed order is the order in which the messages were received (but Air France can probably tell us that). We don’t know whether the order in which the messages were received was the order in which they were transmitted (but maybe there is something in the code that can tell us that). We don’t know whether the order in which they were transmitted is the order in which they were generated (maybe Airbus can say something about that, but there might also be some indeterminacy). And, finally, we don’t know whether the order in which they were generated is the order in which the events occurred (that may be hard even for the manufacturer to say, because the rates at which values are sampled are very different, depending on the system).
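One can put a rough number on how little the time stamps tell us. The Python sketch below takes a subset of seven of the messages listed above (a subset chosen by me for brevity, not the full list) and counts how many orderings are consistent with the minute stamps alone. Within a minute any permutation is possible, so the count is the product of the factorials of the group sizes.

```python
from collections import Counter
from math import factorial, prod
from typing import List, Tuple

Message = Tuple[str, str]  # (time stamp to the minute, JASC code)

# A subset of the listed messages, chosen for illustration only.
messages: List[Message] = [
    ("0210", "2210"),
    ("0210", "2230"),
    ("0210", "3411"),
    ("0212", "3410"),
    ("0212", "3422"),
    ("0213", "2790"),
    ("0214", "2131"),
]


def order_is_determined(a: Message, b: Message) -> bool:
    """The stamps order two messages only if their minutes differ."""
    return a[0] != b[0]


def candidate_orderings(msgs: List[Message]) -> int:
    """Number of total orders consistent with the minute stamps alone."""
    group_sizes = Counter(minute for minute, _ in msgs).values()
    return prod(factorial(n) for n in group_sizes)


print(order_is_determined(messages[0], messages[3]))  # True: 0210 vs 0212
print(order_is_determined(messages[0], messages[1]))  # False: both 0210
print(candidate_orderings(messages))                  # 12
```

Even for these seven messages there are twelve candidate orderings, and the number grows factorially with the size of each one-minute group.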

For the purposes of a speculative interpretation, let me assume here that the events occurred in the order listed above. I do caution that this is quite a significant, and not necessarily correct, assumption. Let me further assume that the messages are veridical. For example, that the “ADR disagree” message really does indicate that the FCPC has ignored air data input from one ADIRU and is judging that the air data input from the other two are not consistent with each other. How significant this assumption is depends on whether one is a sceptic or an optimist about the reliability of these highly complex programmable-electronic systems and one’s trust in their design.
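The condition I am assuming for “ADR disagree” can be written down as a toy check in Python. This is emphatically not the FCPC's algorithm, which I have not seen; it only encodes the rule as stated in the Aircraft Operating Manual, namely that one unit has already been rejected and the two remaining ones give discrepant readings. The unit names, the airspeed values in knots and the tolerance are all invented.

```python
from typing import Dict, Set


def adr_disagree(
    readings: Dict[str, float],
    rejected: Set[str],
    tolerance: float,
) -> bool:
    """True only if exactly one unit is rejected and the other two differ."""
    remaining = [value for unit, value in readings.items() if unit not in rejected]
    if len(rejected) != 1 or len(remaining) != 2:
        return False
    return abs(remaining[0] - remaining[1]) > tolerance


# Hypothetical airspeed readings in knots.
readings = {"ADR1": 272.0, "ADR2": 180.0, "ADR3": 265.0}

print(adr_disagree(readings, rejected={"ADR2"}, tolerance=5.0))  # True
print(adr_disagree(readings, rejected=set(), tolerance=5.0))     # False
```

The second call shows why the message carries more information than it first appears to: under this rule it cannot be raised at all unless a unit has been voted out beforehand.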

So here goes. The AP went off and flight control went to alternate law. Flags pop up. Autothrust disconnects, something with TCAS and then two more flags. Rudder travel limiter has a problem and then something with the pitot-static system that the EFCS’s have problems with. Sometime over a minute later we are told that the air data from one ADIRU has been designated unreliable by the FCPC and the air data from the other two disagree. Then the laser ring gyro in the ISIS complains, as do the primary and secondary flight computers (these systems are duplicated: it is the number 1 units of each that are complaining), something happens with the FMGC, and then there is a cabin pressure warning.

Why might AP go off and flight control go to alternate law? One possibility is that (1) you’re being severely shaken around; another is that (2) for some reason the AP couldn’t maintain altitude; a third is that (3) there was a system problem. Then the autothrust (AT) goes off. That would happen if, for example, the auto flight systems cannot maintain stable air speed (AS) and altitude. I don’t know what the TCAS notification would signify. Then there is a rudder travel limiter fault. That device has AS as input, so maybe there is an issue with AS sensing. Then EFCS1 thinks EFCS2 has problems with pitot-static sensing. The pitot system colludes with the static system to measure AS, and the static system is also used to measure altitude. Then EFCS 1 complains about FCPC (I take it that would be FCPC 1, also known as PRIM 1). Then the two remaining air data units disagree and can’t be reconciled (we don’t know when the first of the three was voted out by the FCPC 1). At a similar time, the DG in the stand-by flight instrument system complains. At a similar time, the inertial reference part of ADIRU 2 is faulted by the other two. Then unspecified faults with FCPC1 and FCSC 1, but it’s not clear which system component is reporting those faults. Then another flight control issue, and finally the cabin pressure controller squeaks.

There are some patterns here. One pattern is there is a lot of stuff involved with AS and altitude, and at least one with the outside-air-temperature sensors. The commonality here is the pitot and static systems and their interaction. Then later comes the DG in ISIS, followed by IR2 being voted out and then FCPC and FCSC faults and cabin pressure.

What could be up with the P-S systems? One possibility is that they are getting all iced up. That would be why AP and AT think they can’t maintain altitude. That might also explain the outside-air-temperature probe complaint, if it were being iced also. But manufacturers and regulators know about ice; it must have been extraordinarily severe to overwhelm the sensor heating systems.

Another possibility that some have mooted on the internet is that the aircraft was being blown around a lot in severe to extreme turbulence, but I don’t see how thereby one would get discrepant readings: rather, all probes would vary wildly, but coordinated, as individual gusts hit all three at more or less the same time. So I really don’t see that as a plausible reason for the P-S system issues.

The IR units are self-contained: they are calibrated sometime way back when and that’s it for the remainder of the flight. So when they start complaining, it is either a system fault or you are already out of control and moving them around more than they judge appropriate.

Severe icing alone overwhelming the sensor systems, though, does not by itself lead to an accident. The AC could be controlled with pitch and power, and the Aircraft Operating Manual explains exactly what pitch and what power setting in some detail, if one has an “ADR disagree” warning.

Severe turbulence, though, could cause a control problem if there are shears of more than 50-60 kts differential, because that is approximately the width of the speed band for that flight at its cleared flight level – this has been verified, using a conservative estimate of the aircraft’s weight at the time, by experienced A330 pilots (by “speed band”, I mean the difference between “maximum Mach operating” speed and stall speed). However, turbulence of that sort, while supposedly possible, is very, very unusual.

How do you get that severe icing overwhelming the PS systems? Temperature at that altitude is well below the freezing point for water, so clouds are generally formed from ice crystals. The properties of these are well known and the air data systems and their certification are designed to cope with them, unless there is an entirely new phenomenon manifesting itself here. Ice crystals don’t show up on weather radar, so even with careful use of weather radar one might not fathom the presence of a storm whose water content is crystalline ice, no matter how violent that storm is.

The behavior of supercooled water droplets doesn’t seem to be as well understood. Water can become supercooled, even as low as -40°C (which would be a typical temperature for the flight level at which AF 447 was flying), especially in strong convective atmospheric currents. Water requires a certain amount of energy to crystallise, and if the air is cooling fast, adiabatically, that energy just might not be there. And if there is enough water, at -40°C, colliding with your sensors and freezing on impact, it may overwhelm the sensor heating and cause air data problems. However, supercooled drops are water and would show up on weather radar. One would expect a crew to avoid such an area being “painted” on their radar, especially in the Intertropical Convergence Zone (ITCZ) in which such storms are frequent, indeed expected. It is common for pilots to deviate many tens of miles from the planned track to avoid such storms, for avoiding the storm is the main priority, and use of the oceanic tracks is designed to accommodate such deviations.

So the severe-icing root-cause hypothesis is not puzzle-free.

What about some sudden, catastrophic structural-failure event such as the sudden in-flight break-up of TWA 800 in 1996? Any such hypothesis must accommodate the fact that parts of the electronics were muttering to themselves in a fairly orderly fashion, and transmitting those mutterings over a SATCOM link, for some four minutes. I don’t see how. (It is obvious that structural failure occurred – the aircraft’s vertical stabiliser has been found separated – but, one would conclude, later in the accident sequence.)

That is enough tea-leaf reading for one note. We might hope that the BEA will explain the exact meaning of the ACARS messages, and its conclusions about their true ordering, in the interim report which, by ICAO rules, must appear within 30 days of the accident (so, by 1 July 2009).

If anyone has more detail on the exact JASC codes used by the airline and (very important!) can demonstrate to me that that information is reliable, I would be very glad to hear from you.