Concorde, Ten Years On, Part 2

9 12 2010

The causes of the Concorde accident to F-BTSC on 25 July 2000 are about as well understood as those of any accident can be. There is also, unusually, a more or less linear chain of causes leading from an exceptionally rare event: the deposition of a particularly hard and sharp strip of metal, which shouldn’t have been mounted in the first place exactly because of such possibilities, on exactly the part of the runway at which Concorde’s tires bear the greatest load – and the aircraft indeed running over it, although it is not a big strip. The Concorde’s ground run goes up to just about 200 kts at rotation, I understand, compared with about 160 kts for a Boeing 747. Furthermore, at rotation the delta wing generates some negative load, putting even more weight on the tires, before it changes to positive and the aircraft lifts off. The sequence of events that then ensued was, as far as I know, not anticipated by anyone in the development or certification or analysis of the aircraft. To my mind, it is hard to see how it could have been. To me, this is a freak accident, of the kind «not expected to occur during the operational lifetime of the aircraft», which is the strictest category of likelihood contemplated in civil aeronautical certification.

But some differ, for example Tom Ferrell in this note to the York Safety-Critical Systems Mailing List. Tom thinks the accident had precursors, which showed, in advance of the accident, that

Regardless of causal agent, the Concorde was susceptible to severe damage from a relatively common occurrence.

He means there had been tire burst incidents, which indicated problems with the design. So is this just a matter of personal taste, say, like wine? Ladkin tastes “freak” and Ferrell tastes “foreseeable” in the same glass, and that’s it? Or is there, as I would prefer to believe, an objective way of evaluating the views, such that one can be shown to be right (or more accurate) and the other wrong (or misleading) in some way?

I think it is partly a matter of what you lump together, and what you don’t. Do you lump together all tire bursts, including this one, and all damage, including this damage, or don’t you? Is this lumping arbitrary, a matter of individual perception? I don’t think so. I think there are objective principles, on which so far I have only an intuitive handle.

How to indicate these principles? I try to show them here by means of a hypothetical cross-examination of Ferrell’s claim. Here goes.

M’lud, regardless of causal agent, the Concorde aircraft was susceptible to severe damage from a relatively common occurrence.

I see, thank you, counsel. What was that common occurrence?

A burst tire, m’lud.

Thank you, counsel. And what was that severe damage?

A 32cm square hole in the lower wing skin, m’lud, which also served as the fuel tank skin.

I see. Had that ever happened before in the history of the airplane?

No, m’lud.

You say “susceptible”. Had damage ever occurred to the lower wing skin, except in this case?

Six times, m’lud.

And how many times was that due to your “common occurrence”, a tire burst?

The lower wing skin was punctured on five occasions when a tire burst, m’lud.

But that is not what I asked you, counsel. I asked you in which of these events the damage to the lower wing skin was due to the tire burst.

It is supposed, three times, m’lud.

You say “supposed”, counsel. Why so?

As far as we know, in those cases, m’lud, the damage sequence was causally initiated by a tire burst. It is conceivable, although very unlikely, that a contemporary but independent damaging event caused the lower-wing-skin penetration, but there was no evidence for that.

I see. Thank you for your care in phrasing this, counsel. And what were the two other events?

In one, on 29 January 1988, the tie bolts holding the two wheel halves together sheared, and in the resulting sequence one of the bolts penetrated the Number 7 tank, leaving a half-inch hole. In the other, on 15 July 1993, there was a braking-system jam, and the Number 8 tank was punctured as a result of the damage sequence.

So, if I understand you, counsel, you tell me that, before the fatal accident at Gonesse, three times it had occurred that the lower wing skin was punctured due to your “common occurrence”, a tire burst.

Yes, m’lud.

And how many years did the Concorde fly in service before the Gonesse accident?

Just over 24 years, m’lud. The first revenue flight was 24 May, 1976.

And how many flight cycles?

About 84,000, m’lud.

I see. That is quite a long time. And, to me, quite a large number of flights, although of course by no means so large as with most aircraft in commercial use nowadays. So are those three occasions a lot or a little, counsel?

With respect, m’lud, I offer no opinion on that question.

So there are these “common occurrences”, which had occurred – how many times, counsel?

Aviation Safety Network has a record of 55 occasions after service introduction in which tires burst, m’lud.

Common enough, I suppose. And these common occurrences had caused damage other than to the tire on – how many occasions, counsel?

Aviation Safety Network has a record of 28 occasions on which other damage occurred, m’lud.

Does that include the two above in which the damage was not initiated by a tire burst, counsel?

Yes, m’lud.

So there were 26 occasions on which, as far as we know, a tire burst initiated damage to other parts of the aircraft?

Yes, m’lud.

So I think you have established, counsel, that a common occurrence, a tire burst, could cause damage, and thus that the aircraft was susceptible to damage from this common occurrence. But you want to establish more than that, don’t you, counsel. You wish to say that the aircraft was susceptible to severe damage.

Yes, m’lud.

Is “severe damage” a technical term used in aviation, counsel?

No, m’lud.

So it is your term, counsel. What do you mean by it?

I mean that the safety of the flight is affected by the damage, m’lud.

Thank you, counsel. Is there any similar term used in aviation?

The U.S. National Transportation Safety Board Part 830 defines an “incident” to be an occurrence other than an accident, associated with the operation of an aircraft which affects or could affect the safety of operations. The same regulation defines an “accident” to be an occurrence [associated with the operation of an aircraft] in which any person suffers death or serious injury, or in which the aircraft receives substantial damage.

I see. Is there a definition of “substantial damage”, counsel?

Yes, m’lud. “…damage or failure which adversely affects the structural strength, performance, or flight characteristics of the aircraft, and which would normally require major repair or replacement of the affected component. Engine failure or damage limited to an engine if only one engine fails or is damaged, bent fairings or cowling, dented skin, small punctured holes in the skin or fabric, ground damage to rotor or propeller blades, and damage to landing gear, wheels, tires, flaps, engine accessories, brakes, or wingtips are not considered ‘substantial damage’ for the purpose of this part.” This definition is similar to other definitions of significant damage, used in definitions of accidents and incidents in, say, the International Civil Aviation Organisation Annex 13, which defines reporting requirements for its member states.

Thank you, counsel. And in which of those 26 tire-burst incidents you enumerated above was “substantial damage”, according to this definition, incurred?

In the incident at Washington Dulles airport on 14 June 1979, m’lud. The performance of the aircraft was affected in that fuel was lost through the debris penetrations of the tank at a rate of up to 4 kg per second. It was unable to continue its flight to London. The aircraft lost 7 tonnes of fuel before it landed again at Washington Dulles.

And in others, counsel?

In no others, according to the definition, m’lud.

I see. Are there incidents in which a fuel tank was penetrated, in which the performance of the aircraft, its structural strength, or its flight characteristics were not substantially affected?

Yes, m’lud. On 29 January 1988, the incident in which the wheel-half tie-bolts broke and a bolt punctured the tank on take-off from London, the flight continued to its destination, New York.

I see. How large was this puncture?

The hole was half an inch, so about 1.3 cm, in diameter, m’lud.

So it appears that a puncture in a fuel tank, even a fairly large hole, does not necessarily count as “substantial damage”?

No, m’lud, it does not necessarily count so.

Are there any other common technical meanings of “severe damage” or “significant damage” which we might want to consider, counsel?

I think so, m’lud. For example, damage which could affect the safety of flight, the definition I suggested.

“Could affect”, counsel, or “does affect”? For example, during the 29 January 1988 incident, was the safety of the flight affected?

Apparently no, m’lud.

Was the safety of flight affected in any of the other tank-penetration incidents besides the 14 June 1979 incident at Washington Dulles?

I don’t believe so, m’lud.

Could it have been?

I believe so, m’lud.

How?

Maybe fuel streaming from a hole can catch fire when it meets engine exhaust, m’lud.

I see. Does it commonly do so, counsel? Do you know of any other incident in commercial aviation when fuel streaming from a smallish hole, such as this, caught fire?

Actually, m’lud, I don’t.

Are there any other ways in which safety of the flight could be affected by such a leak?

When the aircraft lands, m’lud, the brakes heat up, and leaking fuel could fall onto hot brakes and catch fire.

Has this happened, counsel?

Yes, m’lud.

Are there ways to prevent it happening?

Yes, m’lud. If a crew knows they have a leak – and if the leak is substantial you can usually see the stream behind the wing from the rear passenger seats during flight – then they can have fire services meet the plane on landing and cover the brakes and ground under the leak with fire-suppressant foam. This mostly suffices.

Thank you, counsel. So ignition of this fuel is an event for which there exist known and effective countermeasures.

Yes, m’lud.

So although such an event “could affect” the safety of flight, it mostly doesn’t do so.

It appears not, m’lud.

So it appears that penetrations of the fuel tank in themselves do not count as “substantial damage”, and they do not necessarily count as damage which affects the safety of the flight. But they might count as events which could affect the safety of flight if we are sufficiently imaginative in devising scenarios.

It seems so, m’lud.

Let us see how imaginative I may be. As far as I understand quantum mechanics, atomic particles may engage in random motion, that is, displacement of position without apparent cause.

As far as I also understand quantum mechanics, m’lud, that is so.

So it could be, counsel, that all the atomic particles in a Concorde translate 4 meters to the left all at the same time, leaving the passengers sitting, well, somewhere in space outside the fuselage.

I suppose it could be, m’lud.

And those passengers would probably fall to the ground and injure themselves or die.

I suppose so, m’lud.

So it could be, counsel, that the Concorde, indeed any aircraft, suddenly leaves its passengers sitting outside the airframe, leading to serious injury or death.

I suppose so, m’lud.

I am, counsel, as you see, sufficiently imaginative in devising scenarios. You have presented me with two partially overlapping definitions of significant damage, of which the second is indeterminate between “could be” and “is”. I don’t find the “could be” interpretation very helpful, as you see, because I am, as you also see, sufficiently imaginative. And I don’t think any objective safety property of a commercial airplane should depend so heavily on my sufficient imagination. So I am going to interpret “severe damage” as meaning damage which is either substantial in the sense of NTSB rule 830 or which does (not “could” but “does”) affect the safety of flight.

Yes, m’lud.

On which occasions, then, did your “common occurrence”, a tire burst, initiate a causal sequence in which severe damage resulted?

On 14 June 1979 at Washington Dulles, m’lud, and on 25 July 2000 resulting in the crash in Gonesse.

The damage which resulted in the Gonesse crash was then, by definition, substantial, as well as severe, wasn’t it, counsel.

Yes, m’lud.

So, since this severe damage actually happened on that occasion, we can say that, even before this occurred, the aircraft was susceptible to exactly this severe damage, in the sense that, since it did happen, it follows that the aircraft was susceptible to its happening, simply through the usual meaning of the word “susceptible”.

Yes, m’lud, that is what I claim.

Let’s look a little closer at this word “susceptible”. There are some people who claim that human beings spontaneously ignite. Not often, but occasionally. All that is left is ashes. If that is true, and I believe that this is a very, very big “if”, then human beings are “susceptible to spontaneous combustion”, aren’t they, counsel?

Yes, m’lud. But I share your scepticism of the phenomenon.

The point, counsel, is this. We know whether or not human beings are susceptible to spontaneous combustion only in so far as we know actual examples of human beings spontaneously combusting.

It seems so, m’lud.

And, further, let us suppose that there are certain circumstances C in which human beings spontaneously combust, and if those circumstances do not obtain, then they don’t. Then, surely, we are obliged, by virtue of not wishing to mislead our fellow men and women, to say that human beings are susceptible to spontaneous combustion in circumstances C and to indicate that, if circumstances C do not obtain, there is nothing to worry about.

That seems to me reasonable, m’lud.

And I take it that you do not wish to mislead me, counsel!

Certainly not, m’lud!

Then when you claim that the Concorde is “susceptible to severe damage [resulting from] a common occurrence”, which we have more or less agreed is a phrase which may be able to describe the Concorde aircraft, I now want to know if there are any circumstances C which you should be telling me about, under which severe damage resulting from your common occurrence, a burst tire, may be realised. Please note the condition: you are to tell me about circumstances C in which, if they obtain, the accident sequence results, and for which, if circumstances C do not obtain, the accident sequence does not result.

Yes, m’lud. The accident sequence is as follows. A titanium strip lay edge-on on the runway at or near the rotation point of the Concorde. It cut sharply into a tire, causing a tire burst resulting in at least two chunks of tire of approximately 4.5 kg each. It is presumed that one of these expelled chunks impacted the lower wing skin, the skin of one of the fuel tanks, causing a shock wave which blew out the fuel tank skin from inside, near the impact point of the tire chunk, resulting in a hole of size about 32 cm square being formed in the fuel tank, and of course fuel streaming out. The fuel ignited, and burned from very near the fuel tank hole, causing varying loss of thrust in two engines, as a result of which the aircraft was unable to attain positive-rate-of-climb flying speed, and was also subject to thermal damage from the fire under the left wing. As the damage progressed, control was lost and the aircraft crashed in Gonesse.

Thank you, counsel. How do we know that the loss of thrust rendered the aircraft unable to attain the appropriate flying speed?

That is elementary aerodynamics of Concorde, m’lud, and is not disputed.

Thank you. How do we know that the fire caused loss of thrust?

Calculations show that, if air is ingested into the engine intakes at a temperature approximating that of the burning fuel, thrust is lost at more or less the observed and recorded rate.

Thank you. How do we know the fire was present at exactly the unfortunate point to be ingested?

Photographs of the accident, m’lud.

Thank you.

How did the fire attain the state in which it was photographed?

We don’t know, m’lud. We would expect a fuel fire to start when it has been ignited by hot gases from the turbine engines, behind the engines, which of course were in reheat at the time. As far as anyone knows, the front of such a flame cannot travel relatively forward at the speed at which the aircraft was travelling.

So we would expect the fire to remain behind the wing structure, behind the engine exhaust?

Indeed so, m’lud.

But this fire didn’t. Its front came forward underneath the wing, and you have indicated we do not know why.

That is correct, m’lud. There is speculation that it might have been ignited by an electrical spark from some wiring in the undercarriage bay.

Do we know that, counsel?

No, we do not know, m’lud. It is speculation, because we cannot otherwise understand how the flame front came forward under the wing.

Thank you. So we basically do not know the causal sequence between fuel released from the tank and the engines consequently operating at reduced power.

It seems so, m’lud. We do not know why the flame front moved forward.

But of course there would have been no flame front had fuel not been streaming out of a hole.

Indeed so, m’lud.

How big was the hole?

About 32 cm square, m’lud.

That is a big hole! Did such holes occur during any other of your “common events”, counsel?

No, m’lud. The largest was 1 inch x 1.5 inches, caused by metal debris on 15 November 1985. The second largest was the hole of 0.5 inch diameter on 29 January 1988, an event we have already mentioned.

So this hole, in the Gonesse accident, was 160 times larger than the largest hole which had previously been caused, and 790 times larger than the second-largest hole which had previously been caused. That is an enormous difference, counsel! Why is that?

The hole in the Gonesse accident, m’lud, was not caused through tank penetration by debris, but through shock-wave convergence punching a hole through the tank skin from inside.

That is, if I understand you, counsel, a much larger hole, two to maybe three orders of magnitude larger than any that had previously occurred, made by a completely different mechanism.

That appears to be so, m’lud.

And are such kinds of events, reminding me of your phrasing “common occurrence”, counsel, common in commercial aviation?

No, m’lud. This occurrence of the phenomenon is unique in the history of civil aviation as far as we know.

Thank you for your frank answer, counsel. But people knew about this phenomenon, did they?

Military engineers knew of the phenomenon from battle-damage studies, m’lud. It is not clear if any engineer in civil aviation, if anyone involved with civil aviation, knew of this phenomenon before the Gonesse accident. After the accident, military engineers informed the accident investigators of what they knew.

And what of the tire pieces that caused this phenomenon?

It was shown by experiment, m’lud, that a piece of rubber weighing about 4.8 kg and travelling at a relative speed of about 120 m/s, that is something over 300 mph, which could in theory occur due to a Concorde tire bursting at the point in the take-off sequence at which it did, could trigger the shock-wave phenomenon with a proportionate loss of tank skin.

And you have said that two chunks of tire of about that size were found amongst the runway debris.

That is correct, m’lud.

Could any other phenomenon of which we know, say a tank penetration by debris consequent to a “common” tire burst (I use your phrasing), cause the release of a 32 cm square piece of the fuel tank wall?

Not that we know of, m’lud, no.

So we have only one explanation to hand of the known size of the hole?

That is correct, m’lud.

And this explanation, this phenomenon, is otherwise unknown in the history of civil aviation.

That seems to be so, m’lud.

Thank you. If I understand you, this phenomenon was triggered by the impact of a chunk of tire of about 4.5 kg or so?

As far as we can tell, m’lud.

And chunks of this size, of tire pieces or indeed of other material, are frequent, or usual, during your “common occurrences”, tire bursts?

Actually no, m’lud, they are not.

I see. Have they otherwise occurred in any of the tire-burst events, counsel?

Actually, m’lud, they have not.

That is, they are unique to the Gonesse accident?

It appears so, m’lud.

How did they occur?

The tire was apparently cut by a titanium strip lying on the runway near the rotation point of the aircraft, m’lud.

I see. Pieces of metal left lying on the runway cut Concorde tires into 4.5 kg chunks, apparently?

Not any pieces of metal, m’lud, according to experiments undertaken after the accident. Titanium. Titanium is unusually hard. Other metals just crush when the tires run over them.

I see. But titanium strips are to be found lying on runways every so often, I take it?

Actually no, m’lud. This is the only recorded instance ever of a sharp titanium foreign object lying on a runway with commercial operations. Of course we don’t know about the military, since they do not share their records.

Why would that be, counsel, that this is the only instance?

One reason, m’lud, might be poor record-keeping. Another reason might be that titanium is not used on aircraft in places in which it might fall off on a runway.

Oh! So why did it happen here, counsel?

A mistake, m’lud.

I imagine a very, very rare mistake, counsel?

Yes, m’lud. As I mentioned, this is the only recorded instance of such debris lying on a runway at a commercial airport.

So, if I understand you, counsel, the shock-wave phenomenon can only happen via large chunks of debris, and the only way in which large chunks of debris from a burst tire have been known to occur is in this very accident, through cutting by a titanium strip, debris of which there has not been another recorded instance, in part because the use of titanium in a way in which it might separate from the aircraft during take off or landing is proscribed?

That seems to be so, m’lud.

So the circumstances C in which your “common occurrence” can lead to “severe damage” are, as far as we can tell:
(a) a flame front in the streaming fuel from an unusually large hole “moving forward” through an unknown mechanism to burn under the wing, in front of the engine intakes;
(b) a fluid shock wave punching out the unusually large portion of the fuel tank wall to create the hole;
(c) an unusually large chunk of debris creating the shock-wave sufficient to punch out the unusually large hole;
(d) this unusually large chunk indeed impacting the fuel tank wall, rather than being ejected in another direction;
(e) a titanium metal strip lying on the runway near enough to the rotation point to cut a tire which happens to run over it into suitably, unusually large chunks.

That seems to be so, m’lud.

I conclude, counsel, in the words of your claim, that the Concorde aircraft is susceptible to severe damage resulting from a common occurrence (tire burst) under the circumstances (a)-(e) just elaborated. And that, in order not to mislead us, your claim should include the supplementary phrase “under the circumstances (a)-(e) just elaborated”. If you agree to include that wording, counsel, I shall grant your claim. If not, I shall reject it. Accordingly, do you wish to remain with your original wording, or to amend it?

Counsel’s reply is not recorded because the recorder had used up its batteries.
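The arithmetic behind the cross-examination can be checked roughly, as a sketch only. The counts are those quoted in the dialogue; the variable names are mine, and the hole shapes are assumptions, since the text does not say whether “32 cm square” means 32 cm by 32 cm, or whether the earlier holes were rectangular or circular. The ratios depend on how the shapes are read, so the sketch is meant to confirm the order of magnitude rather than the exact figures quoted.

```python
import math

# Figures quoted in the cross-examination (approximate, as given there).
FLIGHT_CYCLES = 84_000           # flight cycles before the Gonesse accident
TIRE_BURSTS = 55                 # recorded tire bursts in service
BURSTS_WITH_OTHER_DAMAGE = 26    # bursts that initiated damage elsewhere
BURST_TANK_PUNCTURES = 3         # tank punctures attributed to a tire burst
SEVERE_BEFORE_GONESSE = 1        # Washington Dulles, 1979


def per_100k_cycles(events, cycles=FLIGHT_CYCLES):
    """Event count expressed as a rate per 100,000 flight cycles."""
    return events / cycles * 100_000


CM_PER_INCH = 2.54

# ASSUMPTION: "32 cm square" is read as a 32 cm by 32 cm hole.
gonesse_hole_cm2 = 32.0 * 32.0
# ASSUMPTION: the 1 inch x 1.5 inch hole is treated as a rectangle.
largest_prior_cm2 = (1.0 * CM_PER_INCH) * (1.5 * CM_PER_INCH)
# ASSUMPTION: the 0.5 inch hole is treated as a circle of that diameter.
second_prior_cm2 = math.pi * (0.5 * CM_PER_INCH / 2) ** 2

ratio_largest = gonesse_hole_cm2 / largest_prior_cm2
ratio_second = gonesse_hole_cm2 / second_prior_cm2

if __name__ == "__main__":
    for label, n in [("tire bursts", TIRE_BURSTS),
                     ("bursts causing other damage", BURSTS_WITH_OTHER_DAMAGE),
                     ("tank punctures from bursts", BURST_TANK_PUNCTURES),
                     ("severe damage before Gonesse", SEVERE_BEFORE_GONESSE)]:
        print(f"{label}: {per_100k_cycles(n):.2f} per 100,000 cycles")
    print(f"area ratio to largest earlier hole: {ratio_largest:.0f}")
    print(f"area ratio to second-largest earlier hole: {ratio_second:.0f}")
    print(f"orders of magnitude: {math.log10(ratio_largest):.1f} "
          f"to {math.log10(ratio_second):.1f}")
```

Under these assumptions the Gonesse hole comes out between two and three orders of magnitude larger in area than either earlier hole, which is the point the judge draws from it.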



Concorde, Ten Years On

6 12 2010

I understand that Simon Foreman observed at a meeting of the RAeS Law Group on 28 April this year on the criminalisation of aviation accidents, reported here in Flight International by David Learmount, that the French legal system does not have a mechanism of the English legal system, the inquest, to determine what went on in an accident. It seems to follow that, in France, for the state to determine what indeed went on in an incident of public interest, there must be a criminal trial.

First point: there are at least two reasons for society to determine what went on. The first is to prevent a recurrence. This is the reason for the ICAO-mandated accident investigation bodies, here the BEA. They have long done their job.

The second reason is to apportion responsibility for compensation, an age-old and widespread human activity. Concerning this second reason, it’s a shame for all that France doesn’t have inquests. I imagine many French people might agree. It is particularly harsh for one person by the name of John Taylor.

Second: why an inquest?

Amongst other things, the results of an inquest help figure out who should ultimately pay. There is an ancient general principle of compensating victims of mishaps. Compensation should not only follow rules but also be seen to be “fair”, adjudicating amongst competing claims, and that is what an inquest does.

Some commentators, including the BBC in their report, have spoken of “gaining closure” for the victims’ families. This notion is a US import to Europe and not one with which I sympathise, nor did I when I was living in the US. I don’t sympathise with it, in part because it gives cover to seeking revenge, an activity of which I expressly do not approve in the case of accidents.

In particular, an inquest is not a criminal trial. It doesn’t punish anyone. It assigns cause.

Third: some have speculated, as in this note on PPRuNe, that this will be a bonanza for tort lawyers.

If this follows the time scale of most major commercial airline accidents, seeking compensation for victims’ families will be mostly over by now. The airline (that is, the airline’s insurance company) will have already paid to settle most or all tort claims, as is by now the general practice in commercial aviation. The cost is reported to be in the realm of €100 million.

Fourth, the ruling is reported to contain the following apportionment: Continental 70%, EADS 30%, everyone else (Air France, DGAC, Paris Airports Authority, etc) 0%. That means that the insurance company will be negotiating with those parties to recover the relevant proportion of its costs. Since there is now a legal ruling which will act as precedent, there would be little point in disputing it in court.

So that will settle the compensation bit.

Fifth, what is this ruling based on?

The ruling is based on the obvious physical ABC of the accident occurrence.

The report said: titanium strip fell off Continental onto the runway; Concorde ran over strip; strip sliced into tire and caused tire burst of unprecedented form and strength; large tire fragment hit tank; impact shock wave caused tank to explode from within; resulting hole allowed fuel to stream out in large quantity; fuel was ignited (not completely sure how, but probably by reheat); fire engulfed critical wing structure and contributed to critical performance degradation of two engines; Concorde cannot accelerate after TO on two engines alone (BTW, there is no evidence that Concorde was overweight at TO) and went down.

That’s what the court found also, as far as I understand the verdict (not yet having read it :-) ).

People have said “missing spacer”. Our work on that said: not causally relevant.

People have said “overweight at dispatch”. Maybe, but not at takeoff, as far as anyone can tell.

People have said “airport should have swept runway better”. Maybe, but that wasn’t a direct contributing cause in the intuitive sense of the above sequence of physical events. It would be like blaming the police for Fred’s broken jaw in a street fight because they weren’t around at the time. Thousands of years of legal tradition say the person responsible for the broken jaw is the person who threw the punch. So here: the court said that the entity responsible for the burst tire is the entity that left the titanium strip on the runway; and further, as I understand it, the person who mounted that titanium instead of an aluminium part (presumably because he was judged to have made a professional error: he should have known to mount a softer metal); as well as, to some degree, the people responsible for the aircraft design, even though (and others will agree with me here loudly) the airplane was a triumph of aeronautical technology, as well as the most beautiful artifact ever to have taken to the skies.

Other people (Continental, apparently) said the plane was on fire before it encountered the strip. The report, as well as all of the people I know who know about Concorde, indeed, physical common sense given the undisputed evidence of what happened, have no explanation at all of how that could possibly have been the case. The evidence presented is circumstantial – eye witness testimony from witnesses who were some way away from the scene. There is no physical explanation of the accident which coheres with that testimony at all, after ten years of thinking about it. I take it that that eye-witness testimony was rejected.

Now, that all seems to me, given the system, appropriate, fair, and straightforward.

What is inappropriate, in the minds of many including myself, is that it seems to need a criminal trial, rather than an inquest, to serve this necessary legal function of apportioning the enormous costs of compensation.

It seems to be particularly inappropriate in the case of poor Mr. Taylor. He repaired an airplane. Imagine a wise-owl supervisor, or some angel with perfect foresight, going up and saying “You can’t mount that there! It might fall off in Paris, and Concorde might run over it and lose a huge chunk of tire which causes a fuel tank to explode and dump fuel into the exhaust and lose power and crash!” and him saying “oh, yes, you’re right” and changing it, as it was in Dickens’ Christmas Carol.

Dickens notwithstanding, to English minds there just doesn’t seem sufficient proximity between act and event to justify a criminal-negligence connection. Dickens’ tale was, after all, a Carol. But there he is now, poor chap, with a criminal record, and a 15-month suspended sentence. Mr. Taylor, on behalf of many, probably most, Europeans, I am very, very sorry!

And that is why people are going to shut up and instruct their lawyers, rather than telling accident investigators all about everything they know, if accidents continue to be criminalised. Just as they are already known to do in rail accidents in Germany, for example.



On A Misleading Trope in System Safety Engineering

18 11 2010

Actually, the trope is the second of four topics I wish to address.

I recently exchanged opinions with Michael Jackson on the use of mathematics and logic in software development (his main interest) and system safety engineering (mine). If I understand him right, Michael believes that a story must be told about how mathematics and logic apply to clarification of things expressed in natural language, and about how mathematics and logic apply to the world. I have what I believe is a simple, direct story, which comes first.

I believe that people working in system safety engineering often form opinions about use of mathematics and logic which are wrong. These opinions propagate amongst sympathetic hearers, and thereby become tropes. I want to address here one such trope. This comes second.

Third, I comment further about the Hazan I discussed earlier in Progress in Hazard Analysis.

Finally, I say a word about the justification for model-theoretic semantics for first-order logic, as proposed by Alfred Tarski some 75 years ago.

First, the story about applying mathematics and logic. There is some story about how natural language works. I don’t want to suggest here what that story is; just that there is one. Whatever that story may be, I hope it is compatible with some sort of Austinian naive realism, as follows. When I assert “the sky is blue now”, I am referring to a real object, the sky, ascribing to it an objective property, that of being blue, and what I say is true or false, or indeterminate if it approaches a color boundary uniformly (say blue-grey), or it is disproportionately covered with cloud and so on. The story will explain how I refer (to a real object in «the world»), how properties are ascribed, and what counts as success for such an ascription («true») and what as failure («false») with some middle group («indeterminate»).

Exactly the same story, whatever it may be, holds for a formal language. Whatever the story is about how reference works for designatory terms, how property ascription works, and how truth and falsity are determined, it works the same way for a formal language as it does for natural language. Whatever I can say, and however I can say it, I can also say it formally in just the same way. The only ability I need to be able to say something formally which others might have expressed in natural language is the ability to parse.

Mathematics in its function of applying to the world, and logic similarly, extend the naive view of language expressed by the sky example in that they are schematic ways of doing things. That is, they generalise over concrete instances and demonstrate to us inferences we may perform that lead from true things we know to true things we were not necessarily sure about but may now be sure of, after the inference has been performed. I may derive an answer to 456 + 7491 = ??. I may accept A and accept B, and want to know what I can assert about both in one assertion: for example, (A AND B), or (A OR B). Alternatively, attempted inferences we wish to make do not go through, for example we know (A OR B) and want to know what this tells us about A (answer: nothing), and we are thereby induced to discover additional truths which may make them go through, or conclude indeed that our wished-for conclusion might be wrong.
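The inferences just mentioned can be checked mechanically. What follows is only a minimal sketch in Python, using a brute-force truth table over two propositional variables; the function name entails and the lambda encodings of A and B are my own illustrative choices, not part of any established tool.

```python
from itertools import product

def entails(premises, conclusion, n_vars=2):
    """True iff every truth assignment that satisfies all premises
    also satisfies the conclusion (checked by exhaustive truth table)."""
    return all(
        conclusion(*vals)
        for vals in product([False, True], repeat=n_vars)
        if all(p(*vals) for p in premises)
    )

# Illustrative encodings of the sentences discussed in the text.
A = lambda a, b: a
B = lambda a, b: b
NOT_B = lambda a, b: not b
A_AND_B = lambda a, b: a and b
A_OR_B = lambda a, b: a or b

and_from_parts = entails([A, B], A_AND_B)        # goes through
or_from_parts = entails([A, B], A_OR_B)          # goes through
a_from_or = entails([A_OR_B], A)                 # does not go through
a_from_or_plus = entails([A_OR_B, NOT_B], A)     # an additional truth makes it go through

arithmetic_answer = 456 + 7491
```

The third check fails because there is an assignment (A false, B true) in which the premise holds and the conclusion does not; adding the premise NOT B removes that assignment, which is one way of "discovering additional truths" that let the inference go through.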

Whatever the story is about how all this works, it works with other things also. I know the law forbids me to drive my car immediately after the “30” sign at 50 km/h, and I do not have to tell myself a complicated story about abstractions and symbols in order to explain the summons that comes in the post after I do so. Few of our fellow citizens who had problems with math and other abstractions in school have difficulty understanding why that summons arrived. So some story of how general schemes work is available to them, without it having to rely on those difficult topics which they could not master in school. Naive realism works for them. For the story about how engineering mathematics works, or business book-keeping for that matter, I am also inclined towards some kind of explanation along the naive-realistic lines proposed by Penelope Maddy some while ago (Realism in Mathematics, Oxford University Press, 1992).

If one doesn’t go for some such account, then it seems to me that the only available alternative is to explain the success of business book-keeping in terms of some sort of tacit convention, as John Searle explained concerning the use of money (The Construction of Social Reality, Penguin, 1995). We somehow “agree” to use arithmetic to come to agreement on transactions. But then the question arises why we use Peano arithmetic (PA), that is, the calculations that are theorems of PA, rather than some other calculations which are not. One can show certain advantages. For example, if I agree with the merchant that I may buy two items and pay him the Peano-sum of the prices, we can merge transactions rather than having to make two separate consecutive ones. But then, a story still has to be told of how PA correctly predicts what change I have in my pocket after either form of transaction, and that this turns out to be the same whether I make the one transaction with the sum, or the two transactions consecutively. PA is obviously more useful than other alternative calculations and a purely-conventional account must explain how this comes to be. A Maddy-type realism account gives a straight answer.
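The claim about merged and consecutive transactions can be checked exhaustively for small amounts. The sketch below is mine and makes some assumptions for illustration only: amounts are whole pence, the pocket starts with 100 pence, and the function names pay, merged and consecutive are hypothetical.

```python
from itertools import product

def pay(pocket, price):
    """Return the money left in the pocket after paying `price` (whole pence)."""
    if price > pocket:
        raise ValueError("not enough money in pocket")
    return pocket - price

def merged(pocket, p1, p2):
    """One transaction paying the arithmetical sum of the two prices."""
    return pay(pocket, p1 + p2)

def consecutive(pocket, p1, p2):
    """Two separate transactions, one after the other."""
    return pay(pay(pocket, p1), p2)

POCKET = 100  # assumed starting amount, in pence

# Collect every affordable pair of prices for which the two procedures disagree.
mismatches = [
    (p1, p2)
    for p1, p2 in product(range(POCKET + 1), repeat=2)
    if p1 + p2 <= POCKET
    and merged(POCKET, p1, p2) != consecutive(POCKET, p1, p2)
]
```

The list of mismatches comes out empty. Of course the program itself relies on machine arithmetic agreeing with PA on these small numbers, so it illustrates the point rather than explaining it; the explanation of why the coins in my pocket behave this way is what the realist account is supposed to supply.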

Logic is the science of inference. It says what conclusions follow from what premises because of their form (and how), and provides some methods of determining conclusions that may validly be drawn and those that may not be. It is a science of the use of language. We may and do put it in mathematical terms nowadays for the same reason, I suggest, that Newton was able so to formulate his natural philosophy, and merchants, much earlier, were able to do so for their transactions, and invented 0 to help them. There are certain generalities we discover that may be put in a much more perspicuous and manipulable form by such techniques than if we continue to use our everyday words to work with them. But the principles remain the same. Naive realism suggests that no new story must be told to explain how this works than must be told to explain how language works in general (whatever that might be).

There are some people who want to say “formal logic is not very useful in engineering”. Formal logic is one way of putting the science of inference in manageable terms, and inference is especially useful in engineering, indeed it is one of the main two or three activities in engineering, the others being bending metal and talking to liability lawyers :-) .

Second, the trope. There are some people who want to say “formal logic, say first-order logic, has Tarskian model-theoretic semantics, so one needs to have a model of anything expressed in first-order logic, and this model is mathematical, and mathematical abstractions cannot be a perfect guide to reality; there’s always some misfit”. I see that kind of argument a few times a year and I think it is balderdash (to use another word starting with “b”). Since I see it a few times a year, I conclude it’s a trope. Since it’s balderdash, I conclude it is misleading.

First point: I use first-order logic (FOL) to talk about inferences in language. No one can force a Tarskian semantics on me purely through that use. (I might still want to use one, though, and the final portion of this note explains why.) When I talk about the sky, I have in my FOL-language a term “the-sky” that designates that (here is interpolated the usual story about reference, identical for my NL talk and for my FOL-talk, whatever that story might be). So, in the trope, that step to model-theory fails.

Second point: it is demonstrably not true that there is always some misfit between mathematical “abstractions” and reality. Peano arithmetic tells me exactly what the reality is about monetary transactions. There is no misfit. If one might be worried about Wittgenstein’s point (actually, Kripke’s point which he attributes to Wittgenstein in Wittgenstein On Rules And Private Language, Harvard University Press, 1982) about how one can tell if one is applying a rule, let’s restrict talk about Peano arithmetic and reality to all combinations of transactions in £ and ¢ which lie under £1.

There are a few logical Luddites in system safety, but I don’t know any who go as far as to argue that you always have to remain wary of your demonstrably-correct arithmetical calculations in case the world just doesn’t fit arithmetic exactly. That formalising one’s inferences should be somehow metaphysically suspicious while formalising one’s arithmetic is not seems to me to be plainly inconsistent. If you asked a civil engineer to design a bridge, and he/she gave you a design, and you asked for the engineering calculations, and he/she said “well, we don’t have any, because statics relies on calculus and arithmetic, and calculus relies on arithmetic also, and, well, we are suspicious that arithmetic may not be a helpful guide to the way the world really is”, what would you think? I know what I’d think, and I may be tempted to put it in writing and send it off to the engineering institution that chartered that engineer.

Third, progress (unfortunately not so much) on the Hazan example. The reason I addressed the trope above is that it arose during discussion on the York list of the Hazan example which I treated a couple of notes back in Progress In Hazard Analysis. Here is where we seem to be (to be stuck?) on the example today.

Recall that Daniel Jackson formalised the analysis in Alloy, and found that the initialisation conditions were not addressed in the STPA analysis. Nancy Leveson says in a reply that these conditions are handled later in the book, but does not provide a citation.

Daniel included his analysis in a set of slides he produced for How To Prevent Disasters, his Keynote talk at SIREN//NL in Veldhoven on 2 November 2010. A “more polished” version of the Alloy model is at this associated URL.

I had another question of the STPA analysis, posed in this note: how does one show completeness of the high-level safety requirement? What this meant was not apparent to everyone; for example see Andrew Rae’s response. Translated, using the terms introduced in Formal Definition of the Notion of Safety Requirement, the question means: how do you show that you have captured all the necessary safety properties expressible in the limited vocabulary that you are using at this stage, and not captured more than you need?

If the answer is to be “this is impossible” (the original answer, but possibly misunderstanding the meaning of the question, from this note) then the answer is simply wrong. If the answer is to be “this is not possible in [XYZ proposed Hazan method]” then I would observe that [XYZ proposed Hazan method] forgoes an objective test of its quality, which other methods (such as OHA) have shown how effectively to incorporate.

To date, I have seen no technical answer to the question, despite subsequent discussion about the matter on the York list.

To summarise, (1) Daniel Jackson’s analysis using Alloy showed a weakness of the specific analysis of the example. I am not familiar enough with STPA to know whether it requires a check for initialisations. Since the example evidently did not include one, I would imagine not. If this is so, it seems to be a weakness. (2) It appears that STPA does not explicitly require a check that all necessary safety properties expressible in the vocabulary at a certain development stage have been captured; and no more than necessary. This is possible within certain limitations (indeed OHA shows how to go about it). A method which does not do this forgoes an objective test of its relative success.

Fourth, and finally, the model theory of FOL and its foundations. It was invented by Alfred Tarski in the 1930s. I met Tarski in the 1970s (I did my Ph.D. in “his group”, the Group in Logic and the Methodology of Science at U.C. Berkeley). Tarski was very clear that everything he did was science. He treated the subject matter of mathematics as no different from trees or galaxies. As far as he was concerned, he was discovering things about the world. He would have had no patience with people who argued “it’s a model, and models cannot reflect reality perfectly”. I think he would have said “models are reality; just bits of it, not the whole thing”. He thought the encapsulation into models was a form of closed-world assumption in the same way as, when we talk about the vagaries of chess, we do not need to invoke quantum mechanics or the make-up of distant galaxies. I imagine he would have said: in the same way in which if we are talking about the hazards to an A380 aircraft taking off from London Heathrow airport, we do not need to invoke the weather in Beijing. The language of a model is a restriction to what is relevant, no more and no less. And to determine what is relevant we have criteria which are outside the model itself, but which are no less clear for that. So people invoking the trope considered above are also, in my view, being unfaithful to the justification with which model-theoretic semantics for FOL was proposed.



Formal Definition of the Notion of Safety Requirement

9 11 2010

This essay concerns the theory of safety requirements, how they may be defined. I am not concerned here with practical methods of determining them. The concepts here may act as a touchstone for evaluating practical methods of determining safety requirements.

A hazard is defined in Leveson’s text Safeware (Section 9.3, page 177) as a system state (or “set of conditions”) from which, in conjunction with some state(s) of the environment, an accident will inevitably result. MIL-STD-882D also defines a hazard as «conditions» that somehow may result in an accident. The advantage of Leveson’s definition is that it is precise, when you know what a state is. It is fair to imagine that «condition» refers to state, as contrasted with behavior, which I take to be best construed as a sequence of states, as in the ontology in my on-line book. However, the development here also holds true for hazards construed as sequences of states – behaviors. Any practical aspects, though, thereby become combinatorially much more complex.

Hazard (states), then, are defined. They form a set, HazSt. For all real systems, HazSt is finite, since the set of all system states is finite. It follows that HazSt is describable in a sufficiently expressive formal logic L. That is, HazSt is the unique model (up to isomorphism) of the description in this logic.

From this, we can define the safety requirements. First, for any sentence Q in L, let States(Q) be the set of states in which Q is true. Then the formal safety requirement SafeReq is defined to be the following sentence:


SafeReq = NOT \/ { R | States(R) subset-of HazSt }

That is, SafeReq is the negation of the disjunction of all assertions which define some subset of HazSt. SafeReq is a single sentence (of the sufficiently expressive logic). The disjunction defines exactly HazSt, so SafeReq is a sentence which defines exactly the complement of HazSt. It is perhaps necessary to note that the way in which SafeReq is defined here gives no clue to its practical determination. For one thing, this definition obviously includes many redundant disjuncts. For this reason, one may take SafeReq to be any logically equivalent sentence to this one.

Now, there are three ways I can think of defining a safety requirement using SafeReq. One is: (a) the system shall never be in a hazard state. Another is: (b) the system may be in a hazard state with likelihood at most P (let’s take a frequentist interpretation: it is in a hazard state up to P proportion of the time, and it is not in a hazard state at least (1-P) proportion of the time). Suppose (c) one wants the probability with which the system may be in a hazard state to depend upon the hazard state. Then one can use the construction which I elaborate, and clause (b) above, for each hazard independently.

So the exact safety requirement is, for a safety requirement of the form (a), SafeReq; for a safety requirement of the form (b), Probability(NOT SafeReq) less-than-or-equal-to P (some low likelihood); and for a safety requirement of the form (c), a conjunction of the per-hazard individual safety requirements of form (b).
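A toy instance may make the construction concrete. This is a sketch only, under assumptions of my own: a hypothetical system with three Boolean state variables (door_open, beam_on, key_in), a made-up hazard set, sentences represented up to logical equivalence by the set of states in which they are true, and the frequentist reading of form (b) applied to a finite trace of states. SafeReq is read, as above, as the sentence true in exactly the non-hazard states.

```python
from itertools import chain, combinations, product

# Hypothetical state space: (door_open, beam_on, key_in).
STATES = set(product([False, True], repeat=3))
# Assumed hazard states: beam on while the door is open.
HAZ_ST = {s for s in STATES if s[0] and s[1]}

def powerset(xs):
    xs = list(xs)
    return chain.from_iterable(combinations(xs, r) for r in range(len(xs) + 1))

# Every sentence R with States(R) subset-of HazSt, represented by States(R).
hazard_sentences = [set(sub) for sub in powerset(STATES) if set(sub) <= HAZ_ST]

# Their disjunction is true in the union of these sets; SafeReq is true elsewhere.
disjunction_states = set().union(*hazard_sentences)
SAFE_STATES = STATES - disjunction_states   # States(SafeReq)

def satisfies_a(trace):
    """Form (a): the system is never in a hazard state."""
    return all(s in SAFE_STATES for s in trace)

def hazard_fraction(trace):
    """Proportion of the trace spent in hazard states."""
    return sum(s not in SAFE_STATES for s in trace) / len(trace)

def satisfies_b(trace, p):
    """Form (b): in a hazard state at most proportion p of the time."""
    return hazard_fraction(trace) <= p

safe_state = (False, True, True)
hazard_state = (True, True, False)
safe_trace = [safe_state] * 10
mixed_trace = [safe_state] * 9 + [hazard_state]
```

Here there are four sentences (up to equivalence) defining subsets of the two-element HazSt, which shows the redundancy in the definition: the union is already given by the largest of them. Form (c) would apply satisfies_b separately to each hazard state with its own bound.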

Completeness is obvious: this is an exact expression, and all of these notions (a), (b), (c) are exact given that SafeReq is. If I have an exact expression of what I need to avoid, that is as good as it can get, theoretically. This expression is dependent, of course, on the definition of the notion of hazard; here, following Leveson’s definition, the notion of hazard is exact, but where it is vague, any notions defined in terms of it will inherit that vagueness.

The same construction also works when we consider hazards to be global states (that is, system state + environment state), when the relevant environmental parameters are state-like. And indeed, one can bring almost any hybrid system into this construal; Leslie Lamport showed how in the nineties in the paper Hybrid Systems in TLA+. The same construction also applies if hazards are environmental states.

One can construe Hazan hereby as an attempt to determine SafeReq in a practical fashion. If one regards OHA as a practical approach to Hazan, as we do, then there is a suitable notion of “the best we can do, logically” at a given stage of the refinement. I assume in readers some understanding of what constitutes a formal refinement using formal languages.

Let L.k be the language of refinement level k. For Q in L.k, note that the refinement in an OHA converges to L (one hopes! Although for combinatorial reasons it will likely never reach it). Let Q.L be the translation of Q in L (indeed, Q.L will normally be identical to Q, since we may assume that L includes L.k). Let States(Q) be the set of states in L (the language of SafeReq, so states-in-L are exact representations of the real states) in which Q.L is true.

I define: Q is a sufficient safety requirement for Level k iff States(Q) superset-of HazSt. Q is a complete safety requirement at Level k if Q is a sufficient safety requirement for Level k, and additionally Q implies all sufficient safety requirements for Level k.
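These two definitions can be tried out on a toy example. The sketch below takes the definitions exactly as stated and adds assumptions of my own for illustration: a hypothetical three-variable state space, a level-k vocabulary that can name only the first two variables (so level-k sentences cannot distinguish states differing only in the third), and sentences represented by their state sets.

```python
from itertools import combinations, product

# Hypothetical full state space: (door_open, beam_on, key_in).
STATES = set(product([False, True], repeat=3))
# Assumed hazard: door open, beam on, key not in.
HAZ_ST = {(True, True, False)}

# Level k can only talk about door_open and beam_on (an assumption).
VISIBLE = (0, 1)

def view(state):
    return tuple(state[i] for i in VISIBLE)

# States indistinguishable at level k fall into the same block.
BLOCKS = {}
for s in STATES:
    BLOCKS.setdefault(view(s), set()).add(s)

def level_k_sentences():
    """All sentences of L.k up to equivalence, each given as States(Q):
    a union of blocks."""
    blocks = list(BLOCKS.values())
    for r in range(len(blocks) + 1):
        for combo in combinations(blocks, r):
            yield frozenset().union(*combo)

def sufficient(q):
    """States(Q) superset-of HazSt."""
    return q >= HAZ_ST

def complete(q):
    """Sufficient, and implies (is included in) every sufficient Q' of L.k."""
    return sufficient(q) and all(
        q <= other for other in level_k_sentences() if sufficient(other)
    )

sufficient_count = sum(sufficient(q) for q in level_k_sentences())
complete_reqs = [q for q in level_k_sentences() if complete(q)]
```

In this example eight level-k sentences are sufficient, and exactly one is complete: the one true in both states with door open and beam on. It includes a state that is not a hazard, because level k cannot say anything about the key; that is the sense in which completeness is relative to the vocabulary of the refinement level.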

None of this suggests practical ways to achieve or express any of these concepts. But it does show that an exact safety requirement of one’s favorite sort exists in theory, in the sense of being exactly expressible in some sufficiently rich language, for these concepts of hazard and system. It also shows that a Hazan which follows successive formal refinement, such as OHA, has a well-defined notion of sufficiency and completeness of safety requirements for each individual refinement level. Given that these notions of sufficiency and completeness are defined for each refinement level, one may inquire, formally, if one’s identified safety requirements at a given refinement level are sufficient, respectively complete.

This surely renders moot any discussion of whether completeness of safety requirements is “possible”. The question becomes whether exact completeness in Hazan is practically achievable. In particular, for OHA and other refinement-based Hazan methods, the question of completeness of the safety requirements at a given refinement level is exact, and one may well take the view that it should be answered level-by-level in course of the Hazan.

I thank Daniel Jackson for helpful discussion.



The Parable of the Exploding Apples

9 11 2010

I thought up the following parable in order to show the value of particular sorts of formal completeness during hazard analysis (Hazan). Contemporary Hazan strikes me as a procedure or procedures in which clever, knowledgeable people sit down together, think about all the things which can go wrong and list them, and stop when they think they have thought of them all. Various rituals with their various names go with this, but the results are consensual and not necessarily objective, in that important properties of the analysis, such as forms of completeness, are not assessed or assured. Our technique, Ontological Hazard Analysis (OHA), performs Hazan in a formal way which ensures certain objective properties of the resulting analysis, rather than just the consensus arrived at by traditional techniques.

There’s a big apple orchard in the countryside, not far from a big housing estate, and families love to go there and picnic under the trees. But there is a problem.

When the apples fall to the ground, people pick them up and eat them, take them home sometimes. Every so often, one explodes, and people are harmed.

We do have some sort of a test for good apples, ones that will definitely not explode, but most apples we just don’t know about, and it’s not practical to insist that the ones which may explode are cleared up, because there are just, well, way too many. It’s impossible to collect them all up and check them. For one thing, we don’t seem to be able to tell when we have got them all.

The ownership of the apple trees is scattered over many people, and each person’s trees are distributed over the field, so one can’t, say, put a fence around Joe’s, and a fence around Mary’s trees, and so on.

There is a safety assessor. He (for he is male) comes around at harvest time, and asks all the apple tree owners what they have done to ensure safety. He asks Mary. Well, says Mary, I have this list of exploding-apple experts with decades of experience and CVs longer than the road you took to get here, who came last Friday, looked around in the vicinity of my trees, debated at length using special words I am not very familiar with, and picked up some apples consensually thought to be possibly bad. He talks to Joe. Similar story. Indeed, he talks to all the apple owners.

Then he comes to me. Well, I said, I injected my trees at the beginning of the season with a special substance which gets into the apples, and which I can sense using this magic wand. The wand isn’t very sensitive, but when it pings there is an apple of mine within two meters of it. Now, I used this to pick up all my apples and inspect them. The ones that are obviously good I put back on the ground for people to enjoy. The ones I couldn’t tell, I painted blue and put them all in a basket. Then I put the basket in that bunker over there, so when one explodes no one will be harmed.

That sounds pretty good, agrees the assessor, but how, he asks me, could I possibly have picked up all my apples, for there just seem to be lots and lots? Well, I said, I went to contraption-school and not apple-growing school, so I devised this contraption with 40 arms and sensors. It goes to where my magic wand said there’s an apple. Because, you see, where my magic wand says there is an apple, I know from experience that there are somewhere between 3 and 37 apples within two meters; there are never more than 37. My contraption picks them all up simultaneously, waves them fast one after the other at the wand, and when the wand peeps loudly, it puts that apple in that big basket over there. So we do four square meters, accurately, in just a little bit longer than the time it takes to pick up one apple. And we go over all the places near to my trees. So that solves the collection problem.

Then we take the big basket, and do the good/don’t know test, paint the don’t-know’s blue and put them in that blue basket over there. And we put the good ones in that yellow basket. Then we take the blue basket to the bomb bunker, the one over there with Safety Requirement painted on the lintel, and then we chuck the contents of the yellow basket back in the orchard.

That all sounds pretty good, said the assessor, but do I have documentation that all this happened as I said. Sure I do, say I, here it is. And in five minutes looking through it he is convinced.

But why bother doing this? He asks me finally. There are sooooo many apples in this orchard, and yours are only a few of them. You don’t do all of them. I say there are a few reasons. First, if an apple explodes this season and someone gets hurt, then you know right now it’s not one of mine, so that will save you time and resources reinterviewing me about it, and it saves me the cost of possible recompense, because we know in advance ’tain’t one of mine: you’re holding the proof in your hand. Second, you’ve gone round the other growers interviewing them at length and satisfying yourself at length whether the experts they hired really did an adequate job. Whereas, with me, it’s taken you ten minutes and you have proof of adequacy of my measures. That has saved you time, it’s saved me time, so we can go off and do something else with the time we saved, and the evidence you have in hand is better than what you get from the others.

So, he says, why don’t we insist that all do what you do? Well, you probably can’t, say I, because there are sooooooo many apples, and besides not everyone has this lucky property that my trees have, that there are never more than 37 apples per four square meters round a tree. But maybe we can hope, says he, that at least some other people will be able to use your technique, and then we can all save a little more time, be a little more sure that more apples are safe, have maybe a couple fewer accidents this year, and all in all improve the quality of our product!

Yes, and why not? say I.

The assessor returns to headquarters and tells his chief, Mr. Golden-Apple-Guru, what he has just seen. Oh, don’t be silly!, says Mr. GAG. That’s all nonsense what Ladkin told you. He can’t have done that. Everyone knows that Completeness Is Impossible! You’re fired!

Key

Many definitions of hazard construe hazards as states. The apples represent states. Whether they are system states, or states of the system+environment, depends on your definition of hazard. MIL-STD-882 and Leveson’s Safeware define hazards as system states.

Hazards are supposed to be conditions in which the chances of an accident are increased. For many conceptions, it is also important (although unstated) that for an accident to happen, the system must pass through a hazard state. (If this condition is not fulfilled, then one exposes oneself to possibly mistaken calculations of risk, as shown in my book chapter Problems Calculating Risk Via Hazard.) The exploding apples, or, rather, their immediate consequences, represent accidents that happen through a hazard state. It is important to note in the parable that all that explodes is an apple.

“My” trees represent those yielding hazards at a particular development stage. In Ontological Hazard Analysis (OHA), these development stages are formal refinements of (earlier) stages. The apples “my” trees shed are the hazards identified at this stage. Daniel Jackson has proposed that Hazan may be performed entirely before one starts on the software development, in this note to the York safety-critical list. Whether this turns out to be so, in the typical OHA there will be many refinement stages before one gets to the point of proceeding with software development.

Daniel has queried the point about knowing there are only at most 37 apples to pick up around «my» trees (private communication). This was my attempt to indicate a level of control imposed during OHA, a deliberated restriction of vocabulary so that the system states may be enumerated (or their relevant equivalence classes may be enumerated) and sorted into hazards and non-hazards as happens in the parable. My special picking tool is intended to show that one applies a specific technique for this sorting at a specific refinement level; we have developed no effective general techniques for so doing and doubt that there are any.

I sent a version of this parable to the University of York safety-critical systems mailing list on 01 November 2010 at 08.33. It was intended to illustrate the values of formal analysis, contrary to claims such as that it is «impossible» to establish completeness of the high-level safety requirements unless you ignore most of the relevant factors, a suggestion which appears in a response of Nancy Leveson to a question of mine in a note in the York thread of 31 October 2010 at 11.15. I believe such claims to be mistaken, but they are also widely believed, as may be seen by reading wider in the thread. Whether the parable helps to counter such claims is, of course, up to the reader to judge.



Progress in Hazard Analysis

22 10 2010

Hazard analysis (Hazan) is one of the necessary skills of a safety-critical systems engineer. In a post to the University of York Safety-Critical mailing list entitled software hazard analysis not useful?, Daniel Jackson proposed that, in my interpretation of what he says, as far as software development goes any hazard analysis may be performed “up front”, at the beginning, and the safety requirements thereby identified carried through the software development as normal specifications would be. Nancy Leveson replied that he didn’t seem to understand Hazan; that his comments referred to Failure Modes and Effects Analysis. But Daniel is a quick learner (he is, after all, a Professor of Software Engineering at MIT!), so one wouldn’t presume this to be the definitive last word on his proposal, as indeed it wasn’t.

Readers can follow the thread, starting from Daniel’s post, using the York archive tools.

I suggested that it is hard to argue about such matters in the abstract, and an example would be helpful. Daniel duly provided one, announced on the York list with a note on his WWW site, accompanied by the worked example checked in Alloy.

Jan Sanders proposed an Ontological Hazard Analysis (OHA) of the example, a first version of which I put on our RVS WWW site on Tuesday, and after having taken account of comments received (most via the York thread) and the discussion there, as well as discussion with Jan and Bernd Sieker on Thursday afternoon (21.10), I posted a revised version this morning: Leveson’s “Safer World” Interlock Example with OHA.

Daniel and we agree, and thereby differ from Nancy, that formally checking reasoning, using simple logic, is essential to responsible systems analysis. Daniel uses Alloy; we so far perform OHA by hand and haven’t developed a checker. This was brought home to us (not that we didn’t agree with it already) when Michael Jackson (not no relation) pointed out a reasoning mistake in our reconstruction of a hypothetical client interview. Daniel’s point about formally checking reasoning is well taken!

We also discovered a formal device for ensuring the properties of an OHA, including particularly the desired formal proof of completeness of the safety requirement formulated at the highest refinement level (which we call Level 0) even in tricky analyses, namely the making of a Safety Assumption, which may be used in the proofs during the refinement that the high-level safety requirements are fulfilled (formally, it is introduced as a conjunct of the antecedent in the Refinement Safety Proof Obligation, RSPO, at each refinement level). The Safety Assumption does not come for free. It must be discharged at some point, either by being proved (from the safety requirements formulated at some more-detailed refinement level), or by subjecting its negation (i.e., that the Safety Assumption is not fulfilled) to a risk analysis to show explicitly that the risk of its not being fulfilled is acceptable. A sort of “buy now, pay later” for Hazan – but one must be careful to ensure that the interest rate is not unacceptably high! The procedural risk being that one may make a strong Safety Assumption early on to allow the analysis to proceed smoothly – and then take it all the way through, to leave some other poor analyst with an extremely difficult or impossible job of discharging it! That obviously won’t do in polite society, therefore use with care!

Jan should be writing this, but he’s off at his daily job of saving our TechFak computer network from the predations of its users, so it’s left to me.

I had forgotten how much joy there is to be had in working this kind of stuff in a rich discussion environment! It’s been a fun week! Thanks everyone!

PBL



Simulators and Veridicality in Airline Training and Pilot Currency Checks

9 09 2010

In his note in RISKS-26.15, Peter Wayner refers to the article Simulator training flaws tied to airline crashes in USA Today, 31 August 2010 (WWW version), which claims to have shown that «Flaws in flight simulator training helped trigger some of the worst airline accidents in the past decade» and that «More than half of the 522 fatalities in U.S. airline accidents since 2000 have been linked to problems with simulators».

I like to think I keep well up to date with commercial aircraft accidents, their analyses and causes, and am aware of simulator strengths and weaknesses. This suggestion struck me as somewhat thin. But if one reads the sentences literally, with their main verbs “helped trigger” and “have been linked to”, they do not speak of causes or causal factors. I can “help trigger” an accident if some USA Today journalist is so enraged by reading this note on his/her Blackberry that he/she runs a red light. And I can link USA Today with whom I wish simply by mentioning them in the same sentence in a Risks note. I am sure the newspaper intends stronger links than this, but it would be good to know what and how, and the article gives no clue. The NTSB uses the words “probable cause” and “contributing factors” in their conclusions and these terms have more precise meanings.

The article mentions three accidents: the November 12, 2001 American Airlines Airbus A300-600 loss of control on climb-out from New York; the December 20, 2008 Continental Airlines Boeing 737-500 takeoff loss of directional control at Denver; and the February 12, 2009 Colgan Air Bombardier Q400 loss of control on approach to landing at Buffalo. The abstracts and links to the full reports are to be found on the NTSB WWW site as, respectively, DCA02MA001, NTSB Abstract AAR-10/04 and NTSB Abstract AAR-10/01. I invite readers to take a quick look at these very short synopses. These three accidents total 315 deaths and the USA Today article does not say which other accidents it counts.

Only the Denver accident causes and factors specifically mention simulators. The pilot flying lost directional control of the aircraft on the runway during takeoff, because of very high gusting crosswinds. The gust “exceeded the captain’s training and experience”, and according to the NTSB he failed effectively to use rudder to control the aircraft in the gust. The first contributing factor allows us to conclude that the crew did not receive timely and accurate information on the actual wind strength and direction. The second contributing factor is “inadequate crosswind training in the airline industry due to deficient simulator wind gust modeling”.

It is widely accepted in the industry that the most recurrent feature of most large-airplane commercial air accidents worldwide in the last few years has been loss of control. It used to be controlled flight into terrain, but it is now widely accepted that the Ground Proximity Warning System (GPWS) and its enhanced version (EGPWS), which adds terrain mapping using GPS and terrain maps, have reduced the incidence of such accidents considerably (although they still occur, as to an Airblue Airbus A321 on approach to Islamabad on 28 July 2010 – see the Aviation Safety Net brief report).

The 2001 American Airlines accident was loss of control because of structural failure: the vertical fin separated from the aircraft. The NTSB found that the pilot flying had caused that separation by overstressing it through “rudder reversal” control inputs; contributing were the rudder control system design of Airbus, and American Airlines Advanced Aircraft Maneuvering [sic] Program AAMP. The NTSB heard both that AAMP discussed use of rudder to help recover from upsets, and that the FAA, Airbus and Boeing had expressed concern about this in a letter to American Airlines four years before. The pilot flying had been observed on a previous flight using rudder to control unwanted aircraft movement from environmental disturbance, and the captain on that flight, who gave evidence to the inquiry, had discussed it with him then. I refer Risks readers interested in more to the report, as well as to my paper The Crash of AA587: A Guide. The AAMP does involve simulator work, but a simulator cannot be known accurately to represent what would happen during unusual piloting rudder-reversal behavior because, well, until the accident nobody knew at what point airframe structure would fail (it turned out to be some one-third stronger than required by certification regulations)!

The pilot flying the Colgan Air accident aircraft reacted inappropriately to a stall warning, by pulling on the stick, and holding it back against the attempts of the automatic “stick pusher” system to push it forward. This resulted in the aircraft stalling at low altitude. Pushing the stick forward is the appropriate response. There was considerable discussion of the pilot’s aptitude, his level of awareness (relating to possible fatigue), and his overall Q400 training at Colgan Air. The NTSB remarked on features of that airline’s training program, which of course involves simulator work. But I don’t think it would be appropriate to conclude that there is anything much wrong with the simulators themselves.

Simulators do not necessarily accurately represent the behavior of aircraft close to the “edge” of their “flight envelope”, and they cannot be taken to do so for flight outside the envelope. Aerodynamicists study these “out of envelope” characteristics by use of wind tunnel models, but actual aircraft are not flown in flight test “out of envelope” except for certain restricted manoeuvres prescribed in the certification regulations (such as flying at “maximum operating airspeed” and initiating a 7.5° nose-down dive for 20 seconds, to mimic an overspeed excursion from cruise). For most “out of envelope” flight, aerodynamicists can make very well-educated guesses (from their wind-tunnel modelling) as to what might happen, but they are the first people to say that they are not at all certain. Nobody goes out to flight-test Boeing 747 aircraft in partially-inverted almost-vertical semi-spins, such as what happened to a China Air Lines Boeing 747 over the Pacific near San Francisco in 1985 (see the digitised version of the NTSB accident report in the entry in our Compendium. Incidentally, the human factors chair on this investigation tells me this was a watershed event for the investigation of human biorhythms and possible fatigue as potential contributors to accidents).

So there are limits to what simulators can achieve, and it is a matter for research how much “out of envelope” behavior can be usefully and veridically simulated. Since loss of control is now prominent amongst probable causal factors of accidents, it seems to me obviously worthwhile to perform this research. Where it will lead is anybody’s guess, as with most research. However, the NTSB’s concern in the Denver report is with situations that could be veridically modelled in flight simulators but currently are not. That could be, and probably should be, fixed.



Fully-Automatic Execution of Critical Manoeuvres in Airline Flying

3 09 2010

David Learmount’s semi-annual review of commercial air accidents has just appeared in Flight International (3-9 August, p34). There were three accidents to high-performance large commercial passenger jets: (1) an Ethiopian Airlines Boeing 737-800 took off from Beirut over the sea at night and ended up in the ocean (25 January); (2) an Afriqiyah Airways Airbus A330-200 impacted the ground violently on approach to Tripoli’s RWY 9 (12 May); (3) an Air India Express Boeing 737-800 overran the runway at Mangalore (22 May). Recently, not included in David’s survey, (4) an Airblue Airbus A321 impacted high terrain while on approach to Islamabad (28 July); (5) an AIRES Colombia Boeing 737 landed and broke up on RWY 6 of San Andres Island (16 August); (6) an Embraer 190 of Henan Airlines impacted short of the runway and broke up on approach to Yichun (28 August).

Taking a random six months of accidents is not a sample conducive to pointing to trends using statistical methods; it is well-known amongst students of commercial air accidents that there are “fashions”, common features which cluster at a certain time, but which then reduce, without anybody necessarily doing anything much different. However, let us start here with the question that is the theme of this note:

Which of these accidents would likely have been avoided had the aircraft been fully automatically controlled?

Unmanned aircraft such as the military Global Hawk reconnaissance aircraft routinely fly complete missions under automatic control, from full stop to full stop. Other unmanned aircraft, such as the Predator «drones» used by the US Military in Afghanistan, and for US southern-border patrol, are remotely piloted, but have had control problems with the remote-piloting regime, as for example in this analysis of a US southern-border accident by Johnson and Shea. I want to emphasise here that we are indeed in the era in which fully automated long-distance flights are routinely flown (if only at present by the US Military, and, soon, other NATO allies with the Euro Hawk).

(1) Ethiopian had taken off into a «black hole» over the ocean at night, in other words into an environment in which there were no outside visual references whatsoever. The aircraft was performing a climbing turn, when it started to descend and disappeared from radar. There were electrical storms in the vicinity. The causes are not yet known, but certain factors have been proposed as hypotheses. The accident is almost certainly loss of control (LOC): no one presumes that the pilots committed suicide/murder. First, spatial disorientation of the pilots. This is a historical factor in the records of accidents in night takeoffs and landings in «black holes», such as over oceans. Second, a weather-related upset, say windshear of some kind causing LOC. Such phenomena are also known historical factors. It is understood that no technical defects have yet been identified, but I also understand that the investigation is not yet complete.

If spatial disorientation of the pilots had been a causal factor, this would have been avoided by full automatic control of the takeoff and after-takeoff manoeuvring.

(2) Afriqiyah was approaching RWY 9 at Tripoli, in clear weather but with reported «low, hazy visibility» (Learmount, op. cit.). «Information from the FDR and CVR indicates that there were no technical faults on the aircraft and fuel starvation was not an issue» (Learmount, op. cit.). Aviation Herald confirms this in its report, see in particular the update from the investigator’s information on 14 August. The aircraft impacted the ground heavily (even violently), some vertical distance below the approach path, indicating a high rate of descent. The impact was about 900m from the runway, according to Aviation Safety Net’s report. The ground in the area of the airport is more or less flat. Although the VOR was NOTAMed unreliable, there is an NDB approach to RWY 9. The aircraft is capable through GPS equipment and NDB reference of constructing a «Continuous Descent Approach» (CDA) path, which gives a more-or-less constant rate or angle of descent to the point of touchdown, constructed by the Flight Management Systems using the exterior navigation aids, and it would have been able to do that at this airport at this time, as far as is now known. If the aircraft had been on a CDA, it would have been at about 200 feet altitude at this point (the arithmetic: assuming 3° approach path, about one-in-twenty, and a touchdown point 300m from the runway threshold, the aircraft impacted about 1200m from touchdown point, at which point it should have been at 60m above touchdown zone elevation (TDZE)).
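
The arithmetic above can be checked with a few lines of Python. This is only a sketch: the 900m, 300m and 3° figures are those quoted above, while the function name and the comparison with the exact tangent of 3° are my own illustrative additions, not anything taken from the investigation.

```python
import math

FT_PER_M = 1 / 0.3048  # feet per metre

def height_on_path(dist_to_touchdown_m, slope):
    """Height above touchdown zone elevation (m) on a straight descent path."""
    return dist_to_touchdown_m * slope

impact_from_runway_m = 900   # reported distance of the impact from the runway
touchdown_offset_m = 300     # assumed touchdown point beyond the threshold
dist_m = impact_from_runway_m + touchdown_offset_m  # 1200 m from touchdown

h_rule = height_on_path(dist_m, 1 / 20)                       # one-in-twenty rule of thumb
h_exact = height_on_path(dist_m, math.tan(math.radians(3)))   # exact 3 degree slope

print(f"rule of thumb: {h_rule:.0f} m, {h_rule * FT_PER_M:.0f} ft")
print(f"exact 3 deg:   {h_exact:.0f} m, {h_exact * FT_PER_M:.0f} ft")
```

Both the rule of thumb and the exact slope put the aircraft at roughly 200 ft above TDZE at the impact point, so the conclusion does not depend on the one-in-twenty approximation.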

Automatics are capable of controlling the airplane within tens of feet of a given path, and routinely do so (indeed, they must do so in certain flight phases, such as cruise in European RVSM airspace). Given that there were no technical issues identified with the aircraft by the investigation, and violent weather was not a factor, a fully-automated CDA would have landed the aircraft on the runway, or at least ensured that it was not some 200 ft below where it should have been assuming a normal 3° continuous-descent approach path.

(3) Apparently the Air India Express Boeing 737 «landed on RWY 24 just beyond the touchdown zone, in fair weather with no rain. It overran the runway end and plunged into a ravine» (Learmount, op. cit.). According to the report by Aviation Herald, the runway has an ILS, required landing distance was 7500 ft and the runway length was 8100 ft. There is no word yet, to my knowledge, on possible causal factors.

This seems to have been a routine landing, with no compromising weather. Such landings are routinely accomplished fully automatically, by the Hawk UAVs.

(4) The Airblue A321 had completed an ILS approach to RWY 30 at Islamabad, had turned right at low altitude and then left, to fly parallel to the runway. The crew is supposed at this point by many (with whom I currently concur, given the information available) to have been attempting a circle-to-land (CTL) manoeuvre, likely to land on RWY 12 (the reciprocal of the approach runway). CTL is a routine instrument flight rules manoeuvre, permitted from the ILS approach to RWY 30 as shown in this snippet from an approach plate, posted by «aterpster» in the PPRuNe discussion forum. In a CTL manoeuvre, the pilot, upon «obtaining a visual with» (i.e., seeing) essential parts of the runway or its environment, manoeuvres to land the airplane, provided the visual contact is continually maintained. If visual contact is lost, a routine «missed approach» manoeuvre must be immediately initiated. During the manoeuvre, the airplane must be flown within a given radius, just over 5 nautical miles, of a specified point on the airport. A diagram of this circling radius, overlaid on a plan of the airport and environment, appears in this post by «aterpster» in the PPRuNe discussion forum. A first approximation to the crash site, overlaid on a map with some of the navigation detail, including the CTL radius from a post by «aterpster», may be seen in this post by «PJ2», who updated his estimate of the approximate crash location some time later in this post. The crash site is reported by Aviation Herald to be about 10 nautical miles away, and in this early article in FlightGlobal, the WWW site of Flight International, to be 9.66 nm. The print version of the article (Flight International, 3-9 August 2010, p7) says 9.7 nm. There were reported to have been «no technical problems» in a later article in FlightGlobal. So the impact site was at about twice the allowable CTL radius.
The CTL radius encloses only flat land; the aircraft impacted «rising terrain», in other words a hill/mountain range nearby, but not so nearby as to constitute any danger to normal IFR operations.

There is a question, currently unanswered, as to why the EGPWS terrain-warning equipment did not enable the crew to manoeuvre to avoid the terrain.

Unlike a (presumed-)straightforward approach as at Tripoli, current commercial-aircraft automatics do not assist CTL manoeuvring in any reliable manner; the procedure should be hand-flown. However, it is a straightforward manoeuvre well within the capabilities of automatic control systems such as those on the Global Hawk to follow an ILS, and circle to land on the reciprocal runway, within the given limits. Automatics could have accomplished this manoeuvre without going outside the given CTL radius and therefore without a danger of impacting high terrain.

Furthermore, systems currently in test for the USAF, and shortly to become operational, perform automatic terrain-avoidance manoeuvres, even – especially – during the kind of low-level manoeuvring performed by military pilots. The system is called Auto-GCAS and was extensively reported and flight-tested recently by Aviation Week (August 2, 2010, pp50-57). Here is a short blog on it by Stephen Trimble of FlightGlobal from last year.

Some proponents of EGPWS have suggested that avoidance manoeuvres in commercial air operations be automatically initiated and flown. This is well within current capability, as shown by Auto-GCAS.

(I have mentioned anonymous writers above. Here is what I know of them. “PJ2” is someone I know, and with whom I have discussed accidents for a decade. He is a recently-retired captain for a major airline, where he was deeply involved in setting up the airline’s FOQA program. He is expert in aviation safety matters and I value his advice considerably. I do not know “aterpster”, but have read many public contributions by him. He self-identifies as a former airline pilot who has been officially involved in accident investigations as a designated representative of pilots’ organisations.)

(5) Initial reports of the AIRES accident suggest that the aircraft landed short, for example this report in Aviation Herald. Weather is reported by FlightGlobal to have included thunderstorms in the vicinity. Some commentators on the PPRuNe thread have suggested that the main gear was torn off upon reaching the runway hard surface, which is elevated slightly above the surrounding terrain (one imagines the wheels sinking into soft ground before the runway, and then impacting the hard runway construction).

It is not possible at this point to estimate the causal influence of the weather – one notes in the above references that the aircraft was reported to have sustained a lightning strike on final approach. But a landing of this sort to the TDZ is routine, even in stormy weather, for digital flight control systems. Providing, of course, they are sufficiently well insulated from the effects of a lightning strike.

(6) The Henan incident was also a landing-short, in reportedly benign weather – see for example the report in Aviation Herald – on a non-precision approach (NPA). The weather was reported as «foggy», but of course fog is incompatible with the kinds of atmospheric disturbances which might lead to control problems, and is not an issue for automatic control. A fully automatic landing was possible in these conditions, but not necessarily in the E190 accident airplane.

At this point, there is no public information about any technical problems with the flight. NPAs have been known for decades to be more accident-prone than precision approaches (ILS), but modern automation such as on the Embraer 190 can routinely perform CDAs, as discussed above with respect to the Afriqiyah accident.

None of the final reports are out, or expected yet, for any of these accidents. As things stand at present, the Ethiopian and AIRES accidents could have had the causal involvement of atmospheric disturbance; we don’t know. But other potential causal factors would have been mitigated if the manoeuvres had been performed fully automatically. In the case of the other four accidents, it seems quite reasonable to assert that they would have been avoided had the manoeuvres been performed fully automatically, which is outside the current capabilities of commercial-aircraft avionics but certainly within the routine capabilities demonstrated by Global Hawk UAVs and the USAF’s Auto-GCAS.

There are of course substantial safety issues with fully-automatic flight in civil airspace. It is correct to say that at this point it is not operationally feasible. For a recent review of some issues, see the forthcoming paper Computational Concerns About Integration….. by Johnson, to be read in two weeks at the SAFECOMP conference in Vienna.

So no one is yet suggesting, even for the medium term, pervasive fully-automatic commercial air transportation. But in light of the observations above concerning the six 2010 fatal accidents to large commercial jet aircraft, it does look as if it would be worthwhile to research whether standard approach and landing manoeuvres could be transitioned to routine fully-automatic execution.



The Internet as an Educational Tool

1 09 2010

Time was, we thought that people, students, who wanted answers to questions, could come to our office hours, ask, and be answered.

Then we thought that these people could pose these questions to bulletin boards and forums on the Internet, and get answers from all sorts of people, answers which were at least as good as, and maybe even better than, what they could get from us in our office hours.

How wrong we were! For an example of what happens when someone like me attempts to answer a question as if it were posed as a technical question to me during my office hours, see this thread on PPRuNe.

For background, “BOAC” is an experienced, wise, and mostly thoughtful pilot who flew Lightnings for the RAF (a wonderful and singular machine, indeed the only aircraft which demonstrated it could outperform Concorde) and most recently Boeing 737 machines. “Pugilistic Animus” is someone who in my estimation has at least a graduate’s grasp of aerodynamics, and likely more – hard for me to tell (but he could, if he chose).

In traditional educational circles such as I have experienced since the 1970s, the questioner would have posed the question, it would be answered as per my reply, and everyone would have gone back relatively satisfied to whatever they were doing. Handling the question via the forum, and parrying the denigrations so that the questioner, if he or she was still reading, could be more or less satisfied that the original answer was trustworthy, seems to have taken me at least four times as long, and who knows what the questioner makes of the interactions.

So, what conclusion do you, the reader, draw for the future of education via open Internet discussion forums? Please let me know, for I would dearly like it to work somehow, but this example does not give me hope.

PBL



Malware and the August 2008 Madrid Spanair Take-Off Accident

27 08 2010

On 20 August 2008, an MD-82 aircraft of the airline Spanair crashed on takeoff (TO) from Madrid-Barajas airport. The high-lift devices on the wing had not been properly configured to give the necessary lift on takeoff, and the aircraft was unable properly to lift off as planned. See Aviation Safety Net’s report of this accident for more details.

There had been a maintenance issue during a previous attempt at departure, and maintenance personnel had addressed this issue. In effecting the repair, however, the takeoff configuration warning horn, which aurally warns the crew that the high-lift devices are not appropriately configured for takeoff, had also been disabled. The crew is required, in the pre-take-off check list which they have to perform, to check that the aircraft is appropriately configured for takeoff, and it seems that they did not do so at the second departure: they performed some of the items, but not the full list.

Spanair uses a ground-based computer to process aircraft logs for maintenance issues. The fault which caused the accident aircraft to return to the gate had apparently occurred more than once the previous day, and been logged. But the press has recently reported that malware in this computer delayed the processing of reports, and so maintenance was not aware of the problem the previous day, when they would have been able to correct it, before the fated flight. The press reports have thereby connected this malware with the accident. See, for example, a summary in English of the reports by Daniel Johnson on the University of York Safety Critical Systems Mailing List.

Brian Reynolds commented on these reports that “This is totally bogus” and clarified that he meant that it is “totally bogus” “[t]hat a virus or Trojan in a ground maintenance computer is casually related to this incident.”

Reynolds seems to be denying the claim that malware in a ground-based maintenance computer is causally related to the accident. But he omitted to say what his criterion for causal-relatedness is.

I have one: the concept of necessary causal factor, proposed in 1973 by the philosophical logician David Lewis, who credits the concept to David Hume (his “second definition” of cause). I took over Lewis’s semantics 15 years ago for use in failure analysis.

According to this semi-formal, objective notion of causal factor, there is demonstrably a chain of causal factors leading from the presence of the malware to the accident. According to this concept, Reynolds is provably wrong.

So now let me show this.

Here is the Counterfactual Test:

Let A and B be events or states.

A is a necessary causal factor in the occurrence of B just in case:

If A had not occurred, then B would not have occurred.

This last sentence is called a counterfactual (or contrary-to-fact) conditional. “Conditional” comes from the “if…then…” form; “Counterfactual” from the fact that A and B did as a matter of fact happen, and one is supposing what the world would then have been like had A not occurred. In order to determine this, I adapt the Lewis semantics: suppose A had not occurred, but the world stayed otherwise as similar as possible to the actual state of affairs that pertained. Did B occur in this possible state of affairs? Most often, we cannot answer with absolute certitude “yes” or “no”, but it turns out that we can most often answer “most likely, yes”, or “most likely, no”. The Counterfactual Test is to ask this question I just posed. If the answer is “most likely, no”, the Counterfactual Test is “passed” and A is a necessary causal factor of B. If the answer is “most likely, yes”, then A is not a necessary causal factor of B. We have found the Counterfactual Test to be very useful in complex engineering failure analyses.

To show a causal connection between the presence of malware on the maintenance computer and the accident, here are five instances to check with the Counterfactual Test:

1. Had the malware not been present, the fault causing the phenomenon would have been noted by maintenance personnel in a timely manner (let us say: at latest, end of the previous day).
2. Had the fault causing the phenomenon been noted by maintenance personnel in a timely manner, it would have been appropriately repaired before the accident flight.
3. Had the fault been appropriately repaired before the accident flight, the TO-configuration warning would have sounded on the accident flight.
4. Had the TO-config warning sounded during TO on the accident flight, the TO would have been aborted when the warning sounded and the aircraft properly configured before subsequent TO.
5. Had the TO been aborted when the warning sounded, the aircraft would not have crashed as it did.

I consider all of these counterfactuals to be true according to the Lewis semantics. It follows:

1a. The presence of the malware was a necessary causal factor in the lack of timely awareness of the fault.
2a. The lack of timely awareness of the fault is a necessary causal factor in lack of timely repair.
3a. The lack of timely repair is a necessary causal factor in the TO-config warning inhibition.
4a. The TO-config warning inhibition is a necessary causal factor in continuing TO to loss-of-control.
5a. Continuing TO to loss-of-control is a necessary causal factor in the accident.

So, there is a chain of six causal factors, chain-length five, connecting the presence of malware to the accident. QED.
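
For readers who like to see the bookkeeping, the chain can be written down and counted mechanically. The sketch below is illustrative only: the factor labels paraphrase 1a to 5a, the function name is my own invention, and the truth of each individual counterfactual remains a matter of human judgement which the code simply takes as given.

```python
# The five links 1a-5a as (necessary causal factor, effect) pairs.
links = [
    ("malware present", "no timely awareness of fault"),
    ("no timely awareness of fault", "no timely repair"),
    ("no timely repair", "TO-config warning inhibited"),
    ("TO-config warning inhibited", "TO continued to loss of control"),
    ("TO continued to loss of control", "accident"),
]

def causal_chain(links, start, end):
    """Follow the links from start; return the list of factors up to end, or None."""
    effects = {}
    for cause, effect in links:
        effects.setdefault(cause, []).append(effect)
    stack = [[start]]
    while stack:
        path = stack.pop()
        if path[-1] == end:
            return path
        for nxt in effects.get(path[-1], []):
            if nxt not in path:  # guard against cycles
                stack.append(path + [nxt])
    return None

chain = causal_chain(links, "malware present", "accident")
print(len(chain), "factors,", len(chain) - 1, "links")  # 6 factors, 5 links
```

Removing any one of the five pairs from the list makes the function return None, which mirrors the point that the argument depends on every one of the five counterfactuals being judged true.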

I emphasise, just to avoid misunderstanding, that these are by no means the only causal factors relevant to the accident: that the crew failed adequately to perform the pre-takeoff check list on the accident flight is most certainly a necessary causal factor in the loss of control. The reader is invited to try out the Counterfactual Test to assure himself or herself of this.

Applying the Counterfactual Test rigorously throughout the list of potentially-relevant factors, to see which ones are indeed causally relevant and which not, is the core of our analysis method Why-Because Analysis (WBA). For those interested in seeing relatively quickly how we perform WBAs nowadays, there is available a case study on how to perform a WBA using the SERAS Reporter and SERAS Analyst tools. Here is some general info concerning our experience with Why-Because Analyses. Typically, depending on the level of detail provided by the investigation, a detailed causal analysis (which we represent in graphical form as a Why-Because Graph) ends up showing a hundred to a couple of hundred individual factors, of which a quarter to a third are “root-causal factors”, that is, causal factors which are not regarded as themselves having pertinent causes. So WBA also includes a fair amount of bookkeeping, or “complexity control”, or whatever one wants to call it. For example, given a WBG with a couple hundred items, one would assemble these causal factors into a small number of subgroups, and give these subgroups appropriate titles, to provide an “executive summary” of the analysis. The SERAS Reporter and SERAS Analyst software is available as freeware from Causalis Limited.
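
To illustrate what “root-causal factor” means in graph terms, here is a toy fragment built only from the factors discussed in this note. It is a sketch under my own assumptions: it is not how the SERAS tools represent a Why-Because Graph, the labels and function name are hypothetical, and a real analysis would trace factors such as the omitted check list back further.

```python
# Toy Why-Because Graph fragment: (cause, effect) pairs.
edges = [
    ("malware present", "no timely awareness of fault"),
    ("no timely awareness of fault", "no timely repair"),
    ("no timely repair", "TO-config warning inhibited"),
    ("TO-config warning inhibited", "TO continued to loss of control"),
    ("pre-takeoff check list not completed", "TO continued to loss of control"),
    ("TO continued to loss of control", "accident"),
]

def root_causal_factors(edges):
    """Factors that appear as a cause but never as an effect."""
    causes = {cause for cause, _ in edges}
    effects = {effect for _, effect in edges}
    return sorted(causes - effects)

for factor in root_causal_factors(edges):
    print(factor)
```

In this fragment the roots are the presence of the malware and the incomplete check list, simply because the fragment stops there.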

We can well expect a full WBA of the Spanair accident to contain between a hundred and a couple of hundred factors.