Software Quality and Fitness for Purpose

26 08 2011

Following on from my recent post on certification requirements for commercial aircraft, John Rushby and I have been discussing a paper of his on commercial aircraft software and the DO178B guidelines, which is in the invited session on certification at EMSOFT 2011.

John is concerned with whether DO178B “works”, that is, leads to high-quality code which is fit for purpose, and, if so, why and how. I think that is a hard and important question and I commend his bravery in addressing it squarely (rather than hiding behind a blog format, as I do :-) ). I commend his paper to people when he publishes it – I imagine it will be on his publications page sometime in October 2011. The paper is not long, but it is dense. Rather as if it were a poem, I had to read it multiple times, carefully. It took me a week to respond. I concluded I do really prefer Goethe, but then he didn’t talk about avionics SW.

John suggests that DO178B is focused on assuring the correctness of the executable code. I found that surprising; I think both Martyn Thomas and I are concerned that DO178B is focused largely on processes which people thought and still think (often, we claim, without much scientific evidence) correlate with code that is fit for purpose.

I use the term “fit for purpose” as does Martyn. John suggests that is a British term not used in the US. I prefer it to the term “correct”, so let me translate. There are at least two ways in which code may be said to be “correct”. One is that it fulfils its specification; let us call this correct-1. The second is that the code causes the system to behave in a manner appropriate for the task at hand; let us call this correct-2. I call the correctness-1 properties of code its quality, and correctness-2 properties fitness for purpose. But let me say correct-1 and correct-2 for the remainder of this note.

The specification may not subsume the “task at hand” in all cases: quality, correct-1, does not imply fitness for purpose, correct-2. Indeed, there is good evidence dating back some twenty years now, say, in work of Robyn Lutz published in 1993, that most failures of correctness-2 in moderate to complex systems are not failures of correctness-1.
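To make the distinction concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the function name clamp_command, the limit of 10 degrees, and the "specification" in the docstring are invented for illustration and come from no real avionics system. The function meets its written specification (correct-1), yet the specification says nothing about an invalid sensor value, and what the code then does is hardly appropriate for the task at hand (correct-2).

```python
import math


def clamp_command(demand_deg, limit_deg=10.0):
    """Hypothetical spec: the result lies in [-limit_deg, +limit_deg],
    and equals demand_deg whenever demand_deg is already in that range."""
    return max(-limit_deg, min(limit_deg, demand_deg))


# Correct-1: both clauses of the hypothetical spec hold for these inputs.
in_range = clamp_command(3.0)      # passed through unchanged
limited = clamp_command(25.0)      # limited to +10.0

# The spec is silent on an invalid reading such as NaN. Every comparison
# with NaN is False, so min() keeps the limit and the result is a full
# deflection command: within the spec, but not fit for purpose.
invalid_reading = clamp_command(math.nan)
```

The point of the sketch is only that no amount of checking the code against this specification would reveal the problem; the flaw is in what the specification omits.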

Martyn and I would agree on the lack of evidence that “clean and tidy” SW development processes entail any concrete property of the resulting code (such as fitness for purpose). Martyn would also suggest, citing work of Andy German of QinetiQ, that higher DALs in DO178B do not necessarily correlate with higher-quality software. Here is what Martyn says, information from personal communication with Andy:


Here is some data from the formal analysis of the avionics software of a current American military aircraft that was certified against DO-178B Levels A and B for use in civil airspace.

The following defects were among those reported in the software after certification:

Erroneous signal de-activation.
Data not sent or lost
Inadequate defensive programming with respect to untrusted input data
Warnings not sent
Display of misleading data
Stale values inconsistently treated
Undefined array, local data and output parameters
Incorrect data message formats
Ambiguous variable process update
Incorrect initialisation of variables
Inadequate RAM test
Indefinite timeouts after test failure
RAM corruption
Timing issues – system runs backwards
Process does not disengage when required
Switches not operated when required
System does not close down after failure
Safety check not conducted within a suitable time frame
Use of exception handling and continuous resets
Invalid aircraft transition states used
Incorrect aircraft direction data
Incorrect Magic numbers used
Reliance on a single bit to prevent erroneous operation

The worst module had a defect density greater than 1 defect in 10 lines of code. The best had 1 defect in 250 lines.

The problem, as I see it, is that most software assurance relies on testing, although we have known for at least 40 years that testing can only show the presence of errors and not their absence. Until software assurance is mostly based on mathematically rigorous analysis of the software (which can be done at no increase in cost if the software is developed with this in mind) these unacceptable rates of software defects will continue.

Notice that these are errors in the sense of correctness-1.

I hold it to be significant that a number of these errors could not have occurred had the SW been written in a strongly-typed language and the compiler correctly implemented the strong typing. This has been an issue for forty years, ever since the Algol project gave up. This is one of the demonstrably best-known ways to avoid certain well-defined classes of program error. If DO178B is truly focused on the correctness of the implemented code, why doesn’t it require development in strongly-typed source, and use of a demonstrated type-correct compiler?
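A small sketch of the kind of error that typing rules out, using “incorrect aircraft direction data” from the list above as the motivating case. Two caveats, stated loudly: the type names Degrees and Radians and the function to_radians are invented for this illustration, and Python enforces the distinction only at run time, whereas the argument above concerns languages in which the compiler rejects the mistake before the program ever runs. Read it as an illustration of the idea, not of the mechanism.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Degrees:
    """Hypothetical wrapper type: an angle in degrees."""
    value: float

    def __add__(self, other):
        if not isinstance(other, Degrees):
            return NotImplemented  # refuse to mix units
        return Degrees(self.value + other.value)


@dataclass(frozen=True)
class Radians:
    """Hypothetical wrapper type: an angle in radians."""
    value: float

    def __add__(self, other):
        if not isinstance(other, Radians):
            return NotImplemented  # refuse to mix units
        return Radians(self.value + other.value)


def to_radians(angle: Degrees) -> Radians:
    """The only sanctioned route from one unit to the other."""
    return Radians(math.radians(angle.value))


heading = Degrees(90.0)
correction = Radians(0.1)

# Adding a heading in degrees to a correction in radians is rejected
# instead of silently producing a meaningless number.
try:
    heading + correction
    mixed_units_rejected = False
except TypeError:
    mixed_units_rejected = True

# The explicit conversion makes the intended computation legal.
total = to_radians(heading) + correction
```

With bare floats, the sum 90.0 + 0.1 would have been accepted without complaint.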

Andy also claimed in a paper in Crosstalk 16(11), the Journal of Defence Software Engineering, November 2003 that “no significant difference” had been found with respect to levels of correctness-1 in code developed according to DO178A Design Assurance Level A and Level B. Development according to DAL A is regarded as significantly more resource-intensive than development to DAL B, in part because DAL A requires so-called MC/DC testing (see the helpful tutorial by Kelly Hayhurst, pointed out to me by Mike Holloway), which is quite hard work.

BTW, that edition of Crosstalk also includes a fine article on the so-called Ravenscar profile, for interprocess communication in Ada which admits straightforward static analysis, by Brian Dobbing and Alan Burns.
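For readers who have not met the MC/DC testing mentioned above: modified condition/decision coverage asks, among other things, that each condition in a decision be shown to affect the outcome of the decision independently. The sketch below illustrates the “unique-cause” reading of that idea in Python; it is not taken from DO178B or from the Hayhurst tutorial. The decision (a and b) or c and the function names are invented for the example.

```python
from itertools import product


def decision(a, b, c):
    """Hypothetical decision with three conditions."""
    return (a and b) or c


def independence_pairs(func, n_conditions):
    """For each condition, list the pairs of test vectors that differ
    only in that condition and give different decision outcomes."""
    pairs = {i: [] for i in range(n_conditions)}
    for vector in product([False, True], repeat=n_conditions):
        for i in range(n_conditions):
            if vector[i]:
                continue  # visit each pair once, from its False side
            flipped = vector[:i] + (True,) + vector[i + 1:]
            if func(*vector) != func(*flipped):
                pairs[i].append((vector, flipped))
    return pairs


def achieves_mcdc(tests, pairs):
    """True if, for every condition, the test set contains both
    members of at least one independence pair."""
    chosen = set(tests)
    return all(
        any(low in chosen and high in chosen for low, high in candidates)
        for candidates in pairs.values()
    )


pairs = independence_pairs(decision, 3)

# Four tests (n + 1 for n = 3 conditions) suffice for this decision.
four_tests = [
    (False, True, False),
    (True, True, False),
    (True, False, False),
    (True, False, True),
]

# Two tests that exercise both outcomes of the decision, but do not
# show each condition acting independently.
two_tests = [(True, True, False), (False, False, False)]

four_ok = achieves_mcdc(four_tests, pairs)
two_ok = achieves_mcdc(two_tests, pairs)
```

The hard work lies in finding such vectors for every decision in real code, where conditions may be coupled and some combinations are infeasible; the sketch ignores all of that.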

The big UK certification effort on which I understand much of the QinetiQ work was performed was for an aircraft manufactured by Lockheed Martin (BTW, one barely calls them “manufacturers” any more, but rather “system integrators”). John said by way of anecdote that he had indications that internal data of both Boeing and Airbus show “more issues” with DAL B software than with DAL A software.

Assuming these observations are correct, the question here would be how two experienced companies develop quality-improved software using the extra requirements for DAL A, but a third experienced company does not see any improvement.

The answer must be that there are hidden factors at work, factors which actually do lead to an improvement in SW quality, which in two companies are associated with the extra effort required for DAL A development, but which in a third company are not.

Since DO178B misses those factors (for otherwise all companies would show improved quality in DAL A development over DAL B development), isn’t it important to find out what they are, and then write them explicitly into DO178C, which is currently in its final stages?

BTW, if you want my view on what an ideal SW safety standard should say (thank you for asking :-) ), check out slide 22 of my Ada Connection keynote talk of 21 June 2011.

I might point out that it is much shorter than the 150pp of IEC 61508 Part 3 Version 2, which I mutter about on the previous slide.



Coda, Interdisciplinary Work, and Scientific Publishing

15 08 2011

It sounds like a mish-mash, doesn’t it? It will probably read like a mish-mash, too.

Because true interdisciplinary work always looks that way, I think. That is one of the main points I wish to get across. But first, let me get there.

Concerning my last post, Leslie noted that the condition he labels “FAA requirement” in his slide 4, for a 10^-10 probability of failure per hour, was actually a NASA requirement for the SIFT research. SIFT was the first digital flight control computer, and SRI was supposed formally to verify its operating system. The project didn’t succeed in this original goal over a decade but, as is often the case, we computer people learned far more, and more fruitfully, from this failure than we ever would have had it “succeeded”. For example, I am not aware of any formal proof that such-and-such a non-trivial system S is guaranteed free of Byzantine failures, for any system S that is not artificially constructed just for the proof. And that’s thirty years after the papers were published! Conclusions: Lamport and co put their fingers on some things that we just can’t do. Not only that, but they classified a cross-disciplinary problem in a new way. Byzantine failures, as spoken of by Driscoll et al., are a system problem, a mixture of phenomena which have to do with the electronic design, as well as the materials, of which system components are made. Transistors get cracks in them and turn into condensers (a Space Shuttle Byzantine agreement problem). But Lamport et al. turned their efforts to a pure algorithmic problem and published in pure computer-science journals (indeed, the best). Leslie is not a computer scientist who deals with avionics, he is a computer scientist who deals with computer science.

But right on the boundaries also. One of the most insightful (and to my mind, one of the best) pieces of work he ever did was on the collection of issues about arbitration in converting continuous (“analog”) data into discrete (“digital”) data: Buridan’s Principle, whose purely technical contribution rests on a mathematical theorem he proved with Dick Palais, his thesis advisor. You can read Leslie’s account of the odd results of his attempts to publish. He gave it to me sometime in the 1980s. But since the 1990s, everyone can know about it and read it at will, because he put it on the WWW. Thank heavens for the WWW!

And that is a point about interdisciplinary work with which I have been struggling now for almost twenty years. One writes a paper on the causal analysis of a computer-related aircraft accident using the Lewis semantics (the Counterfactual Test). One sends it to a computer science journal. Review: “that’s got aeronautics in it, no one in computer science understands aeronautics, better to try an aeronautics journal”. One does. Review: “that’s got logic in it, no one in aeronautics understands logic, better to try a logic journal”. One is not stupid, but if one were, one might try to do so. Anticipated review: “that’s got computer science and aeronautics in it, no one in logic understands computer science and aeronautics, better to try a computer-science-and-aeronautics journal.”

And that’s all true and that’s all reasonable. Indeed no one in computer science reads aeronautics journals. No one in aeronautics reads logic journals, and so on. That’s why many engineers working on avionics bus systems still do not know about Byzantine failures, 30 years on.

The result is that most of what I write gets on the WWW and stays there. One can spend one’s time writing, or chasing one’s tail around such publishing conventions, but doing one takes time and effort away from the other, and I prefer writing.

Just to give an indication, one of the pieces of work I performed in the last year of which I am most proud is the analysis of causal explanations of the Concorde accident and assessments of responsibility, which I wrote about in my post Concorde, Ten Years On, Part 2. I see there a series of pressing social and technical issues and their interplay, which people have not satisfactorily come to grips with, and I regard that piece as some kind of a start. As I said, I’m proud of it. One can’t do that kind of work every day, or at least I can’t. One has to seize the moment and I did. Actually, that is the way many successful researchers work in math or computer science. Or philosophy, for that matter. You spend most of your time laying some kind of groundwork as best you can, and then you are somehow handed a moment and you seize it: “I can do that!” and you do. Some more than others.

This wasn’t ever different. Disciplines were partitioned, especially academic disciplines. But one would have thought, as I guessed 15 years ago, that the WWW would make everything different. Mais, plus ça change, plus c’est la même chose.

Some more examples.

I recently organised a Workshop on the Fukushima nuclear accident, inviting largely sociologists and computer-system-safety people. People who read my blog know why I laud the sociologists for their insights into technical matters. When I was thinking we could do this, I asked people about funding. The Scientific Board of CITEC, where I am a PI until November 2012, thought it was a cracking good idea and very relevant and offered financial support. My colleagues at the Centre for Software Reliability in Newcastle upon Tyne, when I called them to apologise that we were withdrawing from their exhibition at the Ada Connection in order to put the money in the Workshop, also offered financial support. Thank you all! And I did approach the German central funding agency for scientific research, the DFG, which had circulated an e-mail saying that in the wake of the tsunami there were instruments available to support cooperative research on the matter in the very short term.

Naive as I am, I took this message literally. I contacted the responsible administrator whose address was on the note. He graciously explained that his “instruments” were limited and didn’t support my workshop idea, and passed my request on to, amongst others, the administrator responsible for the support of engineering research, who replied forthwith in one sentence: “from the point of view of engineering, I don’t see any possibility of support.” What? The world has just experienced one of the two most devastating engineering accidents ever, German politicians scrambled over each other to devise our exit from nuclear power, and the prestigious German academic research support agency says it ain’t interested? I put in a carefully worded query asking whether this could really be so, and received no reply.

Now, me being me, I would think they should be ashamed of themselves. But if I said that, I’m sure it would be indicated to me how inappropriate that would be, and really that I don’t understand the formal courtesy structures at play, and so on. Maybe all true. But the fact is that I have an international reputation in accident research, here was a biggie with major political consequences, I invited a bunch of top people to discuss it, they all said yes by return e-mail, and the engineering research support organisation said it wasn’t interested. There is no way around that fact, no matter how pretty the words.

And that illustrates something that I feel is going more and more wrong with academic-type research over the years in which I have been involved with it. I suspect it is particularly acute in Germany. Academic research here after the first degree is performed by scientific employees, by people in temporary jobs. There are no “graduate students” (although that is beginning to change: there are now narrowly-defined graduate colleges which offer competitive scholarships. We have one in Bioinformatics and Genome Research, another in Situated Communication, which I think is now over, and another in Cognitive Interaction Technology, which I think has absorbed it). You want to offer a research topic in an American university, you do so, to all the graduate students, and some one will be attracted to it and come and talk to you. In Germany, you have to apply for funding (mostly from the DFG) for a temporary position to perform the research, then wait for the job applications, and hire someone on the basis of an interview. It’s a lot more work for the faculty member; there isn’t the same personal connection to the bright young people you already know are capable of the work; it’s less flexible (I got three quarters of the way through two other thesis topics before I hit on the one I could finish, and none of them were connected with each other. You can’t do that in a job. Indeed, it took me three jobs!); and I believe the quality of the product suffers (but then, I was at Berkeley. Unfair comparison? Well, no. No German university makes the top fifty in any of the more well-known rankings and I’m talking about possible reasons for that).

Let me amplify a little on that parenthetical comment. I had a colleague here in Bielefeld with over twenty or twenty-five “scientific assistants” in his group, people working at temporary jobs who hoped thereby to get their doctoral degree. At Berkeley, people, even Turing-award winners, had at most four or five doctoral candidates whom they supervised. The key word here is “supervised”. No one person can supervise twenty-five doctoral candidates to anything like the Berkeley norm. Indeed, supervision, such as it was, was mostly delegated to the post-docs. Of which, to achieve the same ratio, one would need five or six or seven (I recall there were three or four). And these doctoral-work supervisors were not Turing-award and like winners, not even NSF Young-Investigator Award winners, such as at Berkeley. They were people who had got their first research qualification and were mostly at the beginning of seeing whether they could make any kind of name for themselves.

A couple of years after I got to Bielefeld, I discovered that somebody in that group had just written a doctoral thesis on temporal reasoning for artificial agents. Temporal reasoning for artificial agents? That’s the very work that I was known for, partly on the basis of which I was hired (here, one is not hired but “called”). This guy had never talked to me. Curious, I looked at the work. After I read the statement of the problem, it was obvious how to solve it. Then I looked at his solution and it wasn’t anywhere near as good. (But there was some program code behind it.) Happens here. Happens quite a lot here. Doesn’t happen at Berkeley, by and large.

I faulted the research structure. The guy had a job, with a job description. He was a nice, friendly and capable guy. At the end of the job was the expectation of a doctoral degree. Which was duly awarded after satisfying the appropriate formal criteria. All very neat and clean. DFG money apparently well spent. But the sum total to the world’s knowledge of how to solve temporal reasoning problems with artificial agents was essentially nil. His energies, and the funding support, would surely have been better spent had he talked to me, and then worked on a problem of the same level of difficulty, but to which the solution was not known.

This is already a lot of anecdotes. But it is hard to see how to get at the point without recounting lots of anecdotes. For each anecdote has its individual answer: it’s a special case; or I misinterpreted; or I was sour at someone; or I’m just being arrogant; or I’m looking for excuses for something I haven’t done or don’t do. Maybe all true, but it is the number of anecdotes, interpreted as the weight of evidence, that persuades reasonable people that there is something to the set-up which encourages all this.

Indeed, I am convinced that the model in which aspiring researchers pick their own topic from amongst those offered, make personal connections with a senior researcher who is able to judge whether they might be capable of completing the work, are encouraged, indeed required, to correspond with more accomplished others who have worked on and solved similar issues, and have the freedom to change topic completely when the current one won’t work out, is a better way to induce productive research than the research-as-job model.

But this heavy structurally-constrained interpretation of what constitutes effective research goes much further. Recall my anecdote about DFG support for my workshop, above. Along those lines, consider the following. I am a Principal Investigator in CITEC (above), whose charter is coming up for renewal and the proposal is about to be submitted. It turns out that the business of saying what my group (essentially of two: me and my post doc) have accomplished and what we will accomplish in the next five-year period was delegated to a young colleague, whose job is supported through the institute through the five-year cycle, as indeed now are all professorial jobs in Germany (tenure has gone) and is thus dependent on the success of the upcoming proposal.

Despite offers to help, my colleague didn’t talk to me at all. Indeed, it took me a certain amount of effort to find out who was writing what about our work in the proposal, since apparently none of the stuff I wrote was going to make it in. He wrote one sentence about the work my group had accomplished over the four years (with apologies that he couldn’t find more). And he found no relevant publications, despite (he indicates) trawling our publications page. Well, during the course of the last few months I have been asked variously for one key publication; for five key publications; for ten publications not necessarily within the CITEC remit, all by various people none of whom are he. The Coordinator of CITEC (effectively the director) asked for a meeting, to explain to me that without any publications it didn’t look good for the proposal to include me.

What? People can’t find stuff I’ve written on the safety of mobile automatic devices in the last five years? Well, of course they can, but you see it doesn’t count. The DFG says peer-reviewed journal articles only.

There we go again. Structural constraints. Nobel-memorial-prize-winning economists, and sociologists, and political scientists, and legal scholars all write blogs. Hundreds and thousands of people read them and comment, including their peers, often in their blogs. Peer-reviewed? Most obviously! I just received a copy of a journal article (counts!) written by two colleagues about two essays (cited) I wrote in this blog. Other colleagues read my posts and they comment!

Another example. We started a mailing list in March 2011 on the Fukushima accident and I recently summarised my contributions, which amounted to 117 A4 pages in 12-pt type, for the workshop proceedings. Now, every word I wrote on that mailing list has been read by eminent colleagues on the list, and they have commented, frankly (it is a closed list). And I have commented on their writing in turn. That’s what you do on such a list, if you’re one of the people who do it (others prefer just to read). Peer review? How much more is it possible to get? And more easily?

The WWW has been pervasive for fifteen years and e-mailing lists for thirty. And there is still no measure of quality of contributions that is acceptable to the German research funding agency? (It is not the only one with such a view.) Astounding! It is not as if this is a hard problem. It would get to be a hard problem if what you want to try to define is The Definitive Measure of Scientific Quality, because there can’t be one. But judging the quality of blog posts or sustained mailing-list contributions is no easier or harder a job than judging the quality of peer-reviewed journal publications, indeed it’s often easier because you can ask more people.

Actually, what happened with the CITEC thing is this. Bernd, Jan and I figured a while ago that our textbook on Safety of Computer-Based Systems, which was solicited by a major publisher some years ago, would be written and out by now. And we thought one book would likely suffice to show what we’d done. One book is not five published papers; in this case it’s more like fifteen, and there will be more. But it isn’t out. Since it is a text, we need to be sure that the techniques introduced can actually be used by the target audience, students and engineers, and so some of our contributors belong to that target audience (not all textbook writers do this, but I happen to think it’s a very good idea). They are not necessarily as experienced writers as I am, so it simply takes longer than we’d thought. Quite understandable, one would have thought. But apparently there is no reasonable way to say to the DFG that the book is almost finished. (Someone might even want to say that a textbook isn’t research. But this one is, you know, just like Nancy Leveson’s new text. Read that one too!)

Structural constraints, and how they hinder effective support of effective research. Is everyone convinced by now? At least convinced to look at the issue more closely? Shall I stop here then?

Not quite. One more word, back to the original topic. “Interdisciplinary” is one of the buzz words of the new modes of research support. But the problems indicated above of support, publication and assessment of work which crosses traditional discipline boundaries, or the new boundaries left in place by a country’s Scientific Wise Owls and Funding Agencies, are deeper than a buzz word, or even than a buzz concept. The logicians can’t read aeronautics and the aeronauticists can’t read formal logic and the computer scientists don’t understand aerodynamics and the engineers don’t understand the sociologists and I doubt that is going to change rapidly under the hierarchically-directed research-as-job model, buzz word or no.



Certification Requirements for Commercial Airplanes

14 08 2011

I was browsing the invited lectures given under Martin Abadi’s College de France lecture series and came across this elegant, simple explanation of so-called Byzantine failures by the gentleman who invented the term, Leslie Lamport. Leslie’s two papers on the subject with Rob Shostak and Marshall Pease in the early 1980s, Reaching Agreement in the Presence of Faults and The Byzantine Generals Problem, are seminal. Kevin Driscoll et al.’s SAFECOMP 2003 paper, Byzantine Fault Tolerance: From Theory to Reality, as well as Kevin’s brilliant keynote talk at SAFECOMP 2010, Murphy Was an Optimist (of which the slides seem no longer to be on the WWW), show how prescient the SRI work was.

I met Leslie at SRI in 1984. Rob had just left, to finish and then sell his PC database SW “Paradox” with Richard Schwarz, starting his second career as a serial entrepreneur. A colleague commented at the time that the market for PC database software seemed already to be saturated, so leaving a good job for that was risky. I guess that’s how some make millions and some don’t! Marshall was still there, was reputed to be quite a successful stock purchaser, but is no longer with us.

Leslie’s Slide 2 shows what appears to be an Airbus A380, computers of some sort issuing pitch control commands (probably primary pitch control; Byzantine failures in the FMGEC software, which includes the autopilot, would not likely be safety-critical). And Slide 4 speaks of an “FAA requirement” that the “probability of catastrophic failure” of an airplane’s computer be less than “10^-10 per hour”.

It is common amongst computer scientists who deal with avionics issues to think that the reliability requirement for critical equipment with safety-related behavior is a probabilistic requirement. But it isn’t so. Probabilities of some sort do enter into assessment processes somewhere, but not so directly. It seems to me to be worthwhile to say some words about certification regulations. They can be somewhat abstruse unless you are a certification engineer (even for the regulator! See John Downer’s Trust and Technology: The Social Foundations of Aviation Regulation).

First, an aside about units: they should be “operational hours”, not simply “hours”. Most people probably correctly assume that. Besides, the difference between “operational hour” and “hour” for most commercial airplanes in continual, regular use is probably only a factor of two to four averaged over the service life of the airplane. Still, best to be precise.

Second, there is a figure known as the “10^-9 xxxxx” (where “xxxxx” is variously “requirement”, “condition”, “criterion”, depending). I guess this is what Leslie is referring to, rather than a “10^-10” criterion. There is a 10^-9 criterion in the Acceptable Means of Compliance (allied to the qualitative probability “Extremely Improbable”). The general functional safety standard IEC 61508, which does not apply to commercial aviation, although it is sometimes used for military systems, is written to regard anything claimed below a reliability level of 10^-9 per ophour as unrealistic (Ron Bell, Chair of the Maintenance Team for 61508 Parts 1-2, personal communication. Also, PBL self-communication: I am on the German national committee).

It is possible, though, that there are automotive systems, typically small electronics boxes fitted to many different common models of car, that might well get of the order of 10^10 operational hours on them (Mike Ellims, personal communication).

The 10^-9 criterion was looked at hard by John Downer, in his PhD thesis at Cornell, The Burden of Proof (I don’t think it has been published yet, which is a shame. I have a copy).

So, on to the main theme.

The certification requirements for large airplanes (i.e., all commercial transports) are contained in a document known in Europe as CS-25, the 2003 and subsequent versions of which are available from the EASA WWW site.

First observation. Contrary to what it looks like from Leslie’s slide, the technical requirement for computers or computer behavior is nil. Computers inherit any conditions on failure behavior solely through the requirements on the pieces of kit which they control, in the sense that there are dangerous-failure requirements on the entire subsystem. And the requirements on the pitch control subsystems are purely functional, saying what loads they must also withstand under which conditions, and how they must dynamically behave. (Check them out for yourself here!) No probability, no probability terms, no quantitative probability. So it is misleading to associate any 10^x condition with a requirement.

There is, however, an accompanying document to CS-25 called “Acceptable Means of Compliance” (AMC). That is, in order to demonstrate to the satisfaction of the certification authority that subsystem X does this and withstands that (as the certification requires), it is deemed by the authority acceptable to follow the guidance in the AMC. Of course, you can do it some other way also, if you can find one!

This is a notionally subtle but practically significant difference, between what is required and what is accepted as evidence that a requirement is fulfilled. If any system (such as the one Leslie illustrates) brings the airplane into a hazardous or catastrophic state, then it is an airworthiness issue and the problem has to be fixed. Full stop. And that is what is done. However, if the requirement were to be numerical, say “probability of dangerous failure of 1 in 10^9 per operating hour”, then one instance, or two instances, or even twenty instances, of a hazardous or catastrophic state, is/are compatible with that numerical requirement and the problem would not necessarily need to be fixed, since it could be argued that this very small probability had unfortunately been realised way earlier than expected. This difference is significant for lawyers arguing about the distribution of compensation (or “recovery” as they say), and compensation for loss is a universal principle some many thousands of years older than airplanes and their certification.
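The compatibility argument can be put in numbers. The sketch below makes a strong assumption that the text above does not: that dangerous failures arrive as a Poisson process with a constant rate per operating hour. The function name prob_at_least is invented for the example. The fleet exposure of 55 million operational hours is the A320 figure quoted elsewhere in this post; the rate is the 10^-9 figure under discussion.

```python
import math


def prob_at_least(k, rate_per_hour, hours):
    """Probability of k or more events in the given exposure, ASSUMING
    a Poisson process with constant rate (a modelling assumption)."""
    mean = rate_per_hour * hours
    below = sum(
        math.exp(-mean) * mean ** i / math.factorial(i) for i in range(k)
    )
    return 1.0 - below


rate = 1e-9          # per operating hour
fleet_hours = 55e6   # A320 fleet figure quoted in this post

p_one_or_more = prob_at_least(1, rate, fleet_hours)
p_two_or_more = prob_at_least(2, rate, fleet_hours)
```

Under these assumptions the chance of at least one event in the fleet’s history comes out at roughly one in twenty: small, but nowhere near impossible. So a single occurrence cannot by itself show that a numerical requirement was violated, which is exactly the opening for argument described above.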

I note with some embarrassment, however, that IEC 61508 makes “probability of dangerous failure of 1 in 10^x per operating hour” into a requirement, suffering the disadvantage I just noted of leaving it open, in the circumstance of a dangerous failure, whether the requirement has been met or not. I guess the lawyers can expect some business :-)

Actually, the whole business of what “probability” means in “probability of dangerous failure” is a can of worms. Let me leave that for another time.

AMC uses terms for hazard: Minor, Major, Hazardous and Catastrophic. It also uses terms for probability: Probable, Remote, Extremely Remote, and Extremely Improbable. These are technical terms and when they occur in the requirements they are capitalised. The meaning of “Extremely Improbable” is (historically) “not expected to occur within the service life of the airplane type”; “service life of the airplane type” means here the total number of operational hours of all airplanes of that type throughout the entire use history of the airplane (assuming of course that the airplanes are maintained as designed). The meaning of “Extremely Remote” is “…..once….”; the meaning of “Remote” is “…once per individual aircraft, and several times in the service life of the type”; “Probable” is “…..several times in the life of an individual aircraft”.

These definitions come from previous versions of the certification documentation (when it was known as JAR 25) and may be found in a 1982 book by Lloyd and Tye, Systematic Safety, published by the UK CAA. These definitions will have been applicable directly to the certification of the two most popular airplanes flying today, the Boeing 737 series (certification mid-1960s) and the Airbus A319/320/321 series (certification mid-1980s), but not to the certification of, say, the Airbus A380, which is mid-2000s. So let’s also look at later versions of the document.

The 2003 AMC-25 uses the terms for subsystem compliance, for example AMC 25-19 §6(c) says

(3) Extremely Improbable Failure Conditions: Extremely Improbable Failure Conditions are those so unlikely that they are not anticipated to occur during the entire operational life of all aeroplanes of one type, and have a probability of the order of 1 x 10^-9 or less. Catastrophic Failure Conditions must be shown to be Extremely Improbable.

We see that in the current certification document the qualitative terms are firmly bound to quantitative probability statements.

The reason for this change is that, in the days of Lloyd and Tye, someone did a back-of-envelope calculation and figured that “service life of the airplane type” could be expected to be somewhat less than ten million hours. It was then! But, for example, Airbus’s safety chief, Yannick Malinge, when giving evidence to a Subcommittee of the Brazilian Parliament in August 2009, pointed out that the A320 fleet had at that time some 55 million operational hours or more, if I remember correctly. (I also did a crude calculation of my own then, based on a guess at operational hours per year for a typical model, a uniform build rate since service introduction in 1988, and a 25-year service life of an individual airplane, and came up with a similar figure.) So for modern purposes that pre-1980s back-of-envelope calculation is at least an order of magnitude too low.

Then, following on with the reasoning as in Lloyd and Tye, people apparently thought there would be about 100 airplane subsystems which could be a single point of catastrophic failure, and so the condition that no single-point catastrophic failure should occur in the service life is 1 in 10 million (1 in 10^7) operational hours, divided by 100 airplane systems, so one in one billion per airplane system, leading to an average “probability” over the service life of 10^-9 per operational hour.
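For those who like to see the arithmetic spelled out, here is a minimal sketch of that back-of-envelope reasoning. The function name and the reading of “not expected to occur” as “at most once in the whole service life” are my assumptions for illustration; nothing like this appears in the certification documents.

```python
from fractions import Fraction

def per_system_budget(type_service_life_hours, single_point_systems):
    """Average per-hour failure budget for one system (hypothetical helper)."""
    # Assumption: "not expected to occur within the service life" is read
    # as at most one occurrence in the whole service life of the type.
    whole_airplane = Fraction(1, type_service_life_hours)
    # Share that budget equally among the single-point subsystems.
    return whole_airplane / single_point_systems

# Ten million hours of type service life, about 100 critical subsystems.
budget = per_system_budget(10**7, 100)
print(budget)         # 1/1000000000
print(float(budget))  # 1e-09
```

Exact fractions are used so that the one-in-a-billion figure drops out without any floating-point rounding.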

Anyhow, that is where the 10^-9 condition comes from, and nowadays the qualitative term is directly anchored to it, to avoid any calculations over expected fleet lives, since the actual fleet lives have proved to be rather different from that expected at certification time. Nobody expected they were going to sell going on for ten thousand airplanes of these types, but that is what it looks like might happen now!
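To see how sensitive the old reasoning is to the fleet-life guess, one can rerun the same sort of calculation with the roughly 55 million hours quoted for the A320 fleet and compare the result with the anchored figure. This is only a sketch: the helper name is made up, the count of 100 single-point subsystems is simply carried over from the historical argument, and the equal sharing of the budget among them is an assumption.

```python
from fractions import Fraction

# The figure the qualitative term is now anchored to, per operational hour.
ANCHORED_BUDGET = Fraction(1, 10**9)

def derived_budget(fleet_hours, single_point_systems=100):
    """Per-system, per-hour budget from the old reasoning (hypothetical helper)."""
    # Assumption: at most one occurrence over the whole fleet life,
    # shared equally among the single-point subsystems.
    return Fraction(1, fleet_hours * single_point_systems)

cases = [
    ("Lloyd and Tye era guess", 10**7),
    ("A320 fleet, August 2009", 55 * 10**6),
]
for label, hours in cases:
    budget = derived_budget(hours)
    ratio = ANCHORED_BUDGET / budget
    print(f"{label}: {float(budget):.2e} per hour, "
          f"anchored figure is {float(ratio):.1f} times this")
```

With ten million hours the derived budget coincides with the anchored figure; with 55 million hours the same reasoning would ask for a budget 5.5 times smaller, which is the kind of moving target the fixed number avoids.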

And there is nothing in the AMC about reliability of computers. There are things about reliability of systems which are driven by computers, for example displays, AMC 25-11 §4(3)(i):

(i) Attitude. Display of attitude in the cockpit is a critical function. Loss of all attitude display, including standby attitude, is a critical failure and must be Extremely Improbable. Loss of primary attitude display for both pilots must be Improbable. Display of hazardously misleading roll or pitch attitude simultaneously on the primary attitude displays for both pilots must be Extremely Improbable.

So that’s what the regulations say and the acceptable means of compliance suggest you do. For insight into how this works out in practice, read John Downer!

I offer here many heartfelt thanks to Clive Leyman, quondam Chief Aerodynamicist of Concorde, who did his best to put me straight on all this over the last few years. (I hope he thinks he succeeded!)