Certification Requirements for Commercial Airplanes

14 08 2011

I was browsing the invited lectures given under Martin Abadi’s College de France lecture series and came across this elegant, simple explanation of so-called Byzantine failures by the gentleman who invented the term, Leslie Lamport. Leslie’s two papers on the subject with Rob Shostak and Marshall Pease in the early 1980s, Reaching Agreement in the Presence of Faults and The Byzantine Generals Problem, are seminal. Kevin Driscoll et al.’s SAFECOMP 2003 paper, Byzantine Fault Tolerance: From Theory to Reality, as well as Kevin’s brilliant keynote talk at SAFECOMP 2010, Murphy Was an Optimist (of which the slides seem no longer to be on the WWW) shows how prescient the SRI work was.

I met Leslie at SRI in 1984. Rob had just left, to finish and then sell his PC database software “Paradox” with Richard Schwartz, starting his second career as a serial entrepreneur. A colleague commented at the time that the market for PC database software seemed already to be saturated, so leaving a good job for that was risky. I guess that’s how some make millions and some don’t! Marshall was still there, was reputed to be quite a successful stock purchaser, but is no longer with us.

Leslie’s Slide 2 shows what appears to be an Airbus A380, computers of some sort issuing pitch control commands (probably primary pitch control; Byzantine failures in the FMGEC software, which includes the autopilot, would not likely be safety-critical). And Slide 4 speaks of an “FAA requirement” that the “probability of catastrophic failure” of an airplane’s computer be less than “10^-10 per hour”.

It is common amongst computer scientists who deal with avionics issues to think that the reliability requirement for critical equipment with safety-related behavior is a probabilistic requirement. But it isn’t so. Probabilities of some sort do enter into assessment processes somewhere, but not so directly. It seems to me to be worthwhile to say some words about certification regulations. They can be somewhat abstruse unless you are a certification engineer (even for the regulator! See John Downer’s Trust and Technology: The Social Foundations of Aviation Regulation).

First, an aside about units: they should be “operational hours”, not simply “hours”. Most people probably correctly assume that. Besides, the difference between “operational hour” and “hour” for most commercial airplanes in continual, regular use is probably only a factor of two to four averaged over the service life of the airplane. Still, best to be precise.

Second, there is a figure known as the “10^-9 xxxxx” (where “xxxxx” is variously “requirement”, “condition” or “criterion”, depending). I guess this is what Leslie is referring to, rather than a “10^-10” criterion. There is a 10^-9 criterion in the Acceptable Means of Compliance (allied to the qualitative probability “Extremely Improbable”). The general functional safety standard IEC 61508, which does not apply to commercial aviation, although it is sometimes used for military systems, is written to regard anything claimed below a reliability level of 10^-9 per ophour as unrealistic (Ron Bell, Chair of the Maintenance Team for 61508 Parts 1-2, personal communication. Also, PBL self-communication: I am on the German national committee).

It is possible, though, that there are automotive systems, typically small electronics boxes fitted to many different common models of car, that might well get of the order of 10^10 operational hours on them (Mike Ellims, personal communication).

The 10^-9 criterion was looked at hard by John Downer in his PhD thesis at Cornell, The Burden of Proof (I don’t think it has been published yet, which is a shame. I have a copy).

So, on to the main theme.

The certification requirements for large airplanes (i.e., all commercial transports) are contained in a document known in Europe as CS-25, the 2003 and subsequent versions of which are available from the EASA WWW site.

First observation. Contrary to what it looks like from Leslie’s slide, the technical requirement for computers or computer behavior is nil. Computers inherit any conditions on failure behavior solely through the requirements on the pieces of kit which they control, in the sense that there are dangerous-failure requirements on the entire subsystem. And the requirements on the pitch control subsystems are purely functional, saying what loads they must also withstand under which conditions, and how they must dynamically behave. (Check them out for yourself here!) No probability, no probability terms, no quantitative probability. So it is misleading to associate any 10^x condition with a requirement.

There is, however, an accompanying document to CS-25 called “Acceptable Means of Compliance” (AMC). That is, in order to demonstrate to the satisfaction of the certification authority that subsystem X does this and withstands that (as the certification requires), it is deemed by the authority acceptable to follow the guidance in the AMC. Of course, you can do it some other way also, if you can find one!

This is a notionally subtle but practically significant difference, between what is required and what is accepted as evidence that a requirement is fulfilled. If any system (such as the one Leslie illustrates) brings the airplane into a hazardous or catastrophic state, then it is an airworthiness issue and the problem has to be fixed. Full stop. And that is what is done. However, if the requirement were to be numerical, say “probability of dangerous failure of 10^-9 per operating hour”, then one instance, or two instances, or even twenty instances, of a hazardous or catastrophic state, is/are compatible with that numerical requirement and the problem would not necessarily need to be fixed, since it could be argued that this very small probability had unfortunately been realised way earlier than expected. This difference is significant for lawyers arguing about the distribution of compensation (or “recovery” as they say), and compensation for loss is a universal principle some many thousands of years older than airplanes and their certification.

I note with some embarrassment, however, that IEC 61508 makes “probability of dangerous failure of 1 in 10^x per operating hour” into a requirement, suffering the disadvantage I just noted of leaving it open, in the circumstance of a dangerous failure, whether the requirement has been met or not. I guess the lawyers can expect some business :-)

Actually, the whole business of what “probability” means in “probability of dangerous failure” is a can of worms. Let me leave that for another time.

AMC uses terms for hazard: Minor, Major, Hazardous and Catastrophic. It also uses terms for probability: Probable, Remote, Extremely Remote, and Extremely Improbable. These are technical terms and when they occur in the requirements they are capitalised. The meaning of “Extremely Improbable” is (historically) “not expected to occur within the service life of the airplane type”. Here “service life of the airplane type” means the total number of operational hours of all airplanes of that type throughout the entire use history of the airplane (assuming of course that the airplanes are maintained as designed). The meaning of “Extremely Remote” is “… once …”; the meaning of “Remote” is “… once per individual aircraft, and several times in the service life of the type”; “Probable” is “… several times in the life of an individual aircraft”.

These definitions come from previous versions of the certification documentation (when it was known as JAR 25) and may be found in a 1982 book by Lloyd and Tye, Systematic Safety, published by the UK CAA. These definitions will have been applicable directly to the certification of the two most popular airplanes flying today, the Boeing 737 series (certification mid-1960s) and the Airbus A319/320/321 series (certification mid-1980s), but not to the certification of, say, the Airbus A380, which is mid-2000s. So let’s also look at later versions of the document.

The 2003 AMC-25 uses the terms for subsystem compliance, for example AMC 25-19 §6(c) says

(3) Extremely Improbable Failure Conditions: Extremely Improbable Failure Conditions are those so unlikely that they are not anticipated to occur during the entire operational life of all aeroplanes of one type, and have a probability of the order of 1 × 10^-9 or less. Catastrophic Failure Conditions must be shown to be Extremely Improbable.

We see that in the current certification document the qualitative terms are firmly bound to quantitative probability statements.

The reason for this change is that, in the days of Lloyd and Tye, someone did a back-of-envelope calculation and figured that “service life of the airplane type” could be expected to be somewhat less than ten million hours. It was then! But, for example, Airbus’s safety chief, Yannick Malinge, when giving evidence to a Subcommittee of the Brazilian Parliament in August 2009, pointed out that the A320 fleet had at that time some 55 million operational hours or more (if I remember correctly. I also did a crude calculation of my own then, based on a guess at operational hours per year for a typical model, a uniform build rate since service introduction in 1988, and 25-year service life of an individual airplane, and came up with a similar figure). So for modern purposes that pre-1980s back-of-envelope calculation is at least an order of magnitude too low.
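A crude fleet-hours estimate of the kind described can be sketched in a few lines. This is only an illustration under stated assumptions: the build rate and the hours flown per airframe per year are hypothetical placeholders, not the inputs actually used above, and the function name is invented for the example.

```python
def fleet_hours(years_since_intro, builds_per_year, hours_per_airframe_year,
                airframe_life_years=25):
    """Crude cumulative operational hours for a type built at a uniform rate."""
    total = 0
    for build_year in range(years_since_intro):
        # An airframe flies from its build year until now, or until it is retired.
        years_flown = min(years_since_intro - build_year, airframe_life_years)
        total += builds_per_year * hours_per_airframe_year * years_flown
    return total

# Hypothetical inputs: 21 years in service (1988 to 2009), 100 airframes built
# per year, 2,500 operational hours per airframe per year.
estimate = fleet_hours(21, 100, 2500)
print(f"{estimate:,} operational hours")
```

With these placeholder inputs the sketch gives 57,750,000 hours. The point is only that plausible inputs land in the tens of millions, well above the ten million assumed in the original back-of-envelope calculation.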

Then, following on with the reasoning as in Lloyd and Tye, people apparently thought there would be about 100 airplane subsystems which could be a single point of catastrophic failure. So the condition that no single-point catastrophic failure should occur in the service life, namely 1 in 10 million (1 in 10^7) per operational hour, is divided among 100 airplane systems, giving one in one billion per airplane system. That leads to an average “probability” over the service life of 1 in 10^9 (that is, 10^-9) per operational hour.
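The arithmetic behind that figure is short enough to write down. The following is a minimal sketch of the reasoning as described here, not a reproduction of any calculation in Lloyd and Tye; the function name is invented, and the second call simply reruns the same reasoning with the 55 million hour fleet figure mentioned earlier.

```python
def per_system_rate(fleet_life_hours, critical_systems):
    """Per-system catastrophic-failure budget, per operational hour."""
    # At most one catastrophic failure expected over the whole fleet life...
    whole_aircraft_rate = 1.0 / fleet_life_hours
    # ...shared equally among the potential single points of failure.
    return whole_aircraft_rate / critical_systems

# Assumptions of the Lloyd and Tye era: 10 million fleet hours, 100 systems.
print(f"{per_system_rate(1e7, 100):.1e} per operational hour")
# Same reasoning with a fleet life of 55 million hours (illustrative only).
print(f"{per_system_rate(5.5e7, 100):.1e} per operational hour")
```

The first case gives the familiar 10^-9. The second gives a figure several times smaller, which is one way of seeing why anchoring the qualitative term to a fixed number avoids arguments about expected fleet life.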

Anyhow, that is where the 10^-9 condition comes from, and nowadays the qualitative term is directly anchored to it, to avoid any calculations over expected fleet lives, since the actual fleet lives have proved to be rather different from that expected at certification time. Nobody expected they were going to sell going on for ten thousand airplanes of these types, but that is what it looks like might happen now!

And there is nothing in the AMC about reliability of computers. There are things about reliability of systems which are driven by computers, for example displays, AMC 25-11 §4(3)(i):

(i) Attitude. Display of attitude in the cockpit is a critical function. Loss of all attitude display, including standby attitude, is a critical failure and must be Extremely Improbable. Loss of primary attitude display for both pilots must be Improbable. Display of hazardously misleading roll or pitch attitude simultaneously on the primary attitude displays for both pilots must be Extremely Improbable.

So that’s what the regulations say and the acceptable means of compliance suggest you do. For insight into how this works out in practice, read John Downer!

I offer here many heartfelt thanks to Clive Leyman, quondam Chief Aerodynamicist of Concorde, who did his best to put me straight on all this over the last few years (I hope he thinks he succeeded!)



The British Phone-Hacking Scandal

27 07 2011

I’ve been watching the phone-hacking scandal closely, even to the point of reading the Guardian’s timeline of the parliamentary debate last Wednesday (20th July) every few minutes or so. I don’t agree with those in parliament who suggested that “the people” are tired of it. This people most certainly is not. It says a lot about modern Britain. So, what is this lot that it says about modern Britain? Here, a beginning.

First, a preamble. I am mature enough to feel the need to start racing kids on their bicycles on the street and to regard the term “fogey” as an approbation (but I will correct you if you prepend the term “old”). And to say whatever I like about people under 40 (for example, Mr. J. Murdoch; see below).

I read an article in the Independent this morning in which it says


On two occasions, James Murdoch and former News International chief executive Rebekah Brooks were given confidential defence briefings on Afghanistan and Britain’s strategic defence review by the Defence Secretary, Liam Fox. A further briefing was held with Ms Brooks, Rupert Murdoch and the Sunday Times editor John Witherow.

and I think of what Abe Lincoln might have said: “Government of the press, by the press, for the press, shall not perish from the earth.” There is just more, and more, and more of it. And then more.

Newspapers are essential. Let me rephrase: Good newspapers are essential. I think the British press has given up its former partial role as informer and arbiter of social reality (I am not quite sure how to phrase it – the experience of reading a newspaper article and knowing you were getting objective and moderately complete information through your reading it) – a role which papers such as the NYT, Washington Post, and in Germany FAZ and SZ still play, and which at least The Times used to play in GB and no longer does (for example, The Times’s extremely poor and quite poorly-opinionated coverage of the Air France South Atlantic accident, as compared with that of the NYT). Now, the Brit/American Roger Cohen, who writes columns for the NYT and is almost always worth reading, had an interesting perspective. A week ago, he argued that Rupert Murdoch had been good for the British press, on the basis that he had kept it alive and thriving at a point at which it could well have died (he suggests that The Times would likely have disappeared were it not for Murdoch). I think much of that may well be right – it is hard to see how the newspaper business could have survived, given the then-demands of the printers’ unions, and Murdoch single-handedly changed that situation. But the daily printed word seems to have become much less trustworthy in the UK in a way in which, for example, the best newspapers elsewhere (NYT, WP, SZ, FAZ) have not. Even the WSJ, another paper which can be argued to have been Murdoch-rescued, has not succumbed. There just seems to be something about the British press in which I suspect Murdoch&family to have significant influence over content. I don’t have proof, just suspicion.

On to government. Everyone notes wrily the French “corporate state” being run by ENiAcs, but few people have noted how Britain has reverted to being run by Oxbridge graduates – this time, indeed, by people who were once what we used to call “little rich kids”, former members of the Bullingdon club (look it up in Wikipedia). Indeed, five members of the current government went to my very college. Now, I am moderately attached to and supportive of my college, but I am also very aware of how one’s upbringing affects one’s attitude to life and am sceptical that people who were as financially and socially privileged as some of these were can understand, even begin to solve, issues to do with Britain’s poor and underprivileged, or the structural-economic issues involved with Lancashire, Yorkshire, Northumberland and Durham, or with Scotland, indeed with any parts except London and enclaves of wealthy people. Or even figure out what is right and what is wrong with the NHS, or with state secondary education, neither of which any of them have ever had to experience.

I believe that the NHS and the state education of the sort I received are two of the great achievements in Britain of the last century. And I do have personal experience of three health systems, and three university systems, as well as intimate knowledge of features of school systems, over decades in three very different countries – and of course three newspaper systems – so I like to think my perspective is informed.

The NHS is being slowly destroyed, I think, through successive poor policy and management over decades. But I don’t have more than this to say here.

I think that state secondary school education has been on the down for decades. I entered the English university system from school; it was then scholastic-inclined and elitist, with intake some very few percent of the population. After some culture shock at then entering a system, the University of California, which took some few percent of a very different population, I came gradually to see the enormous advantages of a higher-education system which addressed over 50% of school leavers (in US universities and community colleges, in almost all of which one could do the first year or two of any university coursework at – then – no cost).

So I had hopes, for a decade or two, for the English university system, but perceiving the conditions under which my English colleagues work, and what has happened to courses and coursework and now student fees, I can’t any longer say that I think things have improved. What I can say is that for younger academics at the start of their careers the system is still superior, more humane and more encouraging, than most or all of those in continental Europe, or even the US. So that remains a beacon of hope (sorry for the cliche). But for the general British university situation, I can’t see that the privileged rich kids in government can have much personal insight into the matters that count: who should be going to university, why, and under what conditions. Without personal insight and experience, I don’t see how one can distinguish policies that might work from those that won’t. I can’t see, for example, any 18 year old who has been trying to manage a couple of quid a week pocket money being able to make a well-informed decision that going into debt for £27,000 (£9,000 per year) plus living expenses is going to be at all worth it for his or her future life. Maybe so for, say, law, microeconomics or engineering, but not for, say, Eng. lit., Latin&Greek, French lit., German lit., philosophy, or those other courses of study which one might imagine would give a future lawyer, politician or civil servant some perspective on the variety of life with which they will be dealing and train some important skills such as producing a coherent argument, and being able to write decently. If such a choice had been presented to me, I would probably have carried on working at the CEGB (anyone remember them?) and taken engineering classes in night school. In contrast, I can see that choice being easily made, not only for British rich kids, but also for many or most young Americans. Let me just say that money plays a different role there; enough that it was part of my culture shock when I got there.

So, back to the scandal, what is significant in this one?

1. The extent to which it has become clear how Britain is run by elites, many of whom appear to move in the same social circles. At least Blair used to hob-nob with rock stars, most of whom are self-made people who were not financially privileged when they started, and probably still remember what life was like with mum and dad trying to figure out if the family could afford to go on holiday that year, rather than what fun they used to have in the Bullingdon club. But one cannot imagine either him or Brown regularly lunching and partying with, say, the Gallagher brothers.

2. The extent to which it has become clear how British life is influenced by those elites, and in what direction. You’ll find articles about Paris Hilton’s, Lindsay Lohan’s and Britney Spears’s latest jaunts in the NYT also, but you will also find technical details of GE Boiling Water Reactors and why they are susceptible to this-and-that. The German press will point you to technical documents of the German regulator and safety watchdog available on the WWW. Whereas one will search the British press fruitlessly for any details concerning British nuclear power plants.

3. The extent to which the police appear to have been influenced by those elites. When I grew up, the bobby and the doctor were examples of public servants who performed useful functions largely independently of anything and anybody else (although of course there were always corrupt bobbies and incompetent doctors). Wednesday, I read through the Home Affairs Select Committee report and was astonished at the police behavior, which appears to be collusive to an extraordinary extent at high levels. But maybe those who have actually lived in Britain in the last two decades are less astonished?

4. The extent to which the old trope “I’m the top guy. I didn’t know anything about what was going on lower down” is nowadays used as a defence of one’s (in)actions. Thirty years ago, it was the major reason for resigning! (As indeed Messrs Yates and Stephenson have done – so it still is to some extent. And Hayman got hammered by the Home Affairs Select Committee when he tried to use it, so someone still remembers the “old days”.)

5. I am, though, pleased to see the effectiveness of Select Committees. James Murdoch saying he had been advised by his consultants to tell the truth (oh, well, nice to know you get advice from wise people, Mr. Murdoch!). And two days later Crone and Myler contradicting his “defence” as in point 4. Indeed, it is hard to believe any business person agreeing to settle a privacy-invasion case for ten times the going rate (Mosley won £60,000 against the NOTW in court at about the same time, and even that was up to ten times the award of many successful privacy-invasion suits), plus full legal expenses, without asking why. I suspect that makes James Murdoch toast, business-wise, whatever the truth turns out to be. I also suspect he may have to work a little to stay out of jail, but see point 3 above. So even though they may be pocketing taxpayers’ money to have their moats cleaned, some politicians are apparently still able to do a decent job on other people’s misdemeanors.

6. There are the kinds of things which either make one regret that one didn’t go into politics, or very relieved that one stayed out. The financial collapse three years ago, for example (which, by the way, I thought was brilliantly handled by Gordon Brown, alone amongst Western leaders). But there are also the kinds of things which lead me to general despair. This is one of those. It’s a “time to emigrate” moment. Except that I did, and now I’m running out of places. Canada? It’s cold and there’s that snorting elephant to the south. Australia? I’m not sure I have the energy to learn another new language. New Zealand? All those sheep! But I’d feel at home with the earthquakes.

7. Maybe it’s time to form a new political party for those who work hard, pay their taxes, and expect them to go somewhere useful like health care, care of the elderly, education, effective oversight of finance and critical infrastructure, public transportation, and effective urban reinvigoration. (Germany at least gets the last two right.) Wait a minute! Didn’t we have one of those? What happened to it?



Chinese Train Collision

26 07 2011

On Saturday July 23, a high-speed train lost power and either slowed down or stalled, and a second one rear-ended it, in or near Wenzhou city, on a line in Zhejiang province: the Independent newspaper reports.

The lost power was said to be due to a lightning strike.

Unfortunately, the collision took place on a viaduct and four cars of the moving train fell off, some 20-30 meters. Some 30-40 people are reported to have died.

It is important to keep things straight. Electric railways have been affected by lightning before. The view of experienced rail safety people is somewhat different from the press reports. Here is David Tombs, one of the safety engineers of Queensland Rail, in a message to the York safety-critical systems mailing list:


News sites are emphasising the loss of power on the first train, but that by itself should not lead to a collision. To allow second train onto the same piece of track, there has been a clear and tragic breach of safeworking.

Exactly.

“Safeworking” is based on the block principle. The track is divided into logical blocks, to which access is controlled (by some form of signalling, or remote control). At most one train is to be in one block at any given time, unless all trains in the block are operating under “stop-on-sight” rules. This safeworking principle (rather: set of principles) has evolved for well over a century and is enshrined in the operation of every multiple-train railway line on earth. High-speed lines usually use some form of continuous sensing of train position, and in-cab signalling (the train is signalled, even controlled remotely, by a remote controller who knows exactly where the train is and at what speed it is travelling).
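The block principle itself is simple enough to state as code. What follows is a toy sketch only: the class and method names are hypothetical, it models nothing about real interlockings or in-cab signalling, and it reduces “stop-on-sight” to a flag per train.

```python
class FixedBlockLine:
    """Toy model of fixed-block safeworking: at most one train per block."""

    def __init__(self, block_count):
        # For each block: {train name: True if running under stop-on-sight rules}
        self.blocks = [dict() for _ in range(block_count)]

    def request_entry(self, train, block, stop_on_sight=False):
        """Admit `train` to `block` only if the block principle allows it."""
        occupants = self.blocks[block]
        # Sharing a block is allowed only if every train in it, including
        # the one entering, is operating under stop-on-sight rules.
        shared_ok = stop_on_sight and all(occupants.values())
        if occupants and not shared_ok:
            return False  # block occupied: entry refused
        occupants[train] = stop_on_sight
        return True

    def leave(self, train, block):
        self.blocks[block].pop(train, None)


line = FixedBlockLine(block_count=3)
first = line.request_entry("first train", block=1)
second = line.request_entry("second train", block=1)
print(first, second)  # True False
```

In this sketch a stalled train simply stays in its block, and a following train is refused entry until the block is cleared. Loss of power on the first train changes nothing about that logic, which is David’s point.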

Trains travelling at high speed need kilometers to slow down and stop. Blocks are correspondingly long. That means that dispatch times between successive trains on the same line are correspondingly long, and that limits the capacity of the track. People have thus been attracted to the idea of “moving blocks”, whereby the exclusion area moves with a moving train, and is no longer geographically fixed. But this is an idea, and is by no means technically mature. Rail people are understandably reluctant to give up a system, the fixed-block system, which has proved its worth over more than a century. Further explanation of moving-block technology can be found at http://www.railway-technical.com/sigtxt3.shtml.

James Schapel has identified the signalling-system provider, HollySys, in another message to the York list.


HollySys claim to be one of only five automation control systems and products providers approved by China’s Ministry of Railways in the 200km to 250km high-speed rail segment, and one of only two automation control systems and products providers approved in the 300km to 350km high-speed rail segment:

http://www.hollysys.com.sg/home/index.php/about-us

HollySys also claim that its Automatic Train Protection (ATP) has been certified to Safety Integrity Level (SIL) 4 according to the European Committee for Electrotechnical Standardization (CENELEC) standards:

http://www.hollysys.com.sg/home/index.php/investor-relations/press-releases/522-october-27-2009-hollysys-automation-announces-its-proprietary-high-speed-rail-atp-product-certified-by-european-safety-standard

This claim, that such-and-such a system has been “certified to SIL X” according to some standard which uses Safety Integrity Levels (SILs) is becoming more prevalent amongst suppliers of safety-critical technologies. It is well to inquire what it is supposed to mean.

The Safety Integrity Levels of the applicable CENELEC rail standards are based on a permissible average rate of dangerous failures of the system. A “SIL 4” system which is in continuous operation is only allowed to fail dangerously on average once every hundred million to one billion operating hours.

Taken literally, the claim of “certification to SIL 4” can only mean that some overseer organisation has checked arguments that the system only fails dangerously on average once every hundred million to billion operating hours, and has said it thinks those arguments are good.

One should be extremely suspicious of any such assertion. Most arguments I and others have seen for such extreme levels of safety-function-reliability are inadequate.

In particular, if the signalling indeed behaved as claimed, the Chinese accident, which has happened well within a few million hours of operation (recall that there are only about 9,000 hours in a year) demonstrates empirically that it is very likely the system has a dangerous failure rate much higher than that.
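To put rough numbers on that, here is a minimal sketch assuming a constant-rate (Poisson) failure model, which is itself a simplifying assumption. The exposure of three million hours is a hypothetical stand-in for “a few million”, and the function name is invented for the example.

```python
import math

def prob_at_least_one(rate_per_hour, exposure_hours):
    """P(one or more dangerous failures) under a constant-rate Poisson model."""
    return 1.0 - math.exp(-rate_per_hour * exposure_hours)

exposure = 3e6  # hypothetical: "a few million" operating hours
for rate in (1e-8, 1e-9):  # the SIL 4 band quoted above
    p = prob_at_least_one(rate, exposure)
    print(f"rate {rate:g}/h: P(>=1 dangerous failure) = {p:.4f}")
```

Under these assumptions, a system genuinely meeting SIL 4 would show a dangerous failure this early with a probability of about 3% at the weaker end of the band and about 0.3% at the stronger end. A single event is thin evidence on its own, but it sits much more comfortably with a failure rate well above the claimed one.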



A Fukushima Diary

19 07 2011

In preparation for my talk at the 11th Bieleschweig Workshop, on the Fukushima accident and systems prone to extreme unsafe events, I have prepared a synopsis of my contributions to the mailing list on the accident which we set up in Bielefeld, called A Fukushima Diary. It’s about 110pp long, so a little too long for a blog post.



Standardising Causal Analysis

30 06 2011

As a member of the German national committee for standards concerning the functional safety of electrical/electronic/programmable-electronic systems (known in the jargon as E/E/PE systems), I received on 11th May a document sent to another standards committee, proposing an international standardisation project for Root Cause Failure Analysis through the International Electrotechnical Commission, IEC, the ISO affiliate responsible for things computerish.

Now, I like to think I know something about Causal Analysis of accidents involving engineering artefacts. I proposed my method Why-Because Analysis (WBA) (see also examples of WBA amongst Causalis publications), based on the insights into causality of, amongst others, David Hume and David Lewis, somewhere around 15 years ago. We used it then predominantly for analysing accidents to large commercial aircraft, whose operation increasingly involved computational components, and I was one of the very few people I knew who was both an instrument-rated pilot and an analyst of the kinds of distributed computer-based systems found in such aircraft. Our (rather, my group’s tech-transfer company Causalis’s) first commercial analysis contract was in 1998, as an advisor to the lawyers for the plaintiffs in the civil lawsuit concerning the 1994 Nagoya A300 accident.

WBA attracted somewhat of a following. Two divisions of Siemens, Rail Automation (which makes signalling systems) and Mass Transit (which makes trams) have adopted it as an internal company analytical procedure, and we started the Bieleschweig Workshops in Bielefeld and Braunschweig, whose first few meetings concentrated on Root Cause Analysis. The two German university departments which aid the German railways on accident analysis, the IfEV at TU Braunschweig and the Institute for Rail Systems at TU Dresden, also adopted WBA for research and teaching. We continue to use it of course to aid accident-compensation negotiations, and even to aid the criminal defence of an inappropriately-accused microlight inspector.

Nancy Leveson at MIT has an accident analysis method which she announced in the early 2000s, STAMP. It is based on a hierarchical model of social organisation, due to Rasmussen and Svedung, with each level construed as a feedback control system. WBA is based largely on the rigorous application of one specific test for being a causal factor, the Counterfactual Test. Colleagues of ours at Siemens and TU Braunschweig compared the methods, as they then were, in 2003. There are of course other methods – Chris Johnson has surveyed some in his book, such as MES, STEP, MORT and PRISMA. Chris seems partial to ECF analysis. The ATSB used to use the so-called “Reason model”, after the analyst Jim Reason, formerly of Manchester University, and now uses Accimaps, a simple hierarchical representation due to Rasmussen before the more elaborate work with Svedung. (My contention is that all these methods use an informal rule-of-thumb intuitive version of the Counterfactual Test, and add stuff on top. WBA makes sure one gets at least the counterfactual analysis right, whatever else one wants to do. A universal method, if you like, even if you don’t use our software.)
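For readers who have not met it, the flavour of a counterfactual test can be shown with a toy. This is emphatically not WBA or the Causalis software: it is a simplified sketch that models an outcome as a boolean function of a few factors and asks, for each factor, whether the outcome would still have occurred had that factor alone been absent. The formal semantics due to Lewis is a good deal subtler than flipping one flag. All names and the scenario are hypothetical.

```python
from typing import Callable, Dict, List

Factors = Dict[str, bool]

def necessary_causal_factors(factors: Factors,
                             outcome: Callable[[Factors], bool]) -> List[str]:
    """Return the factors without which the outcome would not have occurred."""
    if not outcome(factors):
        return []  # nothing to explain: the outcome did not happen
    necessary = []
    for name, occurred in factors.items():
        if not occurred:
            continue
        # Counterfactual world: everything as it was, except this factor absent.
        counterfactual = {**factors, name: False}
        if not outcome(counterfactual):
            necessary.append(name)
    return necessary

# Hypothetical toy scenario, not an analysis of any real accident.
observed = {
    "first_train_stalled": True,
    "second_train_admitted_to_block": True,
    "raining": True,
}

def collision(f: Factors) -> bool:
    return f["first_train_stalled"] and f["second_train_admitted_to_block"]

print(necessary_causal_factors(observed, collision))
# ['first_train_stalled', 'second_train_admitted_to_block']
```

Even in this toy, two factors pass the test side by side and neither is “the” root at the end of a chain, while the rain drops out. The result has the shape of a graph rather than a single sequence.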

I read in the working draft of the RCFA standardisation proposal, under “Analysis Phase”, that


The analysis phase uses the collected data to build the sequence of events leading to the failure event, which is presented as a cause chain. The cause chain determines the direct, contributing, and root causes of the failure event. The direct cause is the first one in the cause chain, thus directly leading to the failure event. The root cause is the last one in the chain, while the contributing causes are the ones in between the direct and the root causes.

This root cause is the stopping point and is the place where, with appropriate corrective action, the problem will be eliminated and will not reoccur.

To be effective the analysis must establish a sequence of events or timeline to understand the relationships between contributory factors, the root causes and the failure event. The analysis will identify the reasons why the causes immediately preceding and surrounding the failure event existed, working backwards to the root causes.

I was -negatively- astonished. Much of this material contradicts what those of us who work in accident and failure analysis know.

I know people who are involved with the relevant German committee as guests. I wrote to our committee administrator listing some technical mistakes. He forwarded my note via the administrator of the responsible committee to the vice-chairman of that committee, who contacted the people I said I knew for their opinion on my technical points. They all responded that they agreed with the technical points (of course!). But it is also part of the process to decide whether a country supports a standardisation effort on a particular subject or not, and some indicated they would support the project; they are, however, guests not committee members.

I also wrote my German committee colleagues directly, as well as those others I know around the world who are interested in causal analysis of engineering failures and accidents. I wrote this note to the Safety-Critical System mailing list at the University of York. (Readers may follow the thread most easily by going to the archive page, choosing “thread view”, and searching for “New International Standard – Urgent Action Needed”, because there are ostensibly two thread titles referring to the same subject matter, and there are also some replies to my original note which occur in the thread view but are somehow not regarded as part of the thread. In the thread view, the messages are spatially, although not temporally, contiguous.) The responses were almost uniformly negative. Check out, for example, Rob Alexander’s, Andrew Rae’s and Nancy Leveson’s devastating short comments. Bertrand Ricque’s note pointed out that standardisation efforts may well be motivated by reasons other than technical.

I said the following in my personal e-mailing to colleagues.


The following things are most obviously technically wrong with this.

A1. There is usually no “cause chain”. There is rather an interconnected network (or mathematical “graph”) of causal factors. Here “usually” means: in my work I have seen no case of a chain which provided anything like an adequate causal analysis of a failure.

A2. There is usually no single “direct cause” as here defined. Rather, the causal factors are dependent on what events and states are regarded by the analyst(s) as relevant, and *relative to that choice* there may or may not be a “direct cause” as here defined. In most of the failures I have analysed, there is no single “direct cause” as here defined; rather, many.

A3. There is usually no single “root cause” as here defined. Rather, the causal factors are dependent on what events and states are regarded by the analyst(s) as relevant, and *relative to that choice* there may or may not be a single “root cause” as here defined. In most of the failures I have analysed, there is no single “root cause” as here defined; rather, many.

A4. There is usually no “stopping point” as here defined. Rather, the analyst(s) must invoke a “stopping rule” to say what further causes they no longer consider as relevant. This stopping rule is best formulated explicitly, and it represents a choice by the analyst (to use that explicit rule) rather than anything objective in the causality itself.

A5. The standard does not adequately define the various notions of “cause”, despite there being logically precise definitions in the scientific literature since, at the latest, 1973, and precise engineering-relevant definitions in the engineering literature since the 1990s.

To support point A5, consider the entire set of definitions of “cause” in the proposed standard:

[begin quote]

3.2 failure cause: circumstances during specification, design, manufacture or use that result in failure

3.4 direct cause: condition or action that directly resulted in the failure event, without which the failure event would not have occurred

3.5 contributing cause: condition or action that occurred that did not directly lead to the failure event and therefore by itself would not have caused the failure event

3.6 cause chain: cause and effect sequence in which a specific action creates a condition that contributes to or results in a failure event

3.7 root cause: condition or action that sets in motion the cause and effect chain that creates the failure event

[end quote]

The following things are technically wrong with these:

B1. They are imprecise; e.g. “circumstances…that result in….”
i. What are “circumstances”? Events or states?
ii. What does it mean to “result in”?

B2. With the exception of the word “directly”, this definition is the so-called Counterfactual definition of causal factor; most (in some representations, all) of the factors represented in a causal graph satisfy this definition, not just one. The word “directly” remains undefined.

B3. All events and states in the world, *except* those that “directly led to the failure event”, satisfy this definition of contributing cause. My typing these words now did not directly result in the Fukushima nuclear power plant accident and therefore by itself would not have caused the Fukushima nuclear power plant accident. It follows, word for word, that according to this proposed standard my typing these words now is a contributing cause of the Fukushima nuclear power plant accident. This suggestion is absurd! But it is what the definition says.

B4. “Cause…chain” is a “sequence”. “Sequence” is not further defined. According to normal usage, a sequence is a linear ordering, a succession of items, one following another. There is nothing wrong with this definition as such. There is something very wrong with an analysis technique which suggests that “cause….chain” under this definition is what needs to be identified, as in A1 above.

B5. In the definition of root cause, the concept “sets in motion” is a pure metaphor and not a precise term. A causal chain does not “move” in anything other than a metaphorical sense. The causal graphs I print only move when the piece of paper on which they are printed moves. Otherwise, the network stays the way it was printed, thank heavens!

B6. In the definition of root cause, a cause chain is said to “create” a failure event. The word “create” applied to events is not further defined and it is unclear what it should mean.
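Points A1 to A4 and B3 can be made concrete with a few lines of Python. This is only a sketch on an invented causal graph: the node names are hypothetical and come from no real analysis, and the last function is my literal reading of the draft's definition 3.5, not anything the draft itself specifies as a procedure.

```python
from typing import Dict, List, Set

# Hypothetical causal graph: each factor maps to the factors it is a cause of.
CAUSES_OF: Dict[str, List[str]] = {
    "maintenance_deferred": ["sensor_faulty"],
    "sensor_faulty": ["failure"],
    "alarm_disabled": ["operator_unaware"],
    "training_cut": ["operator_unaware"],
    "operator_unaware": ["failure"],
    "failure": [],
}


def direct_causes(graph: Dict[str, List[str]], event: str) -> Set[str]:
    """Factors with an edge straight into the given event."""
    return {node for node, effects in graph.items() if event in effects}


def root_causes(graph: Dict[str, List[str]]) -> Set[str]:
    """Factors with no recorded cause, i.e. where the stopping rule cut in."""
    has_cause = {e for effects in graph.values() for e in effects}
    return set(graph) - has_cause


def could_be_chain(graph: Dict[str, List[str]]) -> bool:
    """Necessary condition for a linear chain: no node has more than one
    cause or more than one effect."""
    in_degree = {node: 0 for node in graph}
    for effects in graph.values():
        for e in effects:
            in_degree[e] += 1
    return (all(len(effects) <= 1 for effects in graph.values())
            and all(d <= 1 for d in in_degree.values()))


def is_contributing_cause_literal(event: str, direct: Set[str]) -> bool:
    """Literal reading of draft definition 3.5: anything that occurred and
    did not directly lead to the failure event."""
    return event not in direct


direct = direct_causes(CAUSES_OF, "failure")
roots = root_causes(CAUSES_OF)
print(sorted(direct), sorted(roots), could_be_chain(CAUSES_OF))
print(is_contributing_cause_literal("my_typing_these_words", direct))
```

Even this tiny graph has two “direct causes” and three “root causes”, and it is not a chain. The last line shows that an event with no connection to the graph at all satisfies the literal definition of “contributing cause”.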

Some of the comments I had made were redacted and supported by the responsible German committee and forwarded to the IEC. The IEC commentary format is very restrictive. It allows one to comment only on individual sentences in the original. This means that overall critique, such as that something is wrong-headed and contrary to the state of the art, such as offered by Andrew Rae and Nancy Leveson in the notes referenced above, cannot be included.

Here is my paraphrase of what of my critique made it into the official commentary (I may not, of course, distribute the original).


In most cases, there is no “causal chain”: there is a (mathematical) graph (informally: a network) of causes.

“Stopping point” is an undefined term. It should be replaced by “stopping rule” which is explicitly defined by the analyst.

There is often no single place where the “problem” can be eliminated, but rather multiple places.

There is mostly no single “root cause”: there are usually multiple causes.

One can see very clearly here the reduction, both in words and in content, from my original critique. The IEC and thereby the proposal originator and drafter(s) of the proposed standard receive only this reduction.

Overall, I understand there was sufficient international support for a standardisation effort on Root Cause Failure Analysis to go ahead. Not all countries which supported the proposal nominated “experts”. Five did. You will search the literature on causal analysis in engineering in vain for any of their names.

There are a number of things wrong with this process. Some of them have been eloquently articulated by others I have referenced above. A further one is the reduction in content of the critique. Another is the obscurity of the process. Yet another is that the “experts” drafting the standard do not include anyone with an international scientific reputation in causal analysis.

There is a set of tropes about engineering standardisation procedures which were proposed by Derek Jones in the thread (see also his later note). I responded to – or, as a colleague called it, demolished – Derek’s points in this note.

Finally, now that we apparently have an international standardisation project for root-causal analysis of engineering failures, I encourage everyone who knows about such things to weigh in with their views. Via the appropriate national committee which, as Derek says, you can maybe manage to find out about with a few well-placed phone calls to some national organisation you think might know. But good luck in fitting your opinions on the IEC comments form. You have seen above what happened to mine.



Probabilistic and Possibilistic Analysis, the Precautionary Principle and EUEs

13 05 2011

Yesterday, Werner U brought our attention, on a closed mailing list of which I have been a member for almost two decades, to a study by John Mueller, a political scientist at Ohio State University, and Mark Stewart, a civil engineer at the University of Newcastle in New South Wales, of the costs and potential benefits of actions taken by the US Department of Homeland Security (DHS) to protect against supposed-possible terrorist attacks in the US. I found the article eye-opening.

Mueller and Stewart argue that, despite recommendations to do so by entities such as the US Government Accountability Office (GAO), and observations of its lack by the National Research Council, which was tasked to produce a study in 2010, the DHS has not effectively used any cost/benefit analysis on its anti-terrorism techniques. The authors performed a search of DHS documents, news reports and testimony, and the result was (p4) that

we have been able to find only one published reference to a numerical estimate of risk reduction after an extensive search of the agency’s reports and documents.

They claim that the expenditures on anti-terrorist measures are (p2) “one trillion [that is, 10^(12)] dollars and counting.”

They put what they essentially argue is a disadvantageous distortion of priorities down to “probability neglect” and a focus on worst-case scenarios. That is, the DHS aims to protect against worst-case scenarios while ignoring the likelihood of them being realised. Mueller and Stewart point out that the annual number of people who have died world-wide in attacks by Islamic-fundamentalist extremists since 9/11 (2001, almost ten years ago) can be estimated at between 200 and 300 (p 12), which is less than the number of people who drown in bathtubs each year in the USA (p 12). If you are a US taxpayer, there seems to be a prima facie case for looking into this issue somewhat harder (or spending more money on your bathtub, whichever…).

Amusingly, they point out some “innovative” methods used by the DHS for assessing risk, such as adding the probability to the severity. De Moivre, who first defined risk in his seminal work De Mensura Sortis, which celebrates its 300th anniversary this year, being a good businessman, would likely have taken the opportunity to offer the DHS a tutorial :-)
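A small worked sketch shows why adding is not a harmless variation. De Moivre's measure of risk is the product of the probability of a loss and the size of the loss, that is, an expected loss. The two hazards below and their numbers are invented purely for illustration; they are not taken from Mueller and Stewart or from any DHS document.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical hazards: (probability per year, loss in millions of dollars).
HAZARDS: Dict[str, Tuple[float, float]] = {
    "frequent_minor": (0.5, 10.0),
    "rare_major": (0.001, 100.0),
}


def expected_loss(probability: float, loss: float) -> float:
    """Risk as the product of probability and loss (De Moivre's measure)."""
    return probability * loss


def added_score(probability: float, loss: float) -> float:
    """The 'innovative' variant: probability added to severity."""
    return probability + loss


def ranking(score: Callable[[float, float], float]) -> List[str]:
    """Hazard names ordered from highest score to lowest."""
    return sorted(HAZARDS, key=lambda name: score(*HAZARDS[name]), reverse=True)


by_product = ranking(expected_loss)
by_sum = ranking(added_score)
print(by_product)
print(by_sum)
```

With these numbers the product ranks the frequent minor hazard first (an expected loss of 5 against 0.1 per year), while the sum ranks the rare major hazard first, because a probability between 0 and 1 hardly changes a severity figure of 10 or 100. The sum also depends on the units chosen for severity, which the product's ranking does not.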

However, Mueller and Stewart also lambast “possibilistic thinking”, or “worst-case thinking”, citing in particular Bruce Schneier’s article from exactly a year ago, Worst-Case Thinking. I am an avid reader of Bruce’s opinions, but I don’t agree with a generic denigration of “worst-case thinking”.

In correspondence with Lee Clarke, who has a book analysing and advocating worst-case thinking as an analysis technique for systems which are prone to extreme unsafe events (EUEs) as I have previously called them here, I have come to understand the method Clarke advocates as a form of hazard analysis – enumerating the hazards by looking at potential outcomes and analysing the possible events which lead to them. I am a great fan of hazard analysis of this sort, as readers of our recent technical work will know. Indeed, it is advocated by almost all the international engineering standards with which I am familiar (I work on a standardisation committee for functional safety) as a necessary technique to be performed during initial steps in development (as well, in many standards, as continuing throughout system development). I have also been impressed by Charles Perrow’s prediction of the exact failure sequence at Fukushima Dai-ichi in his 2007 book, as I have mentioned before, and have wondered why it takes a sociologist to perform – and publish – the kind of hazard analysis that should be performed by the plant engineers.

But there are obvious issues with only performing worst-case hazard analyses and not attempting at some point to assess their likelihood, as Mueller and Stewart show us. I don’t mean to solve this tension here, but just to draw attention to this wider, apparently controversial and important issue of the effectiveness and limits of techniques such as hazard analysis and risk analysis. One major argument for not using estimates of likelihood is that, for EUEs, they are rarely available. One major argument for using some estimate of likelihood, even if crude, is Mueller and Stewart’s.

One particular problem does seem to be invalid argumentation that is commonly used when talking about severe events. Mueller and Stewart point out the “adding probabilities” argument. Another one is as follows. One characterises likelihood as “low”, turns “low-likelihood, high-severity events” into a pseudo-category, and thereby enables an argument for declining to deal with obvious risks that could be mitigated and, according to people such as Nancy Leveson and myself, should be mitigated. Nancy has noted in private conversation that she has observed this argument in practice in one particular process industry.

Robert Dorsett has observed in private that the DHS can seem to be motivated by the Precautionary Principle: if one sees a potential for harm, best to take actual measures to prevent that potential from being realised. I have considered the Precautionary Principle as well as other principles in an essay I wrote some ten years ago on a practical example of everyday risk analysis by the HSE in the UK. Cass Sunstein believes the principle is what philosophers of action call a “practical paradox”, something that is impossible to implement as written. He writes on pp102-105 of his 2002 book Risk and Reason that

All over the world, there is increasing interest in a simple idea for the regulation of risk: In the case of doubt, follow the precautionary principle……. There is some important truth in [it]. Sometimes it is much better to be safe than sorry… But there is a larger problem. The precautionary principle can provide guidance only if we blinker ourselves and look at a subset of the harms involved. In real-world controversies, a failure to regulate will run afoul of the precautionary principle because potential risks are involved. But regulation itself will cause potential risks and hence run afoul of the precautionary principle too; and the same is true for every step in between. Hence, the precautionary principle, taken for all that it is worth, is literally paralysing. It bans every imaginable step, including inaction itself.

Given the general inclination which Robert notes to invoke the Precautionary Principle at every turn, it seems to me wise to keep Sunstein’s argument ready to hand. The Precautionary Principle is a blunt instrument, and it is wise to use it more subtly. Mueller and Stewart, following Schneier, could be taken to argue that worst-case, possibilistic, thinking, is also a blunt instrument, and it is thereby equally wise to take Clarke and Perrow’s suggestions to use it more subtly.



11th Bieleschweig Workshop: The Fukushima Accident and Systems Prone to EUE

22 04 2011

Readers might like to know about the 11th Bieleschweig Workshop on System Engineering, which will take place in Bielefeld in the Senate Room of the University on 3rd-4th August, 2011. The topic will be Interacting with Extreme Risk: The Fukushima Accident. We organise the Bieleschweig Workshops.

I think that there exist the foundations of a consensus amongst engineers and social scientists as well as other observers on how to deal as a society with the use of technologies which carry with them the possibility of extreme unsafe events (EUE), henceforth below “systems prone to EUE”. The goal of the 11th Bieleschweig Workshop will be to attempt to formulate such a consensus, with special focus on the Fukushima accident as an example. The 11th Workshop will follow previous Bieleschweig formats in interspersing formal talks with lots of time for discussion, and including sessions for discussion around position papers on selected topics (“panel sessions”).

For an example of what I mean by consensus, see The Economist’s lead article from its April 23, 2011 edition, In Place of Safety Nets: Lessons from Deepwater Horizon and Fukushima in which three rules are proposed for “mak[ing] it easier to cope with the failures of such brittle technologies“.

The first rule is “the firms involved have to accept that ….. disasters will happen“. This is closely related to the proposal of Lee Clarke (below) that we should use “possibilistic thinking” in assessing systems prone to EUE (see this interview with Lee).

The second rule is “to develop at least some broadly applicable technologies for repair and remediation before they are needed….Fukushima and other nuclear plants seem oddly lacking in robotic access to places where workers cannot or should not go.“. Just so! I and my colleagues at CITEC, who inter alia develop robots, heartily agree! For example, colleagues Prof. em. Dr. Holk Cruse, Dr. Axel Schneider, and Prof. Dr. Volker Dürr, of CITEC and the Department of Biological Cybernetics have developed biomimetic systems, such as the robot Tarry (in German, sorry! From left, that’s Axel and Volker in the picture) which can negotiate uneven surfaces such as rubble piles, using mechatronic systems derived from stick insects (they also have a great stick insect colony, which is a lot of fun if you like playing with critters!).

The third rule is “situational awareness is invaluable“. The Economist means that one needs adequate sensing of fundamental parameters, even in extreme failure situations, and points out the apparent lack of such at Deepwater Horizon and Fukushima. Whatever the reasons for this lack, and some are technical, I think everybody I know who works with or on any safety-critical systems such as airplanes, trains, power plants and chemical plants agrees with the point that you need to assure these data somehow!

The Economist continues that “one solution to the problem of ever-growing requirements is “safety-case” regulation“, which seems very similar to my proposal for requiring a continuously-maintained safety case for systems prone to EUE in my post of 27 March 2011, Fukushima, The Tsunami Hazard, and Engineering Practice.

So, are you persuaded that there might be consensus? (At least amongst a small group of English journalists and a few sociologists and system-safety experts :-) )

If so, and you deal professionally with safety and risk, do come join us in Bielefeld August 3-4 2011! Please email me if you would like to come.

Confirmed participants are the sociologists of technology Charles Perrow (Yale), Lee Clarke (Rutgers) and John Downer (Stanford), as well as the system-safety engineers Nancy Leveson (MIT) and Martyn Thomas (Thomas Associates), and little old me (Uni Bielefeld and Causalis Limited). Romney Duffey (AECL) and Robin Bloomfield (City Uni and Adelard) have expressed their intention to attend.

Professor Perrow wrote Normal Accidents (Basic Books, 1984, revised edition Princeton University Press, 1999), in which the Normal Accident Theory was proposed, introducing what is now called by many in system safety a System Accident. He also wrote The Next Catastrophe, (Princeton University Press, 2007, revised 2011), in which the exact failure mode of the Fukushima Dai-ichi reactors was foreseen (a natural event taking out primary power, followed by “flooding” taking out secondary power), as I pointed out in my post of 14 April 2011 on memes.

Professor Clarke wrote Worst Cases (University of Chicago Press, 2005) which proposed that the “probabilistic thinking” associated with risk analysis was insufficient to enable us to make wise decisions about use of systems prone to EUE, and that considering the EUE and the consequences without attempting to assess probabilities of it happening, which he calls “possibilistic thinking”, is a more appropriate tool for making socially-responsible decisions about systems prone to EUE.

Dr. Downer is Zukerman Fellow at the Freeman Spogli Institute for International Studies at Stanford University, and works on ultra-reliability (in particular civil transport aircraft certification and engineering) and systems prone to EUE.

Professor Leveson is a founder of the discipline of software safety engineering, and is a major contributor to system safety engineering with her methods STAMP for accident and catastrophe analysis and STPA for Hazard Analysis. She wrote the fundamental reference book Safeware (Addison-Wesley, 1995) (also see her description) and has written a new book Engineering a Safer World (MIT Press, to appear 2011). She consulted for the Columbia Accident Investigation Board looking into the loss of the Space Shuttle Columbia, was a member of the Baker Panel investigating the Texas City Oil Refinery explosion and is expert advisor to the Presidential Oil Spill Commission (Deepwater Horizon).

Professor Thomas is Director and Principal Consultant of Martyn Thomas Associates Limited and was founder of Praxis, now Altran Praxis, developers of the SPARK toolsuite and set of techniques for the development of demonstrably highly-reliable software. He co-wrote the US National Academies report Software for Dependable Systems: Sufficient Evidence? (National Academies Press, 2007), and was most recently Chairman of the UK Royal Academy of Engineering GNSS working group, for which he co-wrote the Academy report Global Navigation Space Systems: Reliance and Vulnerabilities (Royal Academy of Engineering, 2011), which was released the day before the Tohoku megaquake.

I and my group are here in Bielefeld, where we developed the causal analysis method Why-Because Analysis (WBA) (see also Causalis Limited, Publications for some more examples) and are developing the hazard analysis method Ontological Hazard Analysis (OHA) (see for example this paper or Bernd Sieker’s PhD thesis (in German)). I am a Director of the system-safety consulting company Causalis Limited whose clients include major legal firms and insurance companies in civil aviation, as well as individuals in criminal and civil cases related to accidents. I am a member of the German standardisation committee for functional safety of E/E/PE systems, DKE GK 914, as well as of the subcommittee DKE 914.0.3 “Safe Software” and the IEC “Maintenance Teams” for the international standard IEC 61508. I chair various related DKE advisory committees.

Romney Duffey, Robin Bloomfield and John Knight have indicated their intention to attend. We hope also to have participation from experts in nuclear power from Japan, Germany and other countries.

Dr. Romney Duffey is Principal Scientist of Atomic Energy of Canada Limited and coauthor with John Saull of the book Know the Risk: Learning from Errors and Accidents (Butterworth-Heinemann, 2003).

Professor Robin Bloomfield is Director of the Centre for Software Reliability at City University, London and founder and Director of the safety consultancy Adelard, whose clients include the UK nuclear industry.

Professor John Knight leads the Dependability Research Group in the Computer Science Department at the University of Virginia, and is a Principal of Dependable Computing Incorporated which specialises in computing applications which have extreme consequences of failure. He leads the Helix project to design and build self-regenerative software architectures resilient to attack, for use in critical infrastructures such as energy.

I take this opportunity to thank heartily our confirmed sponsors CITEC, the Excellence Cluster for Cognitive Interaction Technology, the Faculty of Technology, both of the University of Bielefeld, the Centre for Software Reliability at the University of Newcastle upon Tyne, and Causalis Limited.

I would like to dedicate the Workshop to Professor Harold Lewis, amiable and entertaining correspondent of mine on safety and aviation safety for many years, coauthor of the fundamental document known as the “Lewis Report” on the safety of US nuclear power plants, NUREG/CR-0400, also IEEE Trans. Nuclear Science 26(5), 1979. He wrote Technological Risk (Norton, 1990, winner of the Science Writing Award 1991), and Why Flip a Coin?: The Art and Science of Good Decisions (John Wiley, 1999). Hal is unable to attend.

Folks (other than those above), please let me know by e-mail if you wish to attend. For planning purposes, please say whether you might like to present a paper or a position paper.

Peter Bernard Ladkin



The Epidemiology of Memes and its Effect upon Safety

14 04 2011

Richard Dawkins has the notion of memes. They are, crudely speaking, thoughts or ideas or ways of thinking or cultural traits, that spread through society. The idea occurs in his well-known book The Selfish Gene, published 35 years ago this year. I am interested in – and often frustrated by – the ways that ideas, particularly about safety or the lack of it, are spread or not spread. Maybe I should call it the Epidemiology of Intellectual Memes and apply for a grant.

One is the issue about ensuring continual cooling in the damaged Fukushima Dai-ichi reactors, as I noted in a previous blog post. I cite, again, Charles Perrow from his 2007 book, The Next Catastrophe, p134:


a hurricane…could take out the power, and the storm could easily render the emergency generators inoperative as well

and p173:


..no storms or floods have as yet disabled a plant’s external power supply and its backup power generators

He is pointing out the specific vulnerability and the mechanism, four years before it happened. I don’t think that can be too strongly emphasised. (Recall also more local occurrences of concern, also referenced in earlier blog posts: tsunami expert Yukinobu Okamura’s experience at NISA in 2009 and The New York Times’s interviews with tsunami experts.)

So what is going on here? Why are these memes not getting through? Why and how are they blocked?

I think it is important to understand how, in greater depth, because the success of measures to improve safety depends on the success of measures to improve thinking about safety, and if we don’t understand how accurate thinking about safety (such as Perrow’s and Okamura’s) is blocked, sometimes passively (Perrow, I take it) and sometimes actively (TEPCO’s reported response to Okamura) then we will not be able to judge whether the measures will translate into appropriate action.

Allow me a couple more safety-related but nuclear-power-unrelated examples of weird meme behavior. I shall come back to the point at the end.

Martyn Thomas and I experienced this over the years with measures for SW safety as embodied in the functional safety standard for electrical, electronic and programmable electronic systems. Certain ideas – and I mean here also some scientific results that have appeared in the literature and been widely cited – just don’t seem to get through. We are taking different approaches to the problem; the German national standardisation committee is also concerned that SW be taken care of appropriately, and so I was invited to join and did, whereas Martyn is operating outside of the standardisation process, which he considers inappropriately ineffective.

I think we simply don’t understand the meme-transmission process around critical memes.

Here is another example. The British parliament is currently considering a bill which introduces specific punishment for cyclists who kill pedestrians while cycling. It made the BBC and is still on the home page of the BBC News WWW site at the time of writing, in this article. Notice that the bill, if it becomes law, will have at most one application every few years. Notice also that there are many laws already on the books to deal with unlawful killing, by bicyclists and others, and as far as I know such events are pretty thoroughly prosecuted in the UK. All this effort, then, is being put into a bill whose associated law will almost never be applied, and which does not fill any ostensible gap in existing law. How does this get to be? How does it get to be supported by such a person as Stephen Glaister, Professor of Transport and Infrastructure at Imperial College London, by everybody’s tables one of the top twenty universities in the world?

The BBC quotes him as saying

Subjecting everyone who uses the public highway to the same laws might actually forge better relationships between us all and erode the idea held by many that those who travel by an alternative mode routinely make up rules of the road to suit themselves.

First, observe he is mostly concerned about a meme, not about supposed dangers posed by cyclists.

Second, everyone who lives in Britain is subject to the same laws; that is a tautology. He may mean that they are differentially enforced, and he would be right. But the solution to differential enforcement is – obviously, I should have thought – a change in enforcement policy, not a new law (which may or may not be enforced, fairly or differentially).

The BBC goes on to observe that “some bike-users reject the idea that anecdote and mutual suspicion should drive policy.” This bike-user/driver/bus-user certainly does.

Back to the point. I advocated in an earlier blog post that there should be a public, maintained safety case for every piece of critical infrastructure. Martyn has been advocating something similar for years. Nancy Leveson suggested a hazard analysis rather than a full safety case (I read this as: leave off the risk analysis). I take it this would be something similar to Lee Clarke’s Possibilistic Analysis (confirmed in private discussion), so it looks prima facie as though there might be some interdisciplinary agreement on such a measure. But Nancy (op. cit.) and others privately have also pointed out it might not be workable because industry would object to the potential publication of their intellectual property, since details infringing intellectual property rights are included in most safety cases (although I do know of exceptions, such as The Pre-Implementation Safety Case for RVSM in European Airspace, which Eurocontrol put on the WWW and I claimed was flawed).

I think they are right that there would be industrial opposition, and about why. But it is not inconceivable that methods might be found to secure the intellectual property while at the same time reaping the benefits of public discussion of the safety/hazard case.

However, the benefits I was anticipating are based on the assumption that once something is public, it gets transmitted widely and there becomes pressure to act. That would only be true if the resulting memes are not blocked. Charles Perrow’s book is public: meme blocked. Okamura’s observation reached the place to which it was addressed: meme blocked. So a public safety case will not by itself necessarily bring benefit; one needs concomitant measures to ensure that resulting memes are not blocked, and I don’t know what those should be.



Fukushima Dai-ichi Accident: Sociologist Needed!

31 03 2011

I have been working this year with sociologists, in a research group composed largely of visitors to Bielefeld’s residential research institute ZiF. The group is working on Communicating Disaster. Then one happened – an enormous natural event triggered a disaster. Let me look at part of it, namely the system-safety disaster at the Fukushima Dai-ichi (Number 1) nuclear power station.

A nuclear power plant is what I call a teleological engineered system. Like a car, or an airplane, it has a purpose, and it is designed by one (or a few) legal actors to fulfil that purpose. As a system, it distinguishes itself from, say, a town, which is a collection of houses, shops, workshops and offices, mostly designed and constructed piecewise, for divergent purposes, indeed purposes which are often contrary, by many actors. Fukushima Dai-ichi has people swarming all over it, designing, specifying, building, operating, maintaining, and filling out all the paperwork which somehow gives us a comfy feeling of organisation aiming to fulfil the purpose. But no longer. Here it is, not producing two watts of what it is supposed to produce, but instead injuring people, threatening to distribute large amounts of its highly toxic component substances above ground, below ground, and in the water. What went wrong?

The technology behind nuclear fission power is exothermic. The plant requires active cooling at all times, even when not operating. If it is not cooled then an accident is inevitable. Cooling requires power. When the plant is working, that power may come from the plant itself. When it is shut down, it must come from somewhere else. It follows that power supply must be unfailingly reliable in order to avoid an accident.

Primary power comes from outside. The existence of a secondary power system tells us that someone foresaw circumstances in which primary power would be interrupted. (They were right! An earthquake cut primary power; the live reactors, Units 1-3 of 6, shut down as planned.) Can secondary power be interrupted? If so, we need tertiary power… and so on. The tertiary power is trivial – batteries with a life of 8 hours. It follows no one thought secondary power could be interrupted for longer than that. But it was! It was taken out.

Everything else about this disaster follows from that one event: Secondary power was taken out. How? It was in a “basement”, which was flooded by the tsunami. Let us focus on the tsunami for a moment. At time of construction, it seems no one evaluated the tsunami hazard (Kopflos in die Katastrophe, Marlene Weiss, Süddeutsche Zeitung 19-10.03.2011). Later they did, but “no one thought of a tsunami that high!”. Not so – a tsunami expert brought it up at a meeting at the regulator, NISA, in 2009. He recounts that his concern was – in my words, not his – peremptorily dismissed (Japanese nuclear plant’s safety experts brushed off risk of tsunami, David Nakamura and Chico Harlan, Washington Post, 23.03.2011). Tsunami experts have expressed their astonishment at the lack of apparent tsunami awareness at the regulator or plant operator (Japanese Rules for New Plants Relied on Old Science, Norimitsu Onishi and James Glanz, New York Times, 27.03.2011). It is important to keep in mind that flooding by a tsunami is just one way the secondary power can be taken out, not the only way.

Engineers designing, building and operating safety-critical systems are required by standards to perform a hazard analysis (HazAn). A hazard is, roughly speaking, a precursor of an accident, so you have to know first what the accidents are – what the events are which constitute accidents. It is pretty clear to everyone in the nuclear industry that meltdown is an accident and it is equally clear that lack of cooling leads directly to meltdown. (It’s not the only one: you have to keep the spent fuel pools cooled, else they evaporate and burn. It’s clear that that constitutes an accident event also.) So losing all cooling for a long enough period of time is an event that leads inevitably to an accident. Your secondary power just cannot be taken out for longish periods of time when your primary power is not available. There, that’s (part of) a HazAn, with the derived safety requirement. HazAn is no more, and certainly no less, than this kind of reasoning, but you must systematically cover everything.

The next formal step is to ask about mitigation. What can happen to secondary power to take it out? It can fail because it is poorly maintained (mitigation: maintain it properly. This is a known quantity). It can fail because on-demand systems often fail on demand (mitigation: run it continuously, at low power, so you know it runs when it is asked to cut in). It can fail because a large airplane crashes into it (mitigation: design the building accordingly. This was a consideration for English gas-cooled nuclear plants in the early 1970s). It can fail because of a bomb (mitigation: good security at the gates and perimeters). It can fail because it’s flooded. Before someone says “thousand-year tsunami”, recall that there are two and a half million gallons of water perched in the air in the spent-fuel pools of the six reactors, which pools just might be breached during an earthquake – but weren’t, as it turns out. You should think of that, even if a tsunami doesn’t occur to you. (Mitigation: design the secondary power to function while submerged. They do it in submarines; this is a known quantity.)
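The cause-and-mitigation reasoning above can be written down as a small table and checked mechanically. What follows is only a minimal sketch: the `Cause` record, the `open_causes` helper and the `implemented` flags are my own illustrative assumptions, not a finding about the plant and not a format that any standard prescribes.

```python
from dataclasses import dataclass


@dataclass
class Cause:
    description: str
    mitigation: str
    implemented: bool  # illustrative assumption, not a finding about the plant


def open_causes(causes):
    """Return the causes whose mitigation has not been implemented."""
    return [c for c in causes if not c.implemented]


# Hazard: secondary power lost while primary power is unavailable.
secondary_power_loss = [
    Cause("poor maintenance", "maintain it properly", True),
    Cause("fails on demand", "run continuously at low power", True),
    Cause("aircraft impact", "design the building accordingly", True),
    Cause("bomb", "security at gates and perimeter", True),
    Cause("flooding", "design to function while submerged", False),
]

for cause in open_causes(secondary_power_loss):
    print(f"UNMITIGATED: {cause.description} -> {cause.mitigation}")
```

The value of such a list lies in its being systematic: a cause with no implemented mitigation stands out on the page, where somebody has to answer for it.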

Maybe such HazAns weren’t state of the practice when the plant was built decades ago? HazAns are also required by standards during operations, which were continuing up to March, 2011.

“But no one can think of everything!” That is, though, the purpose of a HazAn. You may make a mistake, of course, in your HazAn. But the reasoning above is routine, one thing following from another; I would require from my students no less.

Now to the point of this shaggy dog story. How did the builders, owners and operators of this plant miss all this for forty years? To answer that question, you don’t need an engineer, you need a sociologist! There, I said it!

Do you need to answer it? Most certainly you do. It helps you to find other plants, other power companies, where similar things could have happened and could be happening, so we can step in before something equally extreme happens.

You also need somebody to tell you what the consequences of such an extreme event are. Engineers work on experience. Commercial jet transport airplanes are thought of, justifiably, as maybe the most highly reliable complex artifacts ever built. Wings used to fall off (say, from Wellingtons, seventy years ago). They don’t any more (or only as a consequence of some other unrecoverable event). Experience makes the difference: we have five to twenty fatal accidents with commercial jet airliners per year to learn from. Compare with nuclear power: we have had three, maybe four, extreme events in fifty years (Windscale, maybe Three Mile Island, Chernobyl, Fukushima 1). Who can tell us what the consequences are? Two engineering colleagues said: Chernobyl, 60+ fatal. Some medical researchers say: 6000+ fatal. Greenpeace says: 200,000+ fatal. If the weather had been different, maybe tens of thousands more in Kiev. When the serious estimates of fatalities (alone! Then there is the damage to the environment to consider) differ by more than three orders of magnitude, as here, then the answer seems to be that no one can tell us reliably. Or even what the possible consequences are. The engineering risk calculus of probability times severity doesn’t work, either. It gives one answer before Chernobyl, another answer after Chernobyl, and yet another answer after Fukushima. A decision aid is useless if it gives you different answers each time you have an unwanted event. An engineer can’t tell you.
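To see how little the probability-times-severity calculus tells us here, one can feed the three fatality figures quoted above into it. This is a sketch only: the probability value is a made-up placeholder of mine, chosen purely for illustration, and the function names are hypothetical.

```python
import math

# Fatality estimates for Chernobyl as quoted in the text (lower bounds).
severity_estimates = {
    "engineering colleagues": 60,
    "medical researchers": 6_000,
    "Greenpeace": 200_000,
}

# Hypothetical probability of an extreme event per reactor-year.
# This number is an assumption for illustration only; it is not from the text.
ASSUMED_PROBABILITY = 1e-5


def risk(probability, severity):
    """Classical engineering risk: probability times severity."""
    return probability * severity


def spread_in_orders_of_magnitude(values):
    """log10 of the ratio between the largest and the smallest value."""
    return math.log10(max(values) / min(values))


risks = {who: risk(ASSUMED_PROBABILITY, s) for who, s in severity_estimates.items()}
spread = spread_in_orders_of_magnitude(list(risks.values()))

for who, r in risks.items():
    print(f"{who}: expected fatalities per reactor-year = {r:.4f}")
print(f"spread between estimates: {spread:.1f} orders of magnitude")
```

Whatever probability one assumes, it cancels in the ratio: the spread in the computed risk is exactly the spread in the severity estimates, so the calculus cannot be more trustworthy than the least certain of its inputs.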

Can a sociologist tell us? Maybe not. Then who?

Acknowledgement

I thank Lee Clarke, who has a note at nj.com, Charles Perrow, who pointed out the susceptibility of the design to flooding secondary power, Bernd Sieker, who as usual delved into the physical details of everything, and Werner U., who has been scouring the press, and the participants of the ZiF research group Communicating Disaster for useful comments on the first version of this note.



Fukushima, the Tsunami Hazard, and Engineering Practice

27 03 2011

The conclusion first, as well as at the end. For safety-critical infrastructure, there should be required a continuously-maintained, public safety case. Members of the public may at any time look it up. A wise government will make provision for commentary and rework where necessary.

I am well aware that this sets the importance of a safety case differently from that suggested by Charles Haddon-Cave in his inquiry into the RAF Nimrod accident. This is a different case. The UK MoD is a closed organisation and I am talking about critical public infrastructure.

I am running a private discussion group on the Fukushima accident. One of the main questions, raised by sociologist Charles Perrow on the Monday after it happened, is why on earth was backup power put in a place at which it could be incapacitated by a common-cause event (Perrow phrased it somewhat differently). He suggested it was a design accident, not a “normal accident” in his technical use of that phrase.

I thought there had been an obvious failure of hazard analysis (HazAn), which is a required step (rather, series of steps) in development and deployment of most safety-critical systems. I thought the idea of a public safety case was a useful suggestion even then. It was partly based on news at the time that tsunami researchers had recently discovered evidence of a comparable historical tsunami in the area some 1200 years ago.

But it turns out to be worse than that.

On Wednesday, the Washington Post contained reports of comments at a NISA meeting in 2009 by a tsunami expert, Yukinobu Okamura, who brought up the issue of tsunamis, and, reading between the lines, was peremptorily dismissed.

But it turns out to be much worse than that.

The NYT contains the story today.

* The word “tsunami” did not appear in government guidelines until 2006.

* People have been saying “well, it was a big quake!”, but it turns out one of magnitude 7.5 would have sufficed to breach the high-water defences at the plant.

* Recommendations in 2002 led TEPCO to raise its “maximum projected tsunami” to 17.7-18.7 feet, which was higher than the 13-ft bluff on which the plant is built. Yet all they did was to raise an electrical pump 8 inches.

Here is the text

Japanese government and utility officials have …. said that engineers could never have anticipated the magnitude 9.0 earthquake — by far the largest in Japanese history — that …. generated the huge tsunami. Even so, seismologists and tsunami experts say that according to readily available data, an earthquake with a magnitude as low as 7.5 …. could have created a tsunami large enough to top the bluff at Fukushima.

After an advisory group issued nonbinding recommendations in 2002, Tokyo Electric Power Company, the plant owner and Japan’s biggest utility, raised its maximum projected tsunami at Fukushima Daiichi to between 17.7 and 18.7 feet — considerably higher than the 13-foot-high bluff. Yet the company appeared to respond only by raising the level of an electric pump near the coast by 8 inches, presumably to protect it from high water, regulators said.
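The arithmetic behind those figures is worth spelling out. A minimal sketch, using only the numbers quoted above; the function name is my own, and since the original elevation of the pump is not given, this compares only the size of the shortfall with the size of the raise.

```python
BLUFF_FT = 13.0  # height of the bluff on which the plant is built

# Figures as quoted from the NYT article.
projected_tsunami_ft = (17.7, 18.7)  # TEPCO's revised maximum projection
pump_raise_ft = 8 / 12               # the pump was raised by 8 inches


def overtopping_ft(wave_ft, barrier_ft=BLUFF_FT):
    """Height by which a wave exceeds the barrier (0 if it does not)."""
    return max(0.0, wave_ft - barrier_ft)


low, high = (overtopping_ft(h) for h in projected_tsunami_ft)

print(f"projected overtopping of the bluff: {low:.1f} to {high:.1f} ft")
print(f"pump raised by: {pump_raise_ft:.2f} ft")
print(f"raise at least as large as the low-end shortfall: {pump_raise_ft >= low}")
```

On TEPCO's own revised projection the wave tops the bluff by several feet, while the one reported countermeasure is measured in inches.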

Then there is some further wonderful stuff on how hazards were thought about, in the following quote.


“We can only work on precedent, and there was no precedent,” said Tsuneo Futami, a former Tokyo Electric nuclear engineer who was the director of Fukushima Daiichi in the late 1990s. “When I headed the plant, the thought of a tsunami never crossed my mind.”

1. If one is following safety-engineering practice, one is supposed to work on a HazAn, not on “precedent”, whatever that might be.

2. Tsunamis never thought of? How about performing a HazAn? Then maybe there is somebody in the room, say by the name of Yukinobu Okamura, who does think of them.

3. And when the question is raised, finally in 2009, why is a dismissive reply acceptable? Is that the way continuous hazard assessment is performed in Japan? When they perform an FMEA, do they just look at the system and not at the system environment? Let me recommend our course on how to perform HazAns. It is System Safety and Security 2 in our university catalog and we give it every year.

The NYT article makes it clear that TEPCO and NISA were well aware that they were not always sufficiently prepared.

…. For decades …..Japanese officialdom and even parts of its engineering establishment clung to older scientific precepts for protecting nuclear plants, relying heavily on records of earthquakes and tsunamis, and failing to make use of advances in seismology and risk assessment since the 1970s.

For some experts, the underestimate of the tsunami threat at Fukushima is frustratingly reminiscent of the earthquake — this time with no tsunami — in July 2007 that struck Kashiwazaki, a Tokyo Electric nuclear plant on Japan’s western coast. The ground at Kashiwazaki shook as much as two and a half times the maximum intensity envisioned in the plant’s design, prompting upgrades at the plant.

“They had years to prepare at that point, after Kashiwazaki, and I am seeing the same thing at Fukushima,” said Peter Yanev, an expert in seismic risk assessment based in California, who has studied Fukushima for the United States Nuclear Regulatory Commission and the Energy Department.

TEPCO and NISA knew in 2007 that their hazard criteria needed review. Presumably this was the reason for the meeting that Okamura attended at which his question was trivially rebuffed.

And now for what was known about tsunamis by the scientific establishment. And what TEPCO did.


When Japanese engineers began designing their first nuclear power plants more than four decades ago, they turned to the past for clues on how to protect their investment in the energy of the future. Official archives, some centuries old, contained information on how tsunamis had flooded coastal villages, allowing engineers to surmise their height.

So seawalls were erected higher than the highest tsunamis on record. At Fukushima Daiichi, Japan’s fourth oldest nuclear plant, officials at Tokyo Electric used a contemporary tsunami — a 10.5-foot-high wave caused by a 9.5-magnitude earthquake in Chile in 1960 — as a reference point. The 13-foot-high cliff on which the plant was built would serve as a natural seawall, according to Masaru Kobayashi, an expert on quake resistance at the Nuclear and Industrial Safety Agency, Japan’s nuclear regulator.

Eighteen-foot-high offshore breakwaters were built as part of the company’s anti-tsunami strategy, said Jun Oshima, a spokesman for Tokyo Electric. But regulators said the breakwaters — mainly intended to shelter boats — offered some resistance against typhoons, but not tsunamis, Mr. Kobayashi said.

……….

Two independent draft research papers by leading tsunami experts — Eric Geist of the United States Geological Survey and Costas Synolakis, a professor of civil engineering at the University of Southern California — indicate that earthquakes of a magnitude down to about 7.5 can create tsunamis large enough to go over the 13-foot bluff protecting the Fukushima plant.

Mr. Synolakis called Japan’s underestimation of the tsunami risk a “cascade of stupid errors that led to the disaster” and said that relevant data was virtually impossible to overlook by anyone in the field.

………………

…… even through the narrow lens of recorded tsunamis, the potential for easily overtopping the anti-tsunami safeguards at Fukushima should have been recognized. In 1993 a magnitude 7.8 quake produced tsunamis with heights greater than 30 feet off Japan’s western coast, spreading wide devastation, according to scientific studies and reports at the time.

On the hard-hit island of Okushiri, “most of the populated areas worst hit by the tsunami were bounded by tsunami walls” as high as 15 feet, according to a report written by Mr. Yanev. That made the walls a foot or two higher than Fukushima’s bluff.

But in a harbinger of what would happen 18 years later, the walls on Okushiri, Mr. Yanev, the expert in seismic risk assessment, wrote, “may have moderated the overall tsunami effects but were ineffective for higher waves.”

And even the distant past was yielding new information that could have served as fresh warnings.

Two decades after Fukushima Daiichi came online, researchers poring through old records estimated that a quake known as Jogan had actually produced a tsunami that reached nearly one mile inland in an area just north of the plant. That tsunami struck in 869.

To my mind, this catalog of astonishing engineering practice makes the case for a continuously-maintained, public safety case for safety-critical infrastructure-components to be overwhelming.

There were lots of people around who knew about tsunamis, and were prepared to say so. Had TEPCO been required publicly to justify any countermeasures it had implemented, then I imagine the inadequacy of the case would have been apparent to any high-school student who decided to look at it for her public affairs class, let alone geologists, hydrologists, or other engineers.

PBL