Saying the Wrong Thing

28 03 2013

The Guardian yesterday wrote an encomium to the UK government’s Chief Scientific Advisor Prof. Sir John Beddington (I hope they don’t mind that I quote in full):

Politics may not be the enemy of scientific method, but they are hardly intimate friends. Science inches along by experiment, evidence and testing (and retesting); politics is often about bold moves executed on personal judgment. So the chief scientific adviser to the government has his or her work cut out. But John Beddington, who has held the post since 2008 and retires this month, has trodden a thin line with grace. Three crises broke on his watch – the Icelandic volcano eruptions, Fukushima and ash dieback disease – and in each he showed a useful caution: compare the political hysteria over Fukushima in Germany with the calm that prevailed here. Mr Beddington has also been an advocate for science, by spearheading the push to install a chief scientist in each Whitehall department. And in raising the alarm about “a perfect storm” of rising population, falling energy resources and food shortages, he did the right and brave thing.

Concerning what he said on Fukushima, I wrote to the ProcEng list on 16.03.2011:

………. (BBC tweet at 1431): “The UK Government’s Chief Scientific Officer, Prof John Beddington, has sought to allay fears of radiation exposure. He told a press conference at the UK embassy in Tokyo: “What I would really re-emphasise is that this is very problematic for the area and the immediate vicinity and one has to have concerns for the people working there. Beyond that 20 or 30 kilometres, it’s really not an issue for health,” he says. The full and very interesting transcript is available on the embassy’s website.”
The key, for those not familiar with British modes of writing, lies in the phrase “very interesting”. I infer that the BBC thinks Beddington['s comment is contentious]. …..

The Guardian cites three well-known events. The Icelandic volcano eruption and the ash-dieback event pose/posed no threat to human life and very little to general human well-being as broadly construed. The British air traffic service provider reaction to the Icelandic volcano eruptions was exemplary, in particular in the face of the engineering uncertainty and the pressure from the airlines.

However, the Fukushima event involved some considerable danger to people. He got that wrong, contrary to what The Guardian suggests. At the time he was making his soothing statement above, the Japanese government itself, extremely concerned about the lack of reliable information on the accident it was receiving from TEPCO, was discussing plans to evacuate Tokyo. And not even TEPCO had an accurate idea of how dangerous the circumstances were. The event at Fukushima, as we now know, could have been very much worse than it was and is, and, even though we were spared the very worst, it still could be worse than we think. Sir John, a population biologist and not a safety engineer, was inadvertently misleading his audience on a matter concerning danger.

That is, of course, one of the disadvantages of the job, when one must make public pronouncements on matters on which one is not especially expert. But I wonder why he had not received better advice?

Moving on, it is hard to leave this particular comment of The Guardian alone:

compare the political hysteria over Fukushima in Germany with the calm that prevailed here

The Guardian calling the German reaction “political hysteria” is just silly. There is considerable and long-standing political opposition to nuclear power here in Germany, including a permanent platform from a major party which has been in government, namely the Green Party. Chancellor Merkel simply adopted the Green Party platform, whereas her party had previously been “for” continued use and further building of nuclear power stations. That is normal democratic, opportunistic, representative politics. Considering that the building and use of nuclear power stations involves large amounts of taxpayers’ money being paid to private corporations – in Germany’s case, to assure them a “reasonable profit” to which they claim they have a legal right – there is a moral obligation for politicians to pay significant attention to what ordinary people think on the matter, and some evidence that, apart from the Green Party, they had not been doing so. (A more detailed comment from TheRealPM is on the Guardian page.)

Lest we forget, nobody, not even Germany, has solved the problem of what to do with the waste. It’s fifty years and counting. Someone will have to think of something soon.



Root Cause Analysis

5 02 2013

The International Electrotechnical Commission, IEC, is currently preparing an international standard to be known as IEC 62740 Root Cause Analysis. I prepared some material for potential inclusion in the standards document but as of writing it appears it will not be used. I think it is quite useful, so I hereby make it available.

The paper on the RVS WWW-site, Root Cause Analysis: Terms and Definitions, Accimaps, MES, SOL and WBA, consists of

  • a vocabulary I put together defining the terms I think are needed to talk effectively about root-causal analysis, based on the International Electrotechnical Vocabulary, IEC 60050, which all international electrotechnical standards are required to use. I am not completely happy with a variety of the definitions of fundamental concepts in the IEV. I make my discontent clear through notes which I have added to the IEV definitions. Other concepts are new, and not (yet) in the IEV. Readers might like to compare with the vocabulary which I prepared in 2008 for system safety under the auspices of Causalis Limited, Definitions for Safety Engineering.
  • Brief introductions to the following root cause analysis methods for accidents: Accimaps (from Jens Rasmussen, successfully applied by Andrew Hopkins and now the Australian Transport Safety Bureau), Multilinear Events Sequencing (MES, from Ludwig Benner, Jr. formerly of the US National Transportation Safety Board), Safety through Organisational Learning (SOL, from Babette Fahlbruch and SOL-VE GmbH, used in the German and Swiss nuclear industries), and Why-Because Analysis (WBA, originated by me and developed by colleagues at Uni Bielefeld RVS and Causalis Limited, used by two divisions of Siemens and now the German Railways DB, as well as Causalis for its accident analyses for clients). Each method description includes pictures, so readers get an idea of the presentation of results, a short section on process – what one does, and a section on strengths and limitations.

I think it would be a good thing to have similar descriptions for all methods in current industrial use for root cause analysis of significant incidents. My personal list of such methods stands currently as follows:

  • Accimaps (in the document)
  • Barrier Analysis. BA is really an a priori method favored in the process industries, but also used post hoc to determine which barriers failed and why. Typified in Reason’s “Swiss Cheese” diagram.
  • Causes-Tree Method (CTM). Widespread and, I am told, sometimes legally required in France for accident analysis.
  • Events and Causal Factors (ECF) Analysis and Diagrams. ECF is dealt with extensively in Chris Johnson’s Failure in Safety-Critical Systems: A Handbook of Accident and Incident Reporting
  • Fault Tree Analysis (FTA). I had considered FTA primarily an ab-initio risk-analysis method at system design, but Nancy Leveson tells me she has seen more root cause analysis performed with the help of fault trees, sometimes put together after an incident rather than pre-existing, than with any other technique.
  • Fishbone or Ishikawa Diagrams. These are minimally a method, more a presentation technique, and not one I find particularly helpful. More applicable in industrial quality control than in significant-incident analysis, I would think.
  • Multilinear Events Sequencing (MES, and its associated technique STEP), in the document.
  • The Reason Model of human operational analysis, involving human error in operations, classification such as skill-based, rule-based and knowledge-based operations (SRK), the notion of latent errors, or misdesign of operations allowing mishap sequences to occur normally, the “Swiss Cheese” model.
  • Safety through Organisational Learning (SOL, with its associated toolset SOL-VE), in the document.
  • STAMP and its associated methods, Leveson’s feedback-control-system model of critical-operational control, applied to the Rasmussen-Svedung hierarchy of operational, organisational and institutional context, dealt with extensively on Nancy Leveson’s WWW site
  • TRIPOD, a method developed over many years by oil companies in cooperation with Jim Reason’s group, and in wide use in the oil industry
  • Why-Because Analysis (WBA), in the document.

Besides these, there are special methods for root cause analysis of incidents involving human operations; maybe one can call these “human factors root cause analysis” methods. Amongst these are:

  • Connectionism Assessment of Human Reliability, CAHR, from Oliver Sträter’s group at Kassel, which has been used in analysing marine accidents and incidents.
  • Human Information-Processing Models. These originated with Peter Lindsay and Don Norman, and include methods sometimes used by NASA’s human factors research group (NASA Ames, at Moffett Field in California). Our PARDIA classification is such a model.
  • Human Factors Analysis and Classification System (HFACS).
  • Management Oversight and Risk Tree (MORT), developed by William Johnson for the US Nuclear Regulatory Commission and widely used in the US nuclear industry.
  • The SHEL model (note that the referenced page spells it mistakenly with two “l”s).
  • Shorrock and Kirwan’s TRACEr model for identifying and classifying cognitive error in air traffic management and control operations. For example, see this paper.

There are other promising methods which I could include, but I don’t know how much industrial “traction” they yet have. If readers could let me know of other worthwhile methods which have found some foothold in industry, I would be grateful. I would be even more grateful for descriptions of methods similar to those that are already in the document! Authorship will of course be acknowledged in the usual manner.



Aerial Collision Avoidance

9 12 2012

Just over a decade ago, in July 2002, there was a catastrophic mid-air collision of a Russian passenger aircraft heading westwards and a freighter aircraft of DHL heading northward, near the town of Überlingen on Lake Constance (Bodensee) in Southern Germany near the Swiss border. I wrote a paper on it about a month later, ACAS and the South German Midair, RVS Technical Note RVS-Occ-02-02, on 12 August 2002, in which I suggested that there were issues concerning the verification of the algorithms used in TCAS, as well as the assumptions about cockpit decision-making upon which the successful use of TCAS depends.

In May 2004 the final report of the investigating body, the German BFU, was published. It is 114 pp. long in English, without the appendices. There are mistakes in it, one of which I had already anticipated in my August 2002 note. I then wrote a paper based on my 2002 note, which accompanied an Invited Talk I gave at the Ninth Australian Workshop on Safety-Related Programmable Systems in Brisbane, Australia, in 2004, Causal Analysis of the ACAS/TCAS Sociotechnical System. This paper is also available on the Publications page of the RVS WWW site.

Neale Fulton, a colleague at the state research agency CSIRO in Canberra, who has been working on algorithms for proximity/collision avoidance for some years, recently told me of a paper by Peter Brooker, in the journal Safety Science 46(10), December 2008, entitled The Überlingen Accident: Macro-Level Safety Lessons, which refers to my work. That’s four years ago. Brooker apparently says some things about my work.

I haven’t seen the paper. Gone seem to be the old courtesies by which one forwarded a copy of an academic paper to a colleague whose work was discussed. Our library used to subscribe to the journal, until 2002, but I suppose it became too expensive. It is certainly expensive now: the publisher Elsevier wishes to charge me (or my library) €31.50 for this paper of about 15 pages. As I have said before, I don’t agree with the current commercial politics of many academic publishing houses. Not all authors do as I do to ensure that some version of a published paper appears for free on a WWW site under the auspices of the taxpayer-funded organisations who pay me a salary for this work. I hope Professor Brooker will understand me seasonally donating to charity the €31.50 I have saved by not buying his paper.

Brooker says some odd things about my work. Also, in 2008 the TCAS standard was amended. So it seems time to revisit those considerations.

There is now a TCAS II Minimum Operational Performance Standard RTCA/DO-185B. There is an FAA Technical Standard Order (TSO) TSO-C119c, and an EASA TSO ETSO-C119c, corresponding to TCAS II Version 7.1, as it is now called, which includes two changes, detailed in Change Proposals CP112E and CP115, as in this Honeywell white paper. CP112E is directly relevant to the Überlingen accident, as below.

There are three main points which I wish to address again.

First, I pointed out in my 2004/5 paper (Section 3) that use of TCAS played a direct causal role in the accident. To phrase it technically, the use of TCAS was a necessary causal factor in the collision. I proved this by means of the Counterfactual Test. However, amongst the probable causes which the BFU report lays out, this factor is missing. That is a logical mistake.

I still encounter many technical people in aviation who refuse to accept this observation. I fail to understand why the proof is not routinely accepted. Instead, few seem to want to say in public that use of TCAS was a necessary causal factor in the accident. Maybe politics and wishful thinking triumph over logic once again?

Second, Issue 4.1 of my paper concerns the fact that the Reversal RA mechanism apparently did not operate as it should have. I labelled this a requirements problem. The design of the kit did not operate in the way the requirement intended. People have waffled about this too, but here is the BFU report telling us that the failure to issue a Reversal RA was a necessary causal factor of the collision according to the Counterfactual Test:


A Eurocontrol specialist team has analysed the accident based on three TCAS simulations. Three different data sources and two different analysing tools for TCAS II were used. It is the BFU’s opinion that the following important insights can be drawn from the Eurocontrol study:
  • The analysis confirmed that the TA’s and RA’s in both airplanes were triggered according to the design of the CAS-logic
  • The simulation and the analysis of the alert sequence showed that the initial RA’s would have ensured a safe vertical separation of both airplanes if both crews had followed the instructions accurately.
  • Moreover, Eurocontrol conducted a further analysis how TCAS II would have reacted in this case with the modification CP 112 which had already been developed prior to the accident. According to the results provided, TCAS would have generated a Reversal RA after the initial RA which would have led to a sufficient vertical separation of both aircraft if the Boeing B757-200 [the DHL freighter] crew would have reacted according to the Reversal RA.

Despite this clear statement, this necessary causal factor did not appear amongst the causes in Section 3 of the BFU report.
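The Counterfactual Test invoked in both points above can be illustrated with a very crude sketch. It is a toy under stated assumptions and not the analysis in the paper: the three factor names and the rule deciding the outcome are hypothetical simplifications, chosen only to show the shape of the test (the outcome occurred, and would not have occurred had the factor been otherwise, all else held fixed).

```python
from typing import Callable, Dict

Factors = Dict[str, bool]

# What actually held, in this toy model (factor names are hypothetical).
ACTUAL: Factors = {
    "tcas_in_use": True,              # both aircraft were operating TCAS
    "controller_says_descend": True,  # the Russian crew was told to descend
    "reversal_ra_issued": False,      # no Reversal RA reached the DHL crew
}


def collision(f: Factors) -> bool:
    """Toy outcome rule: the aircraft collide if both end up descending."""
    # The DHL crew descends only in response to a TCAS descend RA,
    # and would have climbed instead had a Reversal RA been issued.
    dhl_descends = f["tcas_in_use"] and not f["reversal_ra_issued"]
    # The Russian crew descends on the controller's instruction.
    russian_descends = f["controller_says_descend"]
    return dhl_descends and russian_descends


def is_necessary(model: Callable[[Factors], bool],
                 actual: Factors, factor: str) -> bool:
    """Counterfactual Test for one factor against a given outcome model."""
    counterfactual = dict(actual)
    counterfactual[factor] = not actual[factor]  # change this factor only
    return model(actual) and not model(counterfactual)


if __name__ == "__main__":
    for name in ACTUAL:
        print(name, is_necessary(collision, ACTUAL, name))
```

In this toy model each of the three factors passes the test, including the use of TCAS and the absence of a Reversal RA. A realistic analysis would of course need a far richer account of what would have happened in each counterfactual situation.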

In fact, it was known to Eurocontrol in 2000 that Reversal RAs did not function as desired. In engineering-scientific parlance, the design of TCAS did not fulfil its requirements specification. Eurocontrol filed a change notice with the committee, CP 112, to get this fixed. Two years later, there occurred the Überlingen collision. Two years after the problem was first openly acknowledged. Then there were other near-misses, detailed in the Eurocontrol SIRE+ project. Finally, in 2008, RTCA accepted the amended CP 112+ as well as another Change Proposal, resulting in TCAS II Version 7.1 (some issues are detailed in the document Decision criteria for regulatory measures on TCAS II version 7.1 by Stéphan Chabert & Hervé Drévillon).

The anomaly was known in 2000. A major accident in which it was a causal factor occurred in 2002. The change was made in 2008. I think it is a scandal that it took so long to remedy this anomaly and that so many were killed on the way.

Third, Issue 4.5 of my paper concerned the cognitive state of the operators (the crews) and the decisions they took. I used an analysis method which I called the Rational Cognitive Model (RCM). Intuitively, it works like this. Suppose the operators were replaced by perfect robots with the same cognitive information and programmed with the TCAS operator procedures, as well as algorithms to make decisions according to the information and procedures. What would the robots do? I pointed out that the robots piloting the Russian aircraft might well have chosen to descend, as the Russian crew did, and for which they have been roundly criticised by all and sundry.

I have subsequently looked at various sociotechnical interactions using RCM. A number of them are analysed in Verbal Communication Protocols in Safety-Critical System Operations, a chapter in the Handbook of Technical Communication, Mouton-de Gruyter, 2012. I have also analysed road accidents, including multiple-vehicle pile-ups on motorways in fog, in The Assurance of Cyber-Physical Systems: Auffahr Accidents and Rational Cognitive Model Checking, which was supposed to be a chapter of a book. I applied RCMs subsequently to same-direction road traffic conflicts (as a bicycle rider, and not necessarily a slow one, I have plenty of experience to draw on). The paper is not yet available.

Ten years on, it is instructive to see how far we have come. I suggested that TCAS be verified using Rational Cognitive Model Checking (RCM-checking). RCM-checking consists in enumerating all the configurations which can occur and determining that the desired operator behaviour under decision-making gives the right outcome. I exhibited in my 2002 note and 2004 paper, and again explicitly in the 2012 Handbook chapter, a situation in which this “right outcome” cannot be assured, namely the Überlingen situation. The 2012 Handbook-chapter formalism makes clear that this is a (small) finite-state-machine calculation, well within the ability of existing model checkers.
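To give a flavour of what such an enumeration looks like, here is a minimal sketch. It is emphatically not the formalism of the Handbook chapter: the state variables, the two crew procedures (`follow_ra`, `follow_atc_if_given`) and the safety criterion (the two aircraft must manoeuvre in opposite vertical directions) are all assumptions made for illustration.

```python
from itertools import product
from typing import List, NamedTuple, Optional

MANOEUVRES = ("climb", "descend")
ATC_OPTIONS = (None, "climb", "descend")         # None: no instruction given
POLICIES = ("follow_ra", "follow_atc_if_given")  # hypothetical crew procedures


class Config(NamedTuple):
    ra_a: str              # RA issued to aircraft A (B receives the opposite)
    atc_a: Optional[str]   # controller instruction to A, if any
    atc_b: Optional[str]   # controller instruction to B, if any
    policy_a: str
    policy_b: str


def opposite(manoeuvre: str) -> str:
    return "descend" if manoeuvre == "climb" else "climb"


def decide(policy: str, ra: str, atc: Optional[str]) -> str:
    """What a 'perfect robot' following the given procedure would do."""
    if policy == "follow_atc_if_given" and atc is not None:
        return atc
    return ra


def unsafe_configurations() -> List[Config]:
    """Enumerate every configuration; keep those where both move the same way."""
    unsafe = []
    for values in product(MANOEUVRES, ATC_OPTIONS, ATC_OPTIONS,
                          POLICIES, POLICIES):
        cfg = Config(*values)
        move_a = decide(cfg.policy_a, cfg.ra_a, cfg.atc_a)
        move_b = decide(cfg.policy_b, opposite(cfg.ra_a), cfg.atc_b)
        if move_a == move_b:
            unsafe.append(cfg)
    return unsafe


if __name__ == "__main__":
    for cfg in unsafe_configurations():
        print(cfg)
```

Under these assumptions every configuration in which both crews follow their RAs is safe, while a configuration of the Überlingen kind (one crew follows a controller instruction which contradicts its RA, the other follows its RA) is reported as unsafe. The point of the sketch is only that the state space is small and finite; a real check would model the crews' information states, timing and the Reversal RA logic.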

However, verifying a specific scenario for correctness or anomaly is clearly easier than running through all possible scenarios to check. Could current automated model-checkers check and verify all such states for a given system such as TCAS? I put this question to John Rushby, who has applied model checking in similar situations (say, in his paper from 2002 on Mode Confusion and other automation surprises, of which I saw the original contribution in Liege in 1999). John has been at it three years longer than I, although I did have a go at Ev Palmer’s “Oops” example also using WBA and PARDIA in 1995-6. The latest version of John’s work with Ellen Bass, Karen Feigh and Elsa Gunter is from 2011. John suggested that checking large numbers of RCMs (say, more than 50 or so different scenarios) might well be difficult with current model checkers.

I am disappointed at the meagre take-up of these model-checking approaches to algorithms involving cooperative operator behavior. The technical material involved is not so very hard – every digital engineer nowadays has to deal with FSMs. Maybe a problem lies in that people still do not consider operator procedures subject to the same kinds of verification as other algorithms. Maybe this will change as more and more robots come “on-line” to replace humans in various activities. The safety of their interactions is surely governed by the international standard for functional safety of E/E/PE systems, IEC 61508, although for industrial fixed-base robots a new international standard is being developed. IEC 61508 requires assurance measures; maybe this will prompt interest in verification.

There are apparently still intellectual hurdles to overcome. One seems to lie in persuading people that sociotechnical procedures can be verified in the rigorous way it is (sometimes) done in informatics. Another is apparently to persuade them that this would yield any advantage. Which brings me to Brooker’s paper. Neale sent me an excerpt. Brooker takes exception to what I suggested should be done, namely
1. Check and fix the Reversal RA misfit so that design fulfils requirement.
2. Check the interaction between ACAS and Reduced Vertical Separation Minima (RVSM) more thoroughly.
3. Determine precisely in which circumstances ACAS algorithms are correct, and where they fall short.
4. Deconflict requirements and advice to pilots on use of ACAS.
5. Causally analyse the operator interactions using Rational Cognitive Models and decision theory.
6. Analyse carefully what happens when one actor has a false model of system state.

Brooker’s comment on all this: “some of Ladkin’s recommendations may not be very wise”.

Huh?

Brooker explains how he comes to this conclusion by means of an analogy. He discusses in a couple of paragraphs a situation in Ancient Rome, whereby bricks would fall off buildings onto or near passers-by. Apparently wives would push their husbands out of the way. He discusses some decision-theoretic aspects of, well, pushing one’s husband out of the way (as opposed, one might think, to pushing him under).

No arguments for relevance of this situation to that of ACAS are proffered.

So I have to look for clues around and about. Brooker says: “Ladkin says that it ‘should be precisely determined in which circumstances ACAS algorithms are correct and in which circumstances they fail.’ But the first task is precisely what has been done under ICAO’s auspices for decades (Carpenter, 2004)”. I take it from this suggestion that Brooker has little idea of what is involved in verifying algorithms, as that term is understood in informatics. And I take it he is not familiar with my work, despite citing me, or that of Rushby.

I recommend that people take a look at Fulton’s work on collision-avoidance to see what such algorithm verification might look like. And, for those who are unfamiliar with it, at Rushby’s and my work to see some ways of verifying procedures involving operator decisions.

As I indicated, I think that the poor TCAS/ACAS engineering standards which were causally involved in the deaths of 70-odd people ten years ago are a scandal, as is the fact that it took a further six years for them to start to be fixed. We are on the way to developing techniques which can be used to avoid such poor engineering in the future. I think that work should be encouraged. I don’t see any point in denigrating that endeavor through facile commentary.



Recharging Electric Road Vehicles

31 10 2012

I chair a group of specialists (electrical engineers, safety analysts, others) mandated by the German electrical-engineering standardisation organisation DKE to undertake a risk analysis of the process of recharging electric road vehicles.

We have been working now for close on one and a half years, on conductive charging, and have a document under internal review purporting to offer a high-level risk analysis of recharging using so-called “Mode 3”, in which a charging station permanently attached to the ground or to a structure is used. This mode offers charging-service providers and equipment providers the widest scope to ensure safety of the charging process, because anything considered necessary to assure an appropriate degree of safety (“safety functions” in the lingo of IEC 61508) can be built in to the box.

Other modes are Mode 2, in which a “box” with appropriate circuitry and safety mechanisms is built into the cable used for charging a vehicle, while the cable itself plugs straight in to building circuitry; and Mode 1, in which a charging cable is attached at one end to the vehicle and at the other to building circuitry, without any intermediating electrics or electronics.

The Renault Twizy car has a cable in front allowing Mode 1 charging (also Mode 3) through a normal “SchuKo” plug (“SchuKo” is short for “Schutz-Kontakt”, which means “contact-protected”, the usual kind of household plug through which current cannot flow until the person handling the plug is physically separated from live parts).

Inductive charging is somewhat further in the future.

The method we are using is a mix of OHA and HazOp. The OHA part is to consider the entire connected chain as a system, consisting of objects (subsystems)

  • grid supply
  • fixed charging column with connection to grid
  • charging column/charging cable interface (plugset)
  • charging cable
  • charging cable/vehicle interface
  • vehicle

and to define the properties of and relations between these objects which we consider relevant to safety properties. We use the HazOp guideword process to extend the set of properties to consider and to guide us to possible hazard situations. We associated each hazard specifically with one of the subsystems involved in it.
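As a rough illustration of how guidewords extend the set of properties to consider, the following sketch crosses each subsystem with a few properties and with the basic HazOp guidewords to produce questions for the group to work through. The subsystem list is the one above; the three properties are hypothetical placeholders, and the guidewords are the commonly used basic set, not necessarily the list our group used.

```python
from itertools import product
from typing import List

SUBSYSTEMS = (
    "grid supply",
    "fixed charging column with connection to grid",
    "charging column/charging cable interface (plugset)",
    "charging cable",
    "charging cable/vehicle interface",
    "vehicle",
)

# Hypothetical properties, for illustration only.
PROPERTIES = ("current", "voltage", "earth connection")

# Commonly used basic HazOp guidewords (assumed, not taken from the analysis).
GUIDEWORDS = ("no", "more", "less", "as well as",
              "part of", "reverse", "other than")


def hazop_prompts() -> List[str]:
    """One question per (subsystem, property, guideword) combination."""
    return [
        f"{subsystem}: {guideword.upper()} {prop}?"
        for subsystem, prop, guideword in product(SUBSYSTEMS, PROPERTIES, GUIDEWORDS)
    ]


if __name__ == "__main__":
    for prompt in hazop_prompts():
        print(prompt)
```

Most of the generated questions will be dismissed quickly as meaningless or harmless; the value of the exercise lies in the few which are not, such as “charging cable: REVERSE current?”.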

We then used event trees to estimate the severity (worst-possible outcome) of each hazard. We were concerned with outcomes “electric shock” (to a person) and “fire”. We consider electric shock to a person to be at worst immediately deadly, and fire less so because a person has a certain possibility in general to extricate him- or herself from a fire situation. We evaluated each hazard as to whether it was unforeseeable, theoretically possible, or plausible.
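The bookkeeping behind this step can be sketched as follows. The ordering of outcomes and the three plausibility classes come from the description above; the data layout, the `Hazard` record and the example hazard with its event tree are hypothetical and serve only to show how a severity is read off a tree as its worst leaf.

```python
from dataclasses import dataclass
from typing import Dict, Union

# Outcomes ranked as in the text: electric shock is taken to be worse than fire.
SEVERITY = {"no harm": 0, "fire": 1, "electric shock": 2}
PLAUSIBILITY = ("unforeseeable", "theoretically possible", "plausible")

# An event tree is either an outcome (leaf) or a dict: branch label -> subtree.
EventTree = Union[str, Dict[str, "EventTree"]]


def worst_outcome(tree: EventTree) -> str:
    """Severity of a hazard: the worst outcome on any path through the tree."""
    if isinstance(tree, str):
        return tree
    return max((worst_outcome(branch) for branch in tree.values()),
               key=lambda outcome: SEVERITY[outcome])


@dataclass
class Hazard:
    subsystem: str      # each hazard is associated with one subsystem
    description: str
    plausibility: str   # one of PLAUSIBILITY
    tree: EventTree

    def severity(self) -> str:
        return worst_outcome(self.tree)


# Hypothetical example hazard, for illustration only.
example = Hazard(
    subsystem="charging cable",
    description="damaged cable insulation",
    plausibility="plausible",
    tree={
        "protective device trips": "no harm",
        "protective device fails": {
            "nobody touches the cable": "fire",
            "person touches the cable": "electric shock",
        },
    },
)

if __name__ == "__main__":
    print(example.description, example.severity(), example.plausibility)
```

Note that the sketch records no probabilities on the branches; severity here is purely the worst leaf, and the plausibility class is assigned to the hazard separately.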

There are a number of memes concerning this task which I would like to introduce into discussion amongst safety specialists. I would like to ask for any of your thoughts on the following memes. I would like to share some thoughts transparently with colleagues, and wish to give appropriate credit for contributions, so I would be grateful if you would indicate whether your name, with or without your affiliation, may be associated with your view or whether you wish your comment to be anonymous. My email address is ladkin”AT”rvs”DOT”uni-bielefeld”DOT”de.

Meme 1. Electric vehicles are no different from other devices, for example lawnmowers, in the business of being attached to the grid. The same issues arise with electric vehicles as with lawnmowers: no more nor less.

PBL: I strongly disagree with this assertion. Electric road vehicles store large amounts of energy in batteries; lawnmowers don’t. This energy could theoretically, through malfunction, be discharged into the circuit to which the vehicle is connected; lawnmowers cannot do this. The energy could also intentionally be made available to power building circuits; lawnmowers cannot offer this.

Meme 2. Any risks resulting in electric shock or fire resulting from charging an electric vehicle on a household or building circuit are already known, and have been for decades.

PBL: I have not seen a proof of this assertion. Surely, to prove this assertion it is necessary to perform a risk analysis? Before ours, to my knowledge, one has not been performed.

Meme 3. Any risks resulting in electric shock or fire resulting from charging an electric vehicle on a household or building circuit are fully covered by an adequate set of electrical standards.

PBL: I have not seen a proof of this assertion. Surely, to prove this assertion it is necessary to perform a risk analysis and to see explicitly that all purported risks are already covered in the existing standards?

Meme 4. The term “risk analysis” gives lay people who might buy them the impression that there are risks associated with electric vehicles and so the term should be avoided at all costs.

PBL: There are obviously risks associated with any road vehicles including electric ones. The term “risk analysis” is a technical term denoting a specific kind of analysis which IEC Safety Guide 51 requires to be performed for any standard which concerns safety of equipment. I do not agree with avoiding precise, universal technical terms because they might in some way “scare” lay people. I suggest, instead, explaining what the technical term means and that such analysis is part of defined best-practice.

Meme 5. Any risks associated with the electric vehicle are covered by the requirements of ISO 26262 (governing the functional safety of road vehicle E/E/PE systems). Any risks associated with the charging system are covered by the requirements of IEC 61508 (governing functional safety of E/E/PE systems). Therefore any risks of charging such vehicles are fully covered.

PBL: There are two mistakes here.

First is to argue from the Premisses that (a) the risks involved in using System A are known, and (b) the risks involved in using System B are known, to the Conclusion (c) that the risks in using A-composed-with-B are known. Counterexamples abound.

Second is to think that IEC 61508 (indeed ISO 26262) works like, say, an electrical-safety standard: that if you do this-and-this everything will be alright. IEC 61508 specifies how care is to be taken, and what analyses are to be done, in designing and operating safety-related E/E/PE kit. It does not, and cannot, guarantee any specific outcome (such as freedom from accidents); whereas standards in electrical safety are intended to guarantee freedom from electric shock.

Meme 6: There are no risks associated with maintaining and operating electric road vehicles that are not also associated with maintaining and operating gasoline-powered road vehicles.

PBL: This is obviously not true.

For example, the possibility of a dangerous electric shock from an electric road vehicle is obviously different from the possibility of a dangerous electric shock from a gasoline-powered road vehicle.

A second example: gasoline-powered cars are refueled in spaces set aside from the road for this very purpose, called gas stations or petrol stations, and behavior on or around them is controlled. Dangerous accidents with speeding vehicles are unlikely there. By contrast, it is proposed to “refuel” electric road vehicles while they are parked on the public road – indeed we have two such recharging points in Bielefeld. Vehicles parked on the public road are more susceptible to involvement in higher-speed collisions with their ensuing damage.

A third example: damaged electric road vehicles have been known to burst into flames many days or weeks later. Luckily, known instances have been test cars at storage sites.

A fourth example: batteries in some electric road vehicles are susceptible to thermal runaway. Much smaller batteries in most gasoline-powered vehicles are not.

Meme 7: The risks associated with maintaining and operating electric road vehicles are equivalent to those associated with maintaining and operating gasoline-powered road vehicles.

PBL: The word “equivalent” here has an unclear meaning. Suppose it is to be given a precise meaning (say, chances of death or serious injury). Then surely a risk analysis, of which a risk analysis of recharging electric road vehicles is part, must be performed in order to be able to draw such a conclusion.

Meme 8: A risk analysis without listing the possible causes of the hazards is not helpful.

PBL: There may be many and varied causes of a hazard. For example, damaged electronics which lead to a later disadvantageous effect on behavior. How could electronics be damaged in such a way? There are quite a lot of examples in the literature. Maybe Kevin Driscoll’s slide show Murphy Was An Optimist, Version 19 of which is at http://www.rvs.uni-bielefeld.de/publications/DriscollMurphyv19.pdf , is a good place to start. What one really wants to do as the result of a risk analysis is to reduce the risk. One way of doing that may well be to mitigate the hazard by hindering the most deleterious consequences given that it has occurred. Given the variety of damage that might be caused to electronics, maybe in ways we haven’t thought of yet, indeed, given that it is an uncompleted major project of one of the leading researchers in the field, listing all the specific causes and the damage that ensues seems to me less helpful for the task of risk-assessing recharging operations than abstracting and considering what might result from any situation in which there is “damaged electronics whose behavior is different from that required and expected”.

Meme 9: These issues are concerned with electrical safety. Functional safety has no role to play.

PBL: As these technical terms are defined, electrical safety is part of functional safety for E/E/PE equipment.

Acknowledgement: Thank you to Bernd Sieker for commentary and critique.



Concerns About Spent Fuel Pool 4 at Fukushima Daiichi

5 06 2012

In Risks-26.86, Tobin Macginnis pointed to a Japanese documentary on the continuing dangers of SFP4, via Dave Farber’s IP list and PGN’s redaction. In Risks-26.87, Dan Yurman claimed in response that

this nonsense has been thoroughly debunked by a special post at the blog of the American Nuclear Society

as well as

Scare the socks off people propaganda is never a substitute for engineering reality. You might just as well try to build railroads on snow drifts

He linked to the post, by a former navy nuclear technician Will Davis. When you look at the post, please do note the URL: “spent-fuel-at-fukushima-not-dangerous“. What guff! Of course it’s dangerous. The actual written headline is more benign: “Spent fuel at Fukushima Daiichi safer than asserted“.

Yurman’s claim of “propaganda” got my goat, for his post itself seemed to me to be little more than that. I sent PGN and Yurman a message saying so. Yurman responded that

No one on the [ANS Fukushima commentary] team is interested in propaganda. The article went through two rounds of fact checking.

I replied that I thought he (and Davis) were ignorant of basic safety engineering techniques and suggested

* he [and colleagues at ANS] perform a hazard analysis, followed by

* enumerating the worst-case outcome from each hazard identified, and

* giving some kind of assessment of the chance that that worst-case outcome will be realised

Yurman replied that he was sorry to see that I had “chosen to make emotional insults over engaging in dialog“.

Such reactions are why I prefer to avoid such “dialog”. Yurman had publicly asserted that people worried about the worst-case outcome of an SFP4 structural failure were engaging in “propaganda”. When I suggest he was ignorant of system safety techniques and might like to try a hazard and risk analysis, he takes that as an insult. It is rather a statement of fact, followed by a sensible suggestion. He is right about the emotion, though – I strongly believe that people who comment in public on matters of engineering detail should both possess and use the appropriate engineering knowledge, and I didn’t think either Yurman or Davis were exhibiting it.

The steps above are recommended by ISO/IEC Guide 51: Safety aspects – Guidelines for their inclusion in standards, 1999. Guide 51 says that a hazard analysis should be performed, followed by an assessment of the risk, and a step to introduce measures for risk reduction (mainly avoidance and mitigation of the risk). I regard an assessment of the worst-case outcome of a hazard as part of such a risk assessment, as do most system safety engineers (for example, it is built in to the definition of “risk” in Leveson’s book Safeware, Addison-Wesley 1995) and sociologists concerned with technological risk (see, for example, Lee Clarke’s book Worst Cases, University of Chicago Press, 2005).

So, this approach is standard in system safety engineering and I think Yurman is ignorant of it. He is by no means the only one. Had the operator Tepco performed such an analysis of the tsunami risk before March 2011, rather than, say, peremptorily dismissing the concerns of a tsunami expert at a meeting at the regulator two years before, we would likely not be discussing an accident at all and the prospects for the future of nuclear power would still seem rosy. Indeed, Tepco had no need to perform such an analysis: it had been done for them. Dave Lochbaum of the UCS had pointed out the dangers of station blackout through flooding the basement equipment of BWRs as early as 1992, and this specific danger, of essential equipment being rendered susceptible to flooding, resulting in a station blackout, was also written out explicitly in Charles Perrow’s book The Next Catastrophe, Princeton University Press, 2005. (Perrow was maybe wrong; it wasn’t the next catastrophe, it was the next-but-one, if you count Deepwater Horizon as a catastrophe).

Davis argues in the ANS article that

there’s no basis to assertions of shaky buildings, or a structurally failed 1F-4 plant, or the chance of zircalloy cladding fire, or billowing of the released material to the entire earth

and recommends

Realistic, practical analysis, performed by personnel on site (TEPCO/NISA), nuclear professionals here in the United States with decades of experience in both theory and practice, and official peer-reviewed studies and documents (e.g., NUREG /CR-4982)

Yes, there is nothing like an appeal to authority to sound authoritative. Keep in mind former Prime Minister Naoto Kan’s recent comments, reported by Martin Fackler in the New York Times on May 28, about the difficulties he had getting reliable information and advice from the operator Tepco in the days of emergency just after the accident, and his conclusion that these characteristics are so entrenched in the power companies and their support structure (the “nuclear village” as he called it) that Japan cannot safely run nuclear power operations. Consider also that Tepco manifestly missed the tsunami risk for 46 years. One can well wonder at the wisdom of taking Tepco at its word. As for those US “nuclear professionals” and “official peer-reviewed studies and documents“, how many of those people have actually performed an on-site inspection of the SFP4 structural modifications, followed by an analysis and assessment? As far as I know, only the operator and its contractors know the details of the structural modifications.

Davis thinks there is “no basis to assertions of….shaky buildings“. I would feel more comfortable if the operator’s design and execution of the structural modifications (including the ad-hoc cooling system) had been assessed by a qualified independent third-party and the results made publicly available. That “independent” bit appears, from recent history, particularly hard to achieve. Tepco claims, according to Davis, that the structural mods have been simulated in design-basis earthquake conditions. One wonders as usual about the assumptions made for the simulation, which obviously include how strong earthquakes behave; our current knowledge of such matters is not particularly reliable. There is also some reason to question whether the plant even adequately withstood the Tohoku quake itself, which is claimed to be within “design basis”.

Davis oddly suggests that “there is no basis for assertions of… billowing of the released material to the entire earth“. In fact, most radioactive material released to the atmosphere becomes circumglobal, as would be apparent to anyone who has looked at such distributions.

Enough of the background chatter. Let’s actually do what I suggested system safety engineers do, from the relative safety of our armchairs thousands of miles away. It’s not hard – it’ll fit into a couple of hundred words.

1. What is the hazard we are concerned with at SFP4? There are actually two.

a. Permanent loss of coolant and thus fuel-rod cover at SFP4 because of a leak or cooling-system failure;

b. Collapse of the SFP4 structure.

2. How could this happen? The structure could be compromised or collapse by itself, people having mistakenly assessed its stability. Or a major earthquake could compromise it.

3. What would be the outcome?

Concerning a: The fuel rods would heat up. The fuel itself is contained in a zirconium cladding, which is under internal pressure from gas (some is intentional; some more gas may have been produced as a result of the high temperatures attained during the cooling emergency in the early weeks of the accident). Zirconium begins to corrode at temperatures of around 100°C, which as far as I can tell are quite likely to be obtained if there is no coolant. After a while, the cladding would be compromised and the hot radioactive material in the fuel rods would be exposed to the atmosphere.

Concerning b: Fuel elements, which are some 4m long and not intended to be dropped from a height, could be damaged through impact if parts of SFP4 collapsed (recall SFP4 is many stories in the air) and could well break open, again exposing the radioactive fuel to the atmosphere.

Exposing this fuel directly to the atmosphere would result in radioactive material being released into the air. How much is released is anyone’s guess – it depends on how many rods are compromised. Once that process starts, it is going to be very difficult to get anyone near enough to it to be able to hinder its progression.

Those are the conclusions that Davis and Yurman would come to if they were able and willing to perform basic system safety analyses of the sort we teach to our undergraduates.



The Accident to Qantas Flight 72, VH-QPA, in October 2008

21 12 2011

The Airbus A330-303 VH-QPA experienced uncommanded nose-down pitch commands while in cruise at FL370. Lots of unsecured people were thrown to the ceiling, and some were injured severely. The aircraft declared an emergency and landed as soon as practicable, at Learmonth, where the injured were treated and several hospitalised. It has been known for a while that the accident was caused by data anomalies from an air data computer (ADIRU) which were not filtered out by the primary flight control computers (Flight Control Primary Computers, FCPC, also known as PRIM). However, it has been a mystery – and remains so – how the anomalous data values were generated. It has happened three times: twice with the unit on VH-QPA, and once on another unit on another aircraft, also Qantas, also in Western Australia, within a couple of months of this incident.

The fix is apparently to modify the BITE test of the ADIRU specifically to look for such anomalies, and to modify the data-filtering algorithms of the Flight Control Primary Computers (FCPC, also known as PRIM) of the A330.

The Final Report is now available on the ATSB WWW site.

There was a note from Andrew Heasley in Risks-26.67 with a title saying the accident was “Blamed on Software“, pointing to a newspaper article. I find this claim misleading. The problem which arose had nothing to do with anything for which any software engineer would have been responsible.

The fixes were implemented in both SW and HW, but fixes to non-SW problems are very often implemented in SW.

The PRIMs ran a data-assurance algorithm for data received from three different ADIRUs, which are electronic boxes built by a different manufacturer. This data assurance algorithm had a specific vulnerability to spiky angle-of-attack (AoA) data presented in a particular time-sequential manner, which was exploited during the occurrence. The algorithm, which uses AoA data from three ADIRUs, filters out multiple data spikes from a unit which occur within a specific time frame. Spikes on the culprit ADIRU occurred with similar values just over the boundary of this time frame, and were thus taken as veridical by the PRIMs. The resolution algorithms for the AoA data (with that from the other ADIRU units) in the PRIMs let these values through, and the PRIMs reacted accordingly by commanding sudden nose-down pitch.
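To make the time-window vulnerability concrete, here is a toy sketch in Python. It is emphatically not the FCPC algorithm: the only figure taken from the ATSB report is the 1.2-second memorisation period (quoted in the executive summary below). The deviation threshold, the 100 ms sample interval, the spike value and, above all, the rule that the first sample after the hold window is accepted without a fresh check are assumptions of mine, chosen purely so that the failure pattern can be reproduced. All names (`ToyAoaFilter`, `run`) are hypothetical.

```python
from dataclasses import dataclass

HOLD_MS = 1200           # memorisation period from the ATSB summary (1.2 s)
DEVIATION_LIMIT = 5.0    # hypothetical threshold in degrees (assumption)


@dataclass
class ToyAoaFilter:
    """Toy illustration only; NOT the real FCPC data-assurance algorithm."""
    recheck_after_hold: bool = False  # False reproduces the toy vulnerability
    memorised: float = 0.0            # last accepted AoA value
    hold_until: int = -1              # time (ms) at which the hold window ends
    holding: bool = False

    def step(self, t_ms: int, aoa1: float, aoa2: float, aoa3: float) -> float:
        if t_ms < self.hold_until:
            return self.memorised     # inside the hold window: ignore inputs
        first_after_hold = self.holding
        self.holding = False
        median = sorted((aoa1, aoa2, aoa3))[1]
        deviates = max(abs(aoa1 - median), abs(aoa2 - median)) > DEVIATION_LIMIT
        # Assumed flaw: the first sample after a hold is not re-checked.
        if deviates and (self.recheck_after_hold or not first_after_hold):
            self.holding = True
            self.hold_until = t_ms + HOLD_MS
            return self.memorised
        self.memorised = (aoa1 + aoa2) / 2.0
        return self.memorised


def run(spike_times_ms, recheck_after_hold=False):
    """Feed 3 s of samples; AoA 1 shows 50.0 at the spike times, else 2.0."""
    flt = ToyAoaFilter(recheck_after_hold=recheck_after_hold)
    outputs = []
    for t_ms in range(0, 3001, 100):
        aoa1 = 50.0 if t_ms in spike_times_ms else 2.0
        outputs.append(flt.step(t_ms, aoa1, 2.0, 2.0))
    return outputs


single = max(run({1000}))                # one spike: masked by the hold
inside = max(run({1000, 1500}))          # second spike inside the window: masked
boundary = max(run({1000, 2200}))        # second spike at the window boundary
fixed = max(run({1000, 2200}, recheck_after_hold=True))
print(single, inside, boundary, fixed)
```

In this toy model a single spike, or a second spike inside the window, never reaches the output, whereas a second spike arriving exactly as the window closes is averaged into the value passed on. Re-checking after the hold removes the effect in the toy, which is only meant to show the shape of the problem, not the shape of the actual fix.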

Responsibility for the design of such algorithms lies clearly with those who are experts on the engineering of electronic data generation and transmission equipment, not on any software engineers.

To give a similar example with which I have been recently involved, it turns out that signals of certain frequencies in AC electric circuits can bypass the Type A and Type B circuit protection equipment (circuit breakers) that are required in most electric circuits (household and industrial) in Germany. A committee on which I sit has recently considered attaching equipment which is, as far as we know, theoretically capable of generating such frequencies to such circuits. A similar situation, how to handle anomalous signals, but no SW in sight. Pure electrical engineering.

Concerning my earlier note here on Certification Requirements for Commercial Airplanes, I find it interesting and commendable that the Bureau considered likelihoods of events in their summary (quoted below). However, I don’t believe they formulated it in quite the words I would have liked to have read.

They give reason to classify the event as “hazardous”, and with a fleet operating experience of 28 million flight hours this occurrence fits within the expected value (a technical term) of the operating time within which the effects of a hazardous event may occur (defined to be less than or equal to one occurrence within ten million operating hours), according to the acceptable means to determine compliance with certification criteria (now known as AMC 25). Notice it is not the event itself of which they assess the occurrence – that has occurred three times – but the deleterious effects upon safety of the event, which have only occurred once.
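The arithmetic behind "fits within the expected value" can be checked in a few lines. This is a minimal sketch assuming a constant occurrence rate at the limiting value stated above (one per ten million operating hours) and the 28 million fleet hours quoted from the report; the function name is hypothetical.

```python
FLEET_HOURS = 28e6         # A330/A340 fleet flight hours quoted by the ATSB
MAX_RATE_PER_HOUR = 1e-7   # limiting "hazardous" rate as stated in the text
OBSERVED_EFFECTS = 1       # hazardous effects seen so far (QF72 only)


def expected_occurrences(hours: float, rate_per_hour: float) -> float:
    """Expected number of occurrences at a constant rate over the given hours."""
    return hours * rate_per_hour


expected = expected_occurrences(FLEET_HOURS, MAX_RATE_PER_HOUR)
within_expectation = OBSERVED_EFFECTS <= expected
print(f"expected at limiting rate: {expected:.1f}, observed: {OBSERVED_EFFECTS}")
print("within expectation:", within_expectation)
```

At the limiting rate one would expect up to about 2.8 occurrences of hazardous effects in the fleet experience so far, so a single occurrence lies within that expectation.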

They speak of “certification requirements“. Strictly speaking, this is incorrect. The certification requirements are expressed in CS 25 and do not involve probabilities. The severity classification terms “catastrophic”, “hazardous” etc and their associated acceptable/unacceptable frequencies occur in risk-matrix-type form in the Acceptable Means of Compliance document which accompanies the certification requirements (AMC 25), not the requirements themselves. (I note that these documents were called something slightly different at A330 certification time, 1993).

The certification requirements themselves are quite clear: the airplane shall behave in such-and-such a manner. If a wing falls off, or a flight control computer sends it into a loop, it is obviously not behaving in that manner, and thus violates certification requirements. However, it is accepted that one cannot provide proof that such untoward things will never ever happen (will the sun rise tomorrow? Will your steering wheel come off in your hands? Will your control sidestick come out of its holder in your hand?), so a less strenuous regime based on arguing likelihoods is defined as an “Acceptable Means of Compliance” with the regulations for purpose of certification.

This is not hair-splitting. It has consequences, in particular in this case, for how anomalies are dealt with, as follows.

If the requirement were that, say, “hazardous effects shall only occur on average once in between 10^7 and 10^9 operating hours“, which is what the AMC says you have to show to demonstrate compliance acceptably, then it would have been open to the manufacturer to do nothing in reaction to the QF72 event: the hazardous effects occurred only within the expected time value of their occurrence. If you think about it, it would also be open to a manufacturer to do nothing until the second occurrence of any hazardous or indeed catastrophic effects, even if the problem occurred first within the early experience of flying the aircraft! This is simply a consequence of the meaning of the probabilistic concepts used.

Whereas, as things now stand, separating requirements, which are absolute, from acceptable compliance (which may be based on occurrence frequency), any in-flight anomalous behavior must be fixed or the airworthiness certificate will be withdrawn. This is because such behavior violates the written requirements, that the aircraft shall not behave that way. To repeat, the conditions on behavior are absolute, not likelihood-based.

And that is how one wants things: The requirements are absolute, but it is accepted that in science and engineering you are often only convinced to some degree, so it is regarded as acceptable to argue your conviction up to a certain degree, and not to have to prove it, which would likely be impossible. But if something does go wrong, you want it fixed right away.

One can argue that any given set of occurrences is compatible with any probability requirement whatever, and thus that probabilistic requirements are inappropriate to determine airworthiness in any case. However, I don’t think such an argument works. Say these three events had occurred within 3 million operating hours, each with damage. One could estimate the likelihood that a piece of equipment fulfilling the condition of an expected value of at most once in 10 million operating hours exhibits three events within 3 million operating hours. One would conclude that it is unlikely, say with small probability P. It follows that the situation that the aircraft fulfills the acceptable-compliance criterion has the same probability P. The small probability P that the aircraft acceptably complied with certification requirements would provide good reason for withdrawing the airworthiness certificate.
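The estimate of P can be sketched numerically. The sketch below assumes that events arrive as a Poisson process at exactly the limiting rate of one per 10 million operating hours; that modelling choice is mine, made only for illustration, and the function name is hypothetical. The figure it produces is the probability of seeing three or more events in 3 million hours, given compliance at that limiting rate.

```python
import math


def poisson_tail(k: int, lam: float) -> float:
    """P(N >= k) for N ~ Poisson(lam), via the complement of the lower sum."""
    lower = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return 1.0 - lower


RATE_PER_HOUR = 1e-7   # limiting rate: one event per 10 million hours
HOURS = 3e6            # hypothetical exposure from the thought experiment
EVENTS = 3             # hypothetical number of damaging events

lam = RATE_PER_HOUR * HOURS        # expected number of events, about 0.3
p = poisson_tail(EVENTS, lam)      # probability of three or more events
print(f"expected events: {lam:.2f}, P(at least {EVENTS}) = {p:.4f}")
```

Under these assumptions P comes out at roughly 0.0036, that is, well under one percent.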

Concerning the data anomaly itself stemming from the ADIRU, its cause remains a mystery. The report says:


Some of the potential triggering events examined by the investigation included a software ‘bug’, software corruption, a hardware fault, physical environment factors (such as temperature or vibration), and electromagnetic interference (EMI) from other aircraft systems, other on-board sources, or external sources (such as a naval communication station located near Learmonth). Each of these possibilities was found to be unlikely based on multiple sources of evidence. The other potential triggering event was a single event effect (SEE) resulting from a high-energy atmospheric particle striking one of the integrated circuits within the CPU module. There was insufficient evidence available to determine if an SEE was involved, but the investigation identified SEE as an ongoing risk for airborne equipment.

The report says that the manufacturer is developing a modification to the BITE to detect such failure modes:


Without knowing the exact failure mechanism, there was limited potential for the ADIRU manufacturer to redesign units to prevent the failure mode. However, it will develop a modification to the BITE to improve the probability of detecting the failure mode if it occurs on another unit.

Here is the executive summary. It is well and concisely written. I include the three paragraphs about seat belts and the investigative process for completeness.

Executive Summary

At 0132 Universal Time Coordinated (0932 local time) on 7 October 2008, an Airbus A330-303 aircraft, registered VH-QPA and operated as Qantas flight 72, departed Singapore on a scheduled passenger transport service to Perth, Western Australia. At 0440:26, while the aircraft was in cruise at 37,000 ft, ADIRU 1 started providing intermittent, incorrect values (spikes) on all flight parameters to other aircraft systems. Soon after, the autopilot disconnected and the crew started receiving numerous warning and caution messages (most of them spurious). The other two ADIRUs performed normally during the flight.

At 0442:27, the aircraft suddenly pitched nose down. The FCPCs commanded the pitch-down in response to AOA data spikes from ADIRU 1. Although the pitch-down command lasted less than 2 seconds, the resulting forces were sufficient for almost all the unrestrained occupants to be thrown to the aircraft’s ceiling. At least 110 of the 303 passengers and nine of the 12 crew members were injured; 12 of the occupants were seriously injured and another 39 received hospital medical treatment. The FCPCs commanded a second, less severe pitch-down at 0445:08.
The flight crew’s responses to the emergency were timely and appropriate. Due to the serious injuries and their assessment that there was potential for further pitch-downs, the crew diverted the flight to Learmonth, Western Australia and declared a MAYDAY to air traffic control. The aircraft landed as soon as operationally practicable at 0532, and medical assistance was provided to the injured occupants soon after.

FCPC design limitation

AOA is a critically important flight parameter, and full-authority flight control systems such as those equipping A330/A340 aircraft require accurate AOA data to function properly. The aircraft was fitted with three ADIRUs to provide redundancy and enable fault tolerance, and the FCPCs used the three independent AOA values to check their consistency. In the usual case, when all three AOA values were valid and consistent, the average value of AOA 1 and AOA 2 was used by the FCPCs for their computations. If either AOA 1 or AOA 2 significantly deviated from the other two values, the FCPCs used a memorised value for 1.2 seconds. The FCPC algorithm was very effective, but it could not correctly manage a scenario where there were multiple spikes in either AOA 1 or AOA 2 that were 1.2 seconds apart.

Although there were many injuries on the 7 October 2008 flight, it is very unlikely that the FCPC design limitation could have been associated with a more adverse outcome. Accordingly, the occurrence fitted the classification of a ‘hazardous’ effect rather than a ‘catastrophic’ effect as described by the relevant certification requirements. As the occurrence was the only known case of the design limitation affecting an aircraft’s flightpath in over 28 million flight hours on A330/A340 aircraft, the limitation was within the acceptable probability range defined in the certification requirements for a hazardous effect.

As with other safety-critical systems, the development of the A330/A340 flight control system during 1991 and 1992 had many elements to minimise the risk of a design error. These included peer reviews, a system safety assessment (SSA), and testing and simulations to verify and validate the system requirements. None of these activities identified the design limitation in the FCPC’s AOA algorithm.

The ADIRU failure mode had not been previously encountered, or identified by the ADIRU manufacturer in its safety analysis activities. Overall, the design, verification and validation processes used by the aircraft manufacturer did not fully consider the potential effects of frequent spikes in data from an ADIRU.

ADIRU data-spike failure mode

The data-spike failure mode on the LTN-101 model ADIRU involved intermittent spikes (incorrect values) on air data parameters such as airspeed and AOA being sent to other systems as valid data without a relevant fault message being displayed to the crew. The inertial reference parameters (such as pitch attitude) contained more systematic errors as well as data spikes, and the ADIRU generated a fault message and flagged the output data as invalid. Once the failure mode started, the ADIRU’s abnormal behaviour continued until the unit was shut down. After its power was cycled (turned OFF and ON), the unit performed normally.

There were three known occurrences of the data-spike failure mode. In addition to the 7 October 2008 occurrence, there was an occurrence on 12 September 2006 involving the same ADIRU (serial number 4167) and the same aircraft. The other occurrence on 27 December 2008 involved another of the same operator’s A330 aircraft (VH-QPG) but a different ADIRU (serial number 4122). However, no factors related to the operator’s aircraft configuration, operating practices or maintenance practices were found to be associated with the failure mode.

Many of the data spikes were generated when the ADIRU’s central processor unit (CPU) module intermittently combined the data value from one parameter with the label for another parameter. The exact mechanism that produced this problem could not be determined. However, the failure mode was probably initiated by a single, rare type of trigger event combined with a marginal susceptibility to that type of event within the CPU module’s hardware. The key components of the two affected units were very similar, and overall it was considered likely that only a small number of units exhibited a similar susceptibility.

Some of the potential triggering events examined by the investigation included a software ‘bug’, software corruption, a hardware fault, physical environment factors (such as temperature or vibration), and electromagnetic interference (EMI) from other aircraft systems, other on-board sources, or external sources (such as a naval communication station located near Learmonth). Each of these possibilities was found to be unlikely based on multiple sources of evidence. The other potential triggering event was a single event effect (SEE) resulting from a high-energy atmospheric particle striking one of the integrated circuits within the CPU module. There was insufficient evidence available to determine if an SEE was involved, but the investigation identified SEE as an ongoing risk for airborne equipment.

The LTN-101 had built-in test equipment (BITE) to detect almost all potential problems that could occur with the ADIRU, including potential failure modes identified by the aircraft manufacturer. However, none of the BITE tests were designed to detect the type of problem that occurred with the air data parameters.

The failure mode has only been observed three times in over 128 million hours of unit operation, and the unit met the aircraft manufacturer’s specifications for reliability and undetected failure rates. Without knowing the exact failure mechanism, there was limited potential for the ADIRU manufacturer to redesign units to prevent the failure mode. However, it will develop a modification to the BITE to improve the probability of detecting the failure mode if it occurs on another unit.

Use of seat belts

At least 60 of the aircraft’s passengers were seated without their seat belts fastened at the time of the first pitch-down. Consistent with previous in-flight upset accidents, the injury rate, and injury severity, was substantially greater for those who were not seated or seated without their seat belts fastened.

Passengers are routinely reminded every flight to keep their seat belts fastened during flight whenever they are seated, but it appears some passengers routinely do not follow this advice. This investigation provided some insights into the types of passengers who may be more likely not to wear seat belts, but it also identified that there has been very little research conducted into this topic by the aviation industry.

Investigation process

The Australian Transport Safety Bureau investigation covered a range of complex issues, including some that had rarely been considered in depth by previous aviation investigations. To do this, the investigation required the expertise and cooperation of several external organisations, including the French Bureau d’Enquêtes et d’Analyses pour la sécurité de l’aviation civile, US National Transportation Safety Board, the aircraft and FCPC manufacturer (Airbus), the ADIRU manufacturer (Northrop Grumman Corporation), and the operator.



Dealing With Nuclear Waste

2 12 2011

The Independent reports today on a written statement by UK Energy Minister Hendry to Parliament on what the Government is deciding to do with its radioactive waste from nuclear power generation.

The British government has decided on a project to convert plutonium waste into MOX fuel, maybe for “a new generation of nuclear power plants“.

The decision, which ends decades of uncertainty on how to deal with a growing stockpile of more than 112 tonnes of plutonium waste, was presented as a written Parliamentary statement by the energy minister, Charles Hendry.

Indeed for half a century Britain, like many other countries with nuclear power plants, has not known what to do with nuclear power’s most toxic waste product.

Nuclear power relies on highly radioactive “fuel”, formed usually in the shape of rods, which engage in a chain reaction in the core of a nuclear reactor and produce heat. The chain reaction converts substances eventually into other substances which are no longer suitable for purpose; the fuel is “spent” and must be replaced. But the “spent fuel” remains highly radioactive. It is very toxic, must be carefully shielded from the environment and people, and this must go on with current spent fuel for (the most optimistic minimum estimate) 10,000 years (the level at which radioactivity has reduced to that of the originally-mined uranium and the original basis for US standards).

What do you do with it? Where do you put it?

It is not clear that anyone has come close to solving this problem. Nuclear power has been around for half a century, this waste has been accumulating, and the nation with the most plants, the US, has no solution. There are and have been many proposals, but so far none has turned out to be workable. Most of the spent fuel is still stored on-site in pools filled with water (water is pretty good at stopping the neutrons which are the main product of radioactivity in nuclear fuel rods. You only need a few meters of it to trap all but a few which get lost in the background). No one thinks that is a solution for more than a few decades, let alone a minimum of 10,000 years. There is a movement to store as much as possible in so-called “dry casks”: sealed physical containment vessels which are self-cooling after the spent fuel has been sitting around for some number of years. But you still have to put the casks somewhere where they will be safe for a minimum of 10,000 years. Yucca Mountain in Nevada was for many years the preferred prospective location. One wonders, however, about the stability of any structure in a seismically active area of recent volcanism. Eight volcanoes have erupted within 50km of the site in the last million years (op. cit.), but maybe it’s OK for 10,000 years? That is the main point: nobody really knows. No one with a decent set of choices could reasonably choose a place in a seismically and volcanically active area. That says, correctly in my view, that there is no decent set of choices. That is the way it has been for half a century.

It is a problem in Germany also. Germany processes spent fuel in France (and soon in GB) and transports the processed product in dry casks (called “Castor”) by rail back into Germany. The transport has been regularly plagued by protests which block the rail lines, and a transport typically takes days to weeks. Protesters used to aim for Germany’s withdrawal from nuclear power. Now that the German Government has committed to that, what is the latest protest (ongoing at time of writing) about? The protesters are apparently not content with the “temporary” storage site at Gorleben in Lower Saxony (it is in an underground salt deposit, which they claim with some reason is geologically unstable over the long term) and apparently want it to be stored at a reactor site at Philippsburg, near Karlsruhe. That is unlikely to be long term (in the sense of 10,000 years) either, since most authorities judge that any long-term site must be underground, in geologically stable ground. The storage issue has not been solved in Germany, either.

What about Britain? The Independent speaks of

……..decades of uncertainty on how to deal with a growing stockpile of more than 112 tonnes of plutonium waste, was presented as a written Parliamentary statement by the energy minister, Charles Hendry.
Plutonium waste has been a headache for successive governments because it is a highly dangerous radioactive material that can be converted into weapons-grade material, making it a security risk. It’s also expensive to store.

So Britain doesn’t have a long-term solution either. Who does? (Maybe France or Japan?) What to do with the waste is a major unsolved issue with nuclear power.

According to the Independent, the “uncertainty” has gone. It’s going to be converted into “mixed oxide” (MOX) fuel. Fuel? Yes, for reactors which have not yet been built. So you solve the waste problem by building new reactors – which, um, then don’t create waste? Of course they do. You are thus using the present waste in a process which will ultimately generate even more waste, as well of course as some electricity. So, problem solved? Obviously not.

Suppose one just wants to store MOX fuel, not use it. Is it, say, less toxic than spent fuel? No. Can it be stored more easily? Not as far as I know. Can it be used somehow? Yes, in those new nuclear power plants; we’ve just been down that route.

Does this solve the nuclear-waste-product problem in any reasonable way? No. Since the UK government is full of clever people who can think at least this far, it could be that there is another explanation for this decision.

One thought. Somebody will be paid £3bn for doing it, if it happens. Money goes somewhere, and I imagine the prospective recipients might be rather keen on their share. The new waste generated by the new reactors that use the MOX fuel that came from the old waste is, well, a problem for someone who comes along later. Science will solve everything, won’t it?

But it’s not going to be clear sailing. The Independent continues:

But although Mr Hendry made it clear that the Government sees the “Mox option” as a priority, it is not certain that a new £3bn plant to convert the plutonium into Mox fuel will ever be built.

Mindful of the financial and technological disaster of the current Mox fuel plant at Sellafield in Cumbria, which has cost £1.34bn and produced a tiny fraction of the fuel it was scheduled to make, Mr Hendry said that a clear case has still to be made for a second Mox plant at Sellafield.

Oh. So the first, smaller attempt to do this kind of thing failed?

Well, let me qualify that. £1.34bn went somewhere, somebody got it for doing something, so that all went OK. But it apparently didn’t go into the ostensible goal of processing X amount of plutonium into MOX.

And on the basis of that experience apparently the best option is to try again, more and bigger?

I am sure the mistakes made in building the first reprocessing plant will all have been cataloged. I am also sure that attempts will be made assiduously to avoid them when building the second, bigger plant. I have also studied troubled large projects, indeed giving evidence before a UK Parliamentary committee on one. Many big projects fail to deliver on the goals at the time of commencement. Indeed, it’s a first for me to see someone suggest a larger second project on the back of a failed, smaller first one. Surely it should be received wisdom by now that any serious, careful estimate of the cost of such a second, bigger plant be accompanied by an equally serious, careful estimate of the likelihood of success or failure?

Given that this plan for apparently “dealing with” nuclear waste leaves all the questions open about how one ultimately deals with the waste, could something else be going on? What could it be?

First, contractors earn money for building the plant, whether it works or not, so they would be happy. Second, a current government can be seen to be “doing something” about the problem, no matter how superficial. Third, by processing and reusing fuel, the issue of what finally to do about the nuclear waste is put off into the future. (That strategy has clearly worked for governments in the past!)

Let us, though, be clear what the situation is. There is a real scientific and social problem of what on earth one can do with the highly toxic waste products of fission reactors. One cannot expect the current UK government, indeed any government at all, to implement a true solution when none is known yet to exist.

So maybe the Independent is being inappropriately forthright when it claims that uncertainty is at an end. Here is what Energy Minister Hendry actually wrote, as reported by the Independent:

“Only when the Government is confident that its preferred option could be implemented safely and securely, that is affordable, deliverable, and offers value for money, will it be in a position to proceed with a new Mox plant,” Mr Hendry said. In its response to a public consultation on Britain’s plutonium problem, the Government has not rejected other options. One is to convert the 112 tonnes of plutonium dioxide powder stored at Sellafield into glass or concrete blocks that could be buried permanently in a deep waste repository. Another is to use the plutonium directly as fuel for fast reactors, if these can be developed commercially in the coming decade.

“While converting the plutonium into Mox is the most credible and technologically mature option, the Government remains open to any alternative proposals for plutonium management that offer better value to the taxpayer, and will seek to gather more details on all options,” Mr Hendry said.

That seems less than certain to me. According to this, the UK government has set priorities on the “viable” options. It has not actually decided to do anything.

So am I (and the Independent) making a lot of fuss about not very much? Here’s a thought. We all agree that something does indeed need to be done about nuclear waste. Suppose somebody “does something”, what is it going to be? Well, it’s going to be starting to implement this “plan”, since, as the priority option, it is obviously the thing to pick if anything is to be done.

But options remain open. In case a detractor says “why on earth are you doing this? It makes no sense”, the Energy Minister can reply “only when we are confident, etc, etc, the Government remains open to any alternative proposals, etc.”

And when a sufficient amount of money has been spent, someone can say “oh look, we’ve got half a MOX plant! Well, better get on and finish it, then! Don’t like to waste money…..”

Maybe it’s just the time of year. I haven’t hung my Christmas lights either. Or maybe the UK government has been reading its seasonal literature and the nuclear contractors hired a lobbyist name of Bob Cratchit.



Assurance of Cyber-Physical Systems

17 11 2011

I attended Seminar 11441 on Science and Engineering of Cyber-Physical Systems at the Leibniz Centre for Informatics at Schloss Dagstuhl in the Saarland on 1-4 November, 2011. It was organised by Holger Giese, Bernhard Rumpe, Bernhard Schätz and Janos Sztipanovits. There is huge interest in cyber-physical systems in the US at the moment, backed by plenty of research resources, and in Germany also, although on a lesser scale, somewhat more industrially-oriented and mostly concentrated in the South, it appears.

I attached myself to the subgroup concerned with the assurance and certification of such systems.

We all seemed to have a whale of a time figuring out what a cyber-physical system (CPS) is. Tom Maibaum and others wondered how they might differ from embedded systems. People said, well, it is important that there are lots of subsystems interacting more loosely than with a hierarchically-developed complex embedded system. So John Fitzgerald wondered whether they were mostly systems of systems. (Actually, the “so” is causally misplaced. John, being an “F”, had his one-minute say before Tom, being an “M”). Social systems of mostly artificial agents, of which many examples were given, seemed to fit the “cyber-physical” conception, so CPS includes at least those. Platooning road and rail vehicles, swarms of robotic aircraft or ground robots, coordinated flying or other motion, coordinated searching tasks, and so on. There are enough examples to point and say “that’s what we mean!”.

I also learnt, once again (strange how short one’s memory can be!) to avoid uttering the phrase “emergent behavior”, at the risk of inciting a riot, or at least the closest one can come to a riot at a Dagstuhl seminar.

So what about assurance of such systems? Sadly, as I was on my way back, having had a beautiful bike ride back over the Hunsrück to Trier and caught the train, there occurred a horrendous road accident in Britain on the M5. You can read commentary about it on the York safety-critical systems mailing list. Go to The 2011 collection, sort by date, read the contributions on Sunday 6 November through Tuesday 8 November including “M5 Road Accident” in the title, or go to Paul Cleary’s initiating query and follow the thread(s) through (there are two slightly different titles, but the thread-following links persist through). I also had some private correspondence with Gérard Le Lann, who now works on road-vehicle platooning algorithms and associated questions.

As a result of the Dagstuhl discussions, and the e-mail discussions of the accident, I was able more concretely to formulate what I think is a new assurance problem which arises with (this conception of) cyber-physical systems. It is a little too long for a blog post, so I wrote it in a note called The Assurance of Cyber-Physical Systems: Auffahr Accidents and Rational Cognitive Model Checking and put it on the RVS WWW site Publications page.



The Definition of Risk – Yet Again

16 11 2011

In a message to the York Safety-Critical Systems Mailing List, Tracy White recounted a discussion with someone from the field of “Risk Management” who was taking a course he was giving on system safety. There is apparently a series of international standards, designated ISO 31000, on “Risk Management” (so says Wikipedia). Tracy says

The term ‘risk’ in 31000 is described as the ‘effect of uncertainty on objectives’ where one of the ‘effects’ can be ‘a deviation from the expected’ (4360 describes it more succinctly as: ‘a chance of something happening’). These ‘risk’ definitions differ markedly from…..

…the standard definition which has been around for 300 years and 10 months: Abraham de Moivre, De Mensura Sortis, or On the Measurement of Chance, Phil. Trans. Roy. Soc No. 329, January, February, March 1711, reprinted with a commentary by O. Hald in International Statistical Review 52(3):229-262, 1984, which may be retrieved from JSTOR. The definition given there is, in modern terms, that risk is the expected value of loss. “Expected value” is a technical term from probability. I give the word-for-word de Moivre definition below.
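In modern notation, and only as a sketch (the symbols R, L, p_i, l_i, S and p are my own labels, not de Moivre’s), the expected value of loss over outcomes i with probability p_i and loss l_i, together with the special case of a single stake S lost with probability p, can be written:

```latex
R \;=\; \mathbb{E}[L] \;=\; \sum_{i} p_i \, l_i ,
\qquad\text{and, for a single stake } S \text{ lost with probability } p, \qquad
R \;=\; S \cdot p .
```

For instance, a stake of 100 that is lost with probability 0.02 carries a risk of 100 × 0.02 = 2 in the same units as the stake. The point to notice is that R carries the units of the loss; it is not itself a probability.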

This definition is also that used for “risk” in finance. See Peter L. Bernstein, Against the Gods: The Remarkable Story of Risk, John Wiley & Sons, 1996/1998. Which book, as the publisher proudly proclaims on the cover, was a “Business Week, New York Times Business, and USA Today Bestseller” and includes praise from reviews by Galbraith, Heilbroner, the NYT, the WSJ and The Economist on its cover. (Indeed, Bernstein is where I got my original lead to de Moivre).

The meaning of the term in system safety is always close to that of de Moivre, but usually avoids the explicit arithmetic of finance, expected value of loss, by saying “combination of” likelihood and severity. There are good reasons for being somewhat vague, namely that in many cases in system safety the numbers are not there to enable a calculation of expected value. Especially, for example, in a completely new type of system. (An example I am currently working on is the recharging systems for electric road vehicles. There aren’t many around, so in particular there are no reliable numbers on frequencies of untoward things happening.) In response to this common situation, engineers have developed “qualitative” and “semi-quantitative” methods for assessing risk.

One of the issues then becomes what you take the word to mean in technical contexts. Any definition which is not equivalent to the expected value of loss defines a different concept from that, but the same word, “risk”, is used. For good reason: most definitions are conceptually related and the main issue is to get “close” while not having all the numbers.

So what do you do when some branch of human activity, indeed apparently some standard, takes the same word, “risk”, and uses it to mean something different? (I don’t actually know what “effect of uncertainty on objectives” is supposed to mean. I don’t see how “objectives” can be affected by uncertainty. I can see how your chances of attaining them are.)

Well, maybe you cite de Moivre, the finance industry, and system safety use, and say to your correspondent “you mean something different. I think that is unhelpful; and indeed our notion has historical precedence, so for the purposes of this conversation let’s use a different word for your new notion.” Or he or she could say the same to you. In any case, you agree to use two different words.

And for good measure, you write a blog post about it, as here.

This is not a new issue. Here’s a story from six and a half years ago. In the May/June 2005 issue of IEEE Software, Richard Fairley proposed a definition of risk for the Software Engineering Glossary of the IEEE (which is supposed to be canonical, although it turns out that Prof. Fairley doesn’t think so):

(Richard Fairley, proposed IEEE Software Engineering Glossary): The probability of incurring a loss or enduring a negative impact.

So a risk is to be a probability, which means all risks have values between 0 and 1. Tell that to Lehman Brothers. Well, I guess you can’t any more. Try Bear Stearns and Morgan Stanley. But we’re talking software, not money.

In common use, someone speaking to his teenager of “the risk of your not catching the bus in time” is likely talking about the chances of that event. Someone talking of “the risk that Lehman Brothers will go under” is likely also meaning the chances. But someone talking of “the risk of Lehman Brothers going under” is likely thinking of the repercussions as well as the chances. So much meaning can a relative pronoun versus a copula+gerund carry! As with any other term you wish to be a technical term, you need to decide which meaning (of, here, two) you are going to use. And stick with it. What should be clear is that software engineers working in safety-critical systems need to speak both of likelihoods or chances, and about expected levels of loss. It seems obvious to use “chance” or “likelihood” or “probability” for the former, and some other word for the latter. Since it has been called “risk” for 300 years, why not carry on doing so? And so it is. But some people choose differently. If one is then going to use “risk” to mean “likelihood”, what word does one choose to mean the combination of likelihood and severity? There is not an obvious candidate. But you do need a word for it.

I wrote to the author, Prof. Fairley, Richard Thayer, the person overall responsible for the SW Glossary, and Merlin Dorfman, I believe the IEEE editor responsible for the section, pointing out de Moivre’s definition, the definition from Nancy Leveson’s book Safeware (Addison-Wesley, 1995), and that from the standard for functional safety of E/E/PE systems, IEC 61508, which all cohere modulo the caveats above.

Here is de Moivre:

The Risk of losing any sum is the reverse of Expectation, and the true measure of it is, the product of the Sum adventured multiplied by the Probability of the Loss

Here is Nancy Leveson:

the hazard level combined with (1) the likelihood of the hazard leading to an accident… and (2) hazard exposure or duration…

[The notion of hazard level is] the combination of severity and likelihood of occurrence.

Here is IEC 61508:

combination of the probability of the occurrence of harm and the severity of that harm

I also copied my note to Fairley in this note to the York Safety-Critical Mailing List.

Dorfman agreed that the definition could be misunderstood, but that “I believe the reader is given a fair, complete, and accurate picture of the use of terminology in this area.”. “Accurate”?

What do you do if you are a software engineer working in safety-critical systems? Use the IEEE SE Glossary definition, or use the IEC 61508 definition? Use different definitions for different meetings, depending on who is there? And what happens if you misjudge your audience?

Thayer was dismissive. The entire content of his reply:

The overall title of the glossary is Software Engineering Glossary.  This covers it I believe. 

In other words, he doesn’t care much for the dilemma of the software engineer working in safety-critical systems. One could well wonder why he is editing this vocabulary if he doesn’t care about such issues.

I responded to Thayer and Dorfman:

The use in finance and in PRA of the notion of risk equates it to the expected value of loss. A partial list of standards that use some version of this notion is

* IEC 61508, the international standard on functional safety of E/E/PE safety-related systems
* IEC 300, the international standard on dependability management, in Part 3, Section 9, “Risk analysis of technological systems”
* IEEE 1228, the standard for software safety plans
* the American Institute of Chemical Engineers guidelines for safe automation of chemical processes
* US DoD MIL STD 882C, System Safety Program Requirements
* USAF Systems Command, Software Risk Abatement
* CENELEC 50129, Railway applications: Safety related electronic systems for signalling (the European norm for railways; derivative from IEC 61508)
* European Space Agency Glossary of Terms
* UK Ministry of Defence Standards 00-56, safety management requirements for defence systems; and Def Stan 00-58, HAZOP studies on systems containing programmable electronics
* German Standards Institute (DIN), DIN-V-VDE 0801, Principles for computers in safety-related systems

In particular, I expressed my concern that the IEEE as an organisation had publicly given two meanings for risk pertaining to software engineering: one in IEEE 1228 on software safety plans, and another in the Glossary proposed by Prof. Fairley. I got no response.

Prof. Fairley responded, inter alia:

Concerning my definition of risk:  In most, if not all, situations encountered in software engineering, “risk” is the composite result of numerous factors.  In the glossary, I characterize these as “risk factors,” each of which is assigned a probability and an impact (or a range of each).  Risk factors are usually interrelated (e.g., an inaccurate size estimate affects schedule, budget, memory usage; an inaccurate schedule estimate affects product quality) so overall risk (i.e., probability of suffering loss) must be calculated using conditional probabilities or Bayesian analysis.  It is not possible to characterize a situation by a simplistic pair of numbers, unless one is dealing with a narrow, well-defined situation such as a game of chance.  It is dangerous and misleading to attempt to characterize a complex situation in this way.

Given the constraints of a glossary, it was not possible to explain the rationale for my definition or why it differs from the traditional definition; nor was it possible to explain the basis of definition for the other terms in the glossary.

Which to my mind is confused. If risk is “the composite result of a number of factors” each of which is “assigned a probability and an impact”, why ignore the impact and define it as a probability? Either it is a probability simpliciter, or it is the composition of a number of items, each of which exhibits a probability and an “impact”. It can’t be both.

That was it. End of story. The section editor thinks the definition is “accurate”; the Glossary editor is unconcerned; the author is confused. No one seems to worry about the IEEE proposing two incompatible definitions of risk in software contexts.

I wrote to some colleagues I thought might be interested: Dave Parnas, John Knight and Bev Littlewood (as well as a couple of German colleagues), explaining my dissatisfaction with this state of affairs.

Dave sympathised with my frustration, which was similar to his. He said he had seen lots of examples, and that he considered trying to write a glossary for SW terms a fool’s errand, and explained why. John thought this situation to be serious, the Fairley definition of risk wrong, and deserving of public correction. He also said that many people are concerned about a lack of precision and took Dave’s comments to reflect that. Bev strongly agreed with both John and Dave. He was particularly concerned about the dismissive response.

Continuing along the same lines, here is the definition of risk from the US National Research Council study Understanding Risk: Informing Decisions in a Democratic Society (National Academies Press, 1996), p215 (you can read this study on-line):


A concept used to give meaning to things, forces or circumstances that pose danger to people or to what they value. Descriptions of risk are typically stated in terms of the likelihood of harm or loss from a hazard and usually include: an identification of what is “at risk” and may be harmed or lost (e.g., health of human beings or of an ecosystem, personal property, quality of life, ability to carry on an economic activity); the hazard that may occasion this loss; and a judgement about the likelihood that harm will occur.

So descriptions include a likelihood of harm and an identification of what may be harmed or lost. Unless you are a software engineer using the IEEE Glossary (but not IEEE 1228), in which case it’s just a number between 0 and 1.

Here is the definition from a standard text, Probabilistic Risk Assessment and Management for Engineers and Scientists, Hiromitsu Kumamoto and Ernest J. Henley, IEEE Press (them again!) 1996, a book “sponsored by the IEEE Reliability Society”, p2:

Primary Definition of Risk: A weather forecast such as “30% chance of rain tomorrow” gives two outcomes together with their likelihoods: (30%, rain) and (70%, no rain). Risk is defined as a collection of such pairs of likelihoods and outcomes:

{(30%,rain), (70%, no rain)}

So they don’t even go for the combination of likelihood and outcome, nor do they designate certain outcomes as harmful. But if you do designate certain outcomes as harmful, then you can combine these values to calculate de Moivre risk and system-safety risk from this set.
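As a rough sketch of that last step: take the Kumamoto-Henley pairs, designate some outcomes as harmful by attaching a loss to them, and sum likelihood times loss. The function name `expected_loss`, the `loss_of` table and the figure of 200 units for rain are all my own illustrative assumptions; nothing of the kind appears in the book.

```python
def expected_loss(risk_profile, loss_of):
    """Combine (likelihood, outcome) pairs into a single expected loss.

    risk_profile: list of (likelihood, outcome) pairs whose likelihoods sum to 1.
    loss_of: dict giving a loss for each outcome designated harmful;
             outcomes not listed are treated as harmless (zero loss).
    """
    total = sum(p for p, _ in risk_profile)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("likelihoods must sum to 1")
    return sum(p * loss_of.get(outcome, 0.0) for p, outcome in risk_profile)


# The weather-forecast profile from the quotation.
forecast = [(0.30, "rain"), (0.70, "no rain")]

# Assumption for illustration only: rain costs 200 units, no rain costs nothing.
loss_of = {"rain": 200.0}

print(expected_loss(forecast, loss_of))
```

Without the `loss_of` table the profile is just a description of what might happen; it is the designation of harm that turns it into a risk in the de Moivre sense.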

The standard textbook Probabilistic Risk Analysis: Foundations and Methods, Tim Bedford and Roger Cooke, Cambridge University Press, 2001 (not the IEEE for a change :-) ), discusses the definition of risk over some three pages in Section 1.2. They base their notion on that of S. Kaplan and B.J. Garrick, On the Quantitative Definition of Risk, Risk Analysis 1:11-27, 1981.

A risk analysis tries to answer the questions
(i) What can happen?
(ii) How likely is it to happen?
(iii) Given that it occurs, what are the consequences?

Kaplan and Garrick … define risk to be a series of scenarios s_i, each of which has a probability p_i and a consequence x_i. If the scenarios are ordered in terms of increasing severity of the consequences, then a risk curve can be plotted [of severity against probability of at least that level of severity]. The risk curve illustrates what is the probability of at least a certain number of casualties in a given year. Kaplan and Garrick…. further refine the notion of risk in the following way [to talk about frequency of an event instead of probability, and then uncertainty associated with a frequency]

Again, this concept is somewhat different from that of a number between 0 and 1.
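To make the risk curve concrete, here is a minimal sketch that turns a list of scenarios into the points of such a curve. It assumes, as I read the quotation, that the scenarios are mutually exclusive, so that their probabilities can simply be added; the function name `risk_curve` and the scenario figures are invented for illustration.

```python
from collections import defaultdict


def risk_curve(scenarios):
    """Return [(x, P(consequence >= x))], ordered by increasing severity x.

    scenarios: list of (probability, consequence) pairs, assumed to describe
    mutually exclusive scenarios so that probabilities add.
    """
    # Merge scenarios that share the same consequence level.
    by_severity = defaultdict(float)
    for p, x in scenarios:
        by_severity[x] += p

    # Accumulate from the most severe level downwards.
    curve = []
    exceedance = 0.0
    for x in sorted(by_severity, reverse=True):
        exceedance += by_severity[x]
        curve.append((x, exceedance))
    curve.reverse()
    return curve


# Hypothetical scenarios: (probability per year, number of casualties).
scenarios = [(0.05, 10), (0.01, 100), (0.001, 1000)]
for casualties, prob in risk_curve(scenarios):
    print(f"P(at least {casualties} casualties) = {prob:.3f}")
```

Each point still pairs a probability with a severity, which is exactly the information the single-number definition throws away.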

John suggested I contact the then-editor of IEEE Software, Warren Harrison, which I did. Warren suggested that the appropriate action would be a letter to the editor, allowing the author and the section and glossary editors to respond if they wished.

I never did so. I regret it.

So six and a half years later, here I am writing a blog post on it. I doubt the issue will go away. Neither will this note. I do think the IEEE should work to get its definitional house in order.



Ensuring Safety Requirements Fulfilment in Possibly-Imperfect Software

16 10 2011

Ludi Benner just asked me privately about the feasibility of dumping stack traces from operating SW in flight. I concluded that it is not a very practical idea for a number of reasons. First, there is a lot of it. Second, you can’t analyse them for every flight, because there aren’t human resources for it, and no automatic tools which can detect coding errors from stack traces. Third, even if you analysed them in the case of an accident, there has as yet been no accident in which coding error was suspected (although there have been accidents and incidents in which requirements or design failure of computer-based systems was a causal factor), so even had they been available, no one would have needed to look at them.

Looking at stack traces is also primarily a measure for assessing software quality. You can tell from a stack trace maybe whether the SW was doing what it was required/designed to do, and thus detect coding errors. But in safety-critical systems you are not interested primarily in deviation from requirements in general, you are interested primarily in deviation from the safety requirements.

There is a general method for formulating safety requirements:
(I) identify hazards (however you might define them), and
(II) then formulate a safety requirement per hazard H as S.H = [either avoid hazard H or exit out of hazard H within Q.H time period], and
(III) define the safety requirements as ( /\/\ S.H), the conjunction being taken over all hazards H
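Steps (II) and (III) can be sketched as a check over a recorded, time-stamped trace of states. This is only an illustration under my own assumptions: the trace is a finite list of samples, an exit from H is taken to happen at the first sample observed outside H, and the names `satisfies_S`, `satisfies_all`, the “overspeed” hazard and its bound are hypothetical.

```python
def satisfies_S(trace, in_hazard, q):
    """S.H on a finite trace: H is avoided, or each stay in H is exited within q.

    trace: list of (timestamp, state) samples in increasing time order.
    in_hazard: predicate on a state, True when the state lies in hazard H.
    q: the bound Q.H on the time allowed between entering and leaving H.
    """
    entered = None  # timestamp at which the current stay in H began
    for t, state in trace:
        if entered is None:
            if in_hazard(state):
                entered = t
        else:
            if t - entered > q:
                return False  # no exit was observed within Q.H
            if not in_hazard(state):
                entered = None
    # A stay in H still open at the end of the trace is not counted as a violation.
    return True


def satisfies_all(trace, hazards):
    """The conjunction of S.H over all hazards; hazards maps name -> (predicate, Q.H)."""
    return all(satisfies_S(trace, pred, q) for pred, q in hazards.values())


# Hypothetical hazard: speed above 100, to be exited within 2.0 time units.
hazards = {"overspeed": (lambda s: s["speed"] > 100, 2.0)}
trace = [
    (0.0, {"speed": 90}),
    (1.0, {"speed": 105}),   # enters H
    (2.5, {"speed": 95}),    # exits H after 1.5, within the bound
    (3.0, {"speed": 110}),   # enters H again
    (6.0, {"speed": 110}),   # still in H after 3.0, bound exceeded
]
print(satisfies_all(trace[:4], hazards), satisfies_all(trace, hazards))
```

A check of this kind says nothing about whether the hazard list itself is complete; that is a separate question about the requirements.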

Nancy Leveson defines hazards as states of the system such that……… (Leveson, Safeware, Addison-Wesley 1995, Chapter 9). Others speak of states of the system+environment such that ……  or events such that…… (see for example Chapter 4 of my 2001 on-line book, and Chapter 5, and this set of definitions from Causalis).

Let’s use the Leveson definition.

The fundamental insight is this. Suppose you have relatively complete safety requirements (the definition from an earlier blog post). Then you can insert monitoring SW to look at SW-state and detect hazard states H when they occur (this can be achieved by techniques for run-time verification – I shall call it here logical monitoring), and then you trap to SW which either exits or mitigates H with worst-case execution time (WCET) Q.H.

For this to work unfailingly, the following conditions have to be fulfilled:

(a) your safety requirements are relatively complete,
(b) the hazard-detection is perfect (a “perfect oracle”), and
(c) the (H-exit or H-mitigation) SW is perfect.

This suffices to ensure that the SW does not engender dangerous behavior. Assumption (a) ensures that fulfilment of the safety requirements suffices to avoid dangerous behavior. Assumptions (b) and (c) ensure the safety requirement associated with hazard H is fulfilled. The assumption of perfection for the detection SW in (b) and the avoidance/mitigation software in (c) is critical. As is the condition that, when the perfect oracle detects the presence of H, the trap to the avoidance/mitigation software is also perfect. However, such traps are HW-based and a failure of such a trap could occur due to a HW problem. (Of course, the victims are unlikely to care whether a trap failure is classified as HW or SW).
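The shape of such a monitor can be sketched in a few lines. This is purely illustrative and rests on assumptions of mine: the oracle for each hazard is a predicate on a state dictionary, the “trap” is an ordinary function call rather than a HW trap, and nothing here bounds execution time, so the WCET condition is not represented at all. The class `LogicalMonitor` and the overpressure example are hypothetical.

```python
class LogicalMonitor:
    """Checks each new SW state against registered hazard oracles."""

    def __init__(self):
        self._handlers = []  # list of (hazard name, oracle, mitigation)

    def register(self, name, oracle, mitigate):
        """oracle(state) -> bool detects H; mitigate(state) exits or mitigates H."""
        self._handlers.append((name, oracle, mitigate))

    def step(self, state):
        """Call after each state change; returns the names of hazards trapped."""
        trapped = []
        for name, oracle, mitigate in self._handlers:
            if oracle(state):
                mitigate(state)  # stands in for the trap to mitigation SW
                trapped.append(name)
        return trapped


# Hypothetical hazard: high pressure while the relief valve is closed.
def overpressure(state):
    return state["pressure"] > 8.0 and not state["relief_valve_open"]


def open_relief_valve(state):
    state["relief_valve_open"] = True


monitor = LogicalMonitor()
monitor.register("overpressure", overpressure, open_relief_valve)

state = {"pressure": 9.5, "relief_valve_open": False}
trapped = monitor.step(state)
print(trapped, overpressure(state))
```

Conditions (b) and (c) correspond to `overpressure` and `open_relief_valve` being correct for every reachable state, which a sketch like this can illustrate but in no way establish.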

Logical monitoring has, I propose, reached the point at which (b) and (c) are practical. John Rushby notified me of this in December 2009, pointing me to a brief survey of his which I found helpful. I don’t belong to the run-time verification “community”, although I knew about it in general (Manuel Blum and others at Berkeley whom I know had been working on it theoretically a couple of decades ago). So I am proposing it as it were on hearsay rather than through personal experience. It seems to me to be plausible that one can synthesise perfect oracles as well as perfect avoidance/mitigation software.

Such software added to safety-critical SW would be possibly-perfect software in the Strigini sense, that is, software which you would like to be perfect, which you have good reason to think is perfect, and the question is mainly the confidence you have in your judgement that it is. Possibly-perfect software and its use in achieving demonstrably-ultrahigh-reliability software has been recently discussed in Littlewood and Rushby’s forthcoming paper in the IEEE Transactions on Software Engineering which I think is a landmark paper.

There are then two questions.

One is assessing the level of confidence we could have that such logical-monitoring software is indeed perfect, and how that would affect the level of confidence we have in the exceptionless fulfilment of the safety requirements for the otherwise possibly-imperfect SW in which this logical monitoring is inserted. I suspect that techniques such as exhibited in the Littlewood-Rushby paper are applicable.

The second question is also twofold:

(i) whether, for every hazard H, it is the case that a safety requirement of the form S.H = [either avoid hazard H or exit out of hazard H within Q.H time period] suffices to avoid all dangerous consequences of H, and
(ii) whether it is possible to produce such avoidance/mitigation SW with WCET less than Q.H

Concerning (i), logical monitoring SW cannot help in avoiding H. It detects when H is present (step (b) ). So if harmful consequences of H can occur within a shorter time period than it could possibly take to detect, trap and exit H, this approach cannot be guaranteed to fulfil the safety requirements for the SW. However, in such a case I suspect very much that the software should be redesigned to ensure the avoidance of H. Since H is a SW state, I see no reason why this should not be generally possible.