Risk

12 01 2016

There are a few different notions of risk used in dependability engineering.

One notion, used in finance and in engineering safety, is from De Moivre (1712, De Mensura Sortis, in the Philosophical Transactions of the Royal Society) and is


(A) the expected value of loss (people in engineering say “combination of severity and likelihood”).

A second notion, used in quality-control and project management, is


(B) the chances that things will go wrong.

Suppose you have a €20 note in your pocket to buy stuff, and there is an evens chance that it will drop out of your pocket on the way to the store. Then according to (A) your risk is an expected loss of €10 (= €20 x 0.5) and according to (B) your risk is 0.5 (or 50%). Notice that your risk according to (A) has units which are the units of loss (often monetary units) whereas your risk according to (B) has no units, and is conventionally a number between 0 and 1 inclusive.
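A minimal sketch in Python of the €20 example, just to make the two notions concrete (the variable names are mine):

```python
# Illustrative sketch of the two notions of risk for the 20-euro example.
loss = 20.0    # the loss if things go wrong (euros)
p = 0.5        # the chance that things go wrong

risk_A = loss * p   # notion (A): expected value of loss, in euros
risk_B = p          # notion (B): probability of loss, dimensionless

print(f"Risk under (A): EUR {risk_A:.2f} of expected loss")
print(f"Risk under (B): {risk_B:.0%} chance of loss")
```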

(A) and (B) are notions 2 and 3 in the Wikipedia article on Risk, for what it’s worth.

The International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC) put out guides to the inclusion of common aspects in international standards. One is on Safety Aspects (Guide 51, 2014 edition) and one is on Risk Management (Guide 73, 2009 edition). The Guide 51 definition of risk is the combination of probability of occurrence of harm and the severity of that harm, where harm is injury or damage to the health of people, or damage to property or the environment. The Guide 73 definition of risk used to be chance or probability of loss, i.e. (B), but has changed in the 2009 edition to the effect of uncertainty on objectives.

The 2013 edition of ISO/IEC 15026 Systems and Software Engineering – Systems and Software Assurance, Part 1: Concepts and Vocabulary (formally denoted ISO/IEC 15026-1:2013), defines risk to be the combination of the probability of an event and its consequence, so (A).

The IEEE-supported Software Engineering Body of Knowledge (SWEBOK) says, in Section 2.5 on Risk Management,


Risk identification and analysis (what can go wrong, how and why, and what are the likely consequences), critical risk assessment (which are the most significant risks in terms of exposure, which can we do something about in terms of leverage), risk mitigation and contingency planning (formulating a strategy to deal with risks and to manage the risk profile) are all undertaken. Risk assessment methods (for example, decision trees and process simulations) should be used in order to highlight and evaluate risks.

Notice that “what can go wrong” is hazard identification; “how and why” is analysis; and “what are the likely consequences” is severity assessment, also part of hazard analysis. What is missing here is an assessment of likelihood, which is common to both (A) and (B), the Guide 51 definition and the Guide 73 definition.

ISO/IEC 24765:2010 Systems and Software Engineering – Vocabulary defines risk to be


1. an uncertain event or condition that, if it occurs, has a positive or negative effect on a project’s objectives. A Guide to the Project Management Body of Knowledge (PMBOK® Guide) — Fourth Edition.
2. the combination of the probability of an abnormal event or failure and the consequence(s) of that event or failure to a system’s components, operators, users, or environment. IEEE Std 829-2008 IEEE Standard for Software and System Test Documentation.3.1.30.
3. the combination of the probability of an event and its consequence. ISO/IEC 16085:2006 (IEEE Std 16085-2006), Systems and software engineering — Life cycle processes — Risk management.3.5; ISO/IEC 38500:2008, Corporate governance of information technology.1.6.14.
4. a measure that combines both the likelihood that a system hazard will cause an accident and the severity of that accident. IEEE Std 1228-1994 (R2002) IEEE Standard for Software Safety Plans.3.1.3.
5. a function of the probability of occurrence of a given threat and the potential adverse consequences of that threat’s occurrence. ISO/IEC 15026:1998, Information technology — System and software integrity levels.3.12.
6. the combination of the probability of occurrence and the consequences of a given future undesirable event. IEEE Std 829-2008 IEEE Standard for Software and System Test Documentation.3.1.30

ISO/IEC 24765 thus acknowledges that there are different notions doing the rounds.

The Systems Engineering Body of Knowledge (SEBoK) says in its wiki page on Risk Management that


Risk is a measure of the potential inability to achieve overall program objectives within defined cost, schedule, and technical constraints. It has the following two components (DAU 2003a):

the probability (or likelihood) of failing to achieve a particular outcome
the consequences (or impact) of failing to achieve that outcome

which is a version of (A).

What are the subconcepts underlying (A) and (B), and other conceptions of risk?

(1) There is vulnerability. Vulnerability is the hazard, along with the damage that could result from it, and the extent of that damage; this is often called “severity”. So: hazard + hazard-severity. This is close to Definition 1 of ISO/IEC 24765.
(2) There is likelihood. This can be likelihood that the hazard is realised (assuming worst-case severity) or likelihood that a specific extent of damage will result. This is only meaningful when events have a stochastic character. This is (B), the former definition in ISO/IEC Guide 73, and item 3 in the Wikipedia list.

If you have (1) and (2), you have (A) and (B). If you have (A) and (B), you have (2) (=B) but you don’t have (1). But (1) is what you need to talk about security, because security incidents do not generally have a stochastic nature.

Terje Aven, in his book Misconceptions of Risk, argues (in Chapter 1) that even notion (A) is inadequate to capture essential aspects of risk. He attributes to Daniel Bernoulli the observation that utility is important: just knowing expected value of loss is insufficient to enable some pertinent decisions to be made about the particular risky situation one is in.

A third subconcept underlying risk is that of uncertainty. Aven has argued recently that uncertainty is an appropriate replacement for probability in the notion of risk. Uncertainty is related to what one knows, to knowledge, and of course the Bayesian concept of probability is based upon evaluating relative certainty/uncertainty.

It is worthwhile to think of characterising risk in terms of uncertainty where traditional probability is regarded as inappropriate. However, there are circumstances in which it can be argued that probabilities are objective features of the world; quantum-mechanical effects, for example. And if a system operates in an environment of which the parameters pertinent for system behavior have a stochastic nature, no matter how much of this is attributable to a lack of knowledge (a failure to observe or understand causal mechanisms, for example) and how much to objective variation, such probabilities surely must play a role as input to a risk assessment.



Water and Electricity

28 12 2015

We do know that they don’t mix well.

In an article in the Guardian about the floods in York, I read about the flood barrier on the River Foss that


Problems arose at the weekend at the Foss barrier and pumping station, which controls river levels by managing the interaction between the rivers Foss and Ouse. In a model that is commonplace around the country, pumps behind the barrier are supposed to pump the water clear. The station became inundated with floodwater after the volume exceeded the capacity of the pumps and flooded some of the electrics, according to an Environment Agency spokesperson, who said that a helicopter was due to airlift in parts to complete repairs on Monday.

It is particularly ironic that flood-control measures are rendered ineffective through flooding of their controls. But it’s not a one-off.

At the beginning of this month, December 6, much of the city of Lancaster (reportedly 55,000 people) was left without power when an electricity substation in Caton Road was flooded in a previous storm.

Here is Roger Kemp’s take on the Lancaster substation affair.

In March 2011, when the tsunami resulting from the Tohoku earthquake flooded the Fukushima Daiichi nuclear power station, the electrics for the emergency backup generators were also awash. If I remember correctly, in the US some of the Mark I BWRs had been modified so that the electrics controlling the emergency power generation were installed higher up in the buildings than the basement, where they still were at Fukushima Daiichi. The bluff on which the power station was built had also been lowered by 15m during construction to enable easier access from the seaward side.

I’ll leave it to readers to connect the dots. The question is whether the resources will be made available in the UK for review of the placement of critical electrics, and for prophylaxis. And also what we can do as members of professional societies for electrotechnology to encourage those resources to be mobilised, in the UK and elsewhere – big cities in Germany such as Hamburg and Dresden have been flooded in recent years.

I suspect that some measures would be relatively simple to implement, for example fitting effective seals to the accessways of vulnerable substations and other critical installations. Maybe one could seal at ground level permanently, and install sealed doors two meters up, with steps on both sides? And install an effective pump for whatever might nevertheless leak through a seal, started by a float-activated switch and powered from a sealed battery. And so on.

As societies, we are becoming more dependent on electricity as an essential component of living, and there are plans to become even more so. This leads to vulnerabilities which I believe we haven’t yet thoroughly considered.

When I was a child, house heating came from burning coal, coke or occasionally wood. If there was an electricity cut, you could still heat your house. Nowadays, almost all building heating is electrically controlled, even fancy wood-pellet-burning stoves, which may be connected to the circulating heating water. Take out the electricity and you take out the heating too, nowadays.

According to EU statistics, in 2013 11.8% of inland energy consumption in the EU-28 was from renewable resources, and in the same year 25.4% of electricity was generated from renewable resources. This suggests that less than half of energy consumption in the EU-28 is via electricity (the renewables used to generate electricity are at most all the renewables, so electricity can account for at most 11.8/25.4 ≈ 46% of consumption); much of the rest will be transportation, I suppose. Transportation’s use of energy from renewable resources was only 5.4% in 2013. There is scope for change – everyone seems to be thinking about electric road vehicles (ERVs).

I doubt whether the infrastructure exists to supply appropriate amounts of electricity for recharging ERVs if they constituted a large proportion of vehicle use, and I am not alone. The RAEng suggested in 2010 that current supply could be “overwhelmed” (Roger Kemp chaired the committee which produced the report.)

Amongst the issues is the quality of electrical infrastructure. The German electrotechnical industry association ZVEI pointed out some years ago that 70% of building electrical installations have outlived their design lifetime of 30-35 years and are still in operation; also that 50 years ago there were typically 6-8 electrical devices in the average household, and now there are typically more than 70. In the presentation in which these figures appear, they were more worried about the functional safety of the installations, in particular fire risk. Malfunctioning electrics cause 15-20% of all building fires in Germany, they say. If I remember rightly, about ten times as many people die per year in building fires caused by electrical malfunction as die from electrocution: 200 as compared with 15-20. I don’t recall anything in the presentations I have seen on vulnerabilities to flooding.

When York, Lancaster and Leeds have streets lined with charging points for ERVs, I hope those points are adequately protected from floods. When half the cars along a street are electric, and the street floods to a meter depth, what is going to happen to and around those cars? Would you touch one after the floods recede? Recall there is enough stored energy in a fully-loaded vehicle to power your average Western house for a few days.
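A back-of-the-envelope check of that last claim, with assumed figures of my own (a large traction battery of around 75 kWh and household consumption of around 15 kWh per day), not sourced data:

```python
# Back-of-the-envelope check: assumed figures, not measurements.
battery_kwh = 75.0            # assumed capacity of a large EV traction battery
household_kwh_per_day = 15.0  # assumed average daily household electricity use

days = battery_kwh / household_kwh_per_day
print(f"A {battery_kwh:.0f} kWh battery covers roughly {days:.0f} days of household use")
# -> roughly 5 days, consistent with "a few days"
```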

I spent a couple of years on and around German standardisation committees on ERVs. In all the meetings, I don’t recall questions concerning effects of submersion ever arising. I think they should be considered.



Kissinger on SDI and the Soviet Collapse

13 11 2015

I’ve been reading Henry Kissinger’s “summation” of international relations, World Order, which is as interesting and insightful as people have said.

He says of SDI that

[Reagan] challenged the Soviet Union to a race in arms and technology that it could not win, based on programs long stymied in Congress. What came to be known as the Strategic Defence Initiative – a defensive shield against missile attack – was largely derided in Congress and the media when Reagan put it forward. Today it is widely credited with convincing the Soviet leadership of the futility of its arms race with the United States.

He says later,

…without Reagan’s idealism – bordering sometimes on a repudiation of history – the end of the Soviet challenge could not have occurred amidst such a global affirmation of a democratic future.

By “Reagan’s idealism”, Kissinger explicitly means the idea of the “shining city on a hill”, which he says “was not a metaphor for Reagan; it actually existed for him because he willed it to exist.”

Kissinger uses the “key people in positions of power” theory of the mechanisms of international relations while explaining the continuity of US foreign policy from Nixon through Ford, Carter and Reagan. Such an assertion of continuity might surprise those who were actually present during the period, but Kissinger’s argument for it is coherent, as one might expect.

Kissinger hedges his point about SDI by not actually appropriating it – he says “widely credited“, and that is correct, I think. But that doesn’t mean it’s fact.

Let me propose an alternative view, in which it was one of two major factors (amongst a plethora of others).

George Kennan foresaw how things would progress in 1947. It might be said that his view, more widely spread, established the Cold War and predicted its denouement. It had been clear for a long time by the mid-1980s that US productivity, when channelled into military spending, could outrun that of the Soviet Union in the long term, but no one knew how long that term would be. I seem to recall some reports that the Soviets were putting 40% of their productivity into military kit, and for all anyone knew maybe they could raise that to 60%, because it could have been seen as more important than feeding people. Whereas there was no appetite in the US for even 20% spending on the military, after the Vietnam war.

SDI was in the first place an escalation of resource consumption. It wasn’t based on a Reagan decision alone; it was based more generally on fantasy in the US military, of which there was a plentiful supply. I remember an eminent colleague in the mid-80s recounting a meeting with a USAF general officer whose vision consisted of a helmet which could read and execute the thoughts of a fighter pilot: “fly there, do that, shoot that; I just THINK about it and it happens”. Thirty years later, bits of that have been implemented. Whereas the SDI vision is having trouble achieving even a 50% success rate in one-on-one anti-ICBM-missile trials, according to the table in this 2014 article. Now, I suspect well-grounded Soviet military technologists knew as well as well-grounded US military technologists that SDI at that point in the 1980s was fantasy. The arguments are not hard; they were, as expressed by David Parnas, convincing, true and public. Some people in the Soviet Union surely must have known that SDI was bluff.

So what was SDI’s role in the Soviet collapse? I suggest it may have been half of it. The other half was Reagan suggesting directly to Gorbachev that both sides could just scrap their nuclear missiles, and meaning it. The Soviet leadership realised they were playing with someone who was far wealthier, who could more or less bet anything he pleased at any point in the game, at whim. If you’re on welfare, and you’re playing poker with a millionaire who has just spent €10,000 in front of you on a tie because he didn’t like the one he was wearing, and he’s offering at the same time to stop the game, it’s not clear what you should best do but stopping right now must seem an attractive option.

And, of course, if Kennan was right, which apparently everyone now thinks he was, then the collapse would have happened anyway, with or without SDI. But it might have taken a bit longer. Then of course there was that bit about taking down a wall in Berlin that might have had something to do with it.



The Accident to SpaceShip Two

3 08 2015

Alister Macintyre noted in the Risks Forum 28.83 that the


US National Transportation Safety Board (NTSB) released results of their investigation into the October 31, 2014 crash of SpaceShipTwo near Mojave, California.

The NTSB has released a preliminary summary, findings and safety recommendations for the purpose of holding the public hearing on July 28, 2015. All those may be modified as a result of matters arising at the hearing. This is standard procedure for the Board.

Their summary of why the accident happened is


[SpaceShip2 (SS2)] was equipped with a feather system that rotated a feather flap assembly with twin tailbooms upward from the vehicle’s normal configuration (0°) to 60° to stabilize SS2’s attitude and increase drag during reentry into earth’s atmosphere. The feather system included actuators to extend and retract the feather and locks to keep the feather in the retracted position when not in use.

After release from WK2 at an altitude of about 46,400 ft, SS2 entered the boost phase of flight. During this phase, SS2’s rocket motor propels the vehicle from a gliding flight attitude to an almost-vertical attitude, and the vehicle accelerates from subsonic speeds, through the transonic region (0.9 to 1.1 Mach), to supersonic speeds. ….. the copilot was to unlock the feather during the boost phase when SS2 reached a speed of 1.4 Mach. …. However, …. the copilot unlocked the feather just after SS2 passed through a speed of 0.8 Mach. Afterward, the aerodynamic and inertial loads imposed on the feather flap assembly were sufficient to overcome the feather actuators, which were not designed to hold the feather in the retracted position during the transonic region. As a result, the feather extended uncommanded, causing the catastrophic structural failure.

This, the Board notes, represents a single point of catastrophic failure which could be, and in this case was, instigated by a single human error.

A hazard analysis (HazAn) is required by the FAA for all aerospace operations it certifies. It classifies effects into catastrophic, hazardous, major, minor and “no safety effect”, and certification (administrative law) requires that the probability of events in certain classes is ensured to be sufficiently low, through avoidance or mitigation of identified hazards.
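For orientation, here is a sketch of how such a classification pairs severity classes with maximum acceptable probabilities. The numerical targets are those conventionally cited for transport-category aircraft certification, per flight hour; I do not know which figures the FAA applies to commercial space operations, so treat them purely as illustration:

```python
# Illustrative severity classes with probability targets conventionally cited
# for transport-aircraft certification (per flight hour). The figures for
# commercial space operations may well differ.
PROBABILITY_TARGETS = {
    "catastrophic": 1e-9,
    "hazardous":    1e-7,
    "major":        1e-5,
    "minor":        1e-3,
    "no safety effect": 1.0,   # no quantitative requirement
}

def acceptable(severity: str, estimated_prob_per_hour: float) -> bool:
    """Crude check: is the estimated probability below the target for its class?"""
    return estimated_prob_per_hour <= PROBABILITY_TARGETS[severity]

print(acceptable("catastrophic", 1e-10))  # True
print(acceptable("hazardous", 1e-6))      # False
```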

HazAn is a matter of anticipating deleterious events in advance. The eternal questions for HazAn are:

  • Question 1. Did you think of everything? (Completeness)
  • Question 2. Does your mitigation/avoidance really work as you intend?

These questions are very, very hard to answer confidently. Imperfect HazAns are almost inevitable in novel operations. In aviation, sufficient experience has accumulated over the decades to ensure that the HazAn process fits the standard kinds of kit and operations and the answers to the questions are to a close approximation yes-yes. In areas in which there is no experience, for example use of lithium-ion batteries for main and auxiliary electric-power storage in the Boeing 787, answers appeared to be no-no. In commercial manned spaceflight, there is comparatively a tiny amount of experience. Certification of a new commercial transport airplane takes thousands of hours. Problems are found and usually fixed. SS1 and SS2 have just a few hours in powered spaceflight so far.

As soon as the accident happened it was almost inevitable that the answer to either Question 1 or Question 2 was “no”. The NTSB summary doesn’t actually tell us whether it was known that unlocking the booms too early would overstress the kit, but given Scaled Composites’ deserved reputation, as well as the strong hint from the NTSB that human factors were not sufficiently analysed, I would guess that the answer is yes; and the answer to Question 2 is partially no: the mitigation works unless the pilot makes an error under the “high workload” (performing many critical tasks under physical and cognitive stress) of transonic flight.

I emphatically don’t buy Macintyre’s suggestion that anyone “cut corners” on test pilot training and HazAn.

These are brand-new operations with which there is very little experience and which (contrary to marketing) are inevitably performed at higher risk than operations with thousands or millions of hours of accumulated experience. Nobody, in particular no one at Scaled, messes around in such circumstances. Scaled has a well-deserved reputation over three decades for designing radically new aerial vehicles to enviably high standards of safety. But things do sometimes go wrong. Voyager scraped a wingtip on takeoff and nearly didn’t make it around the world (they had 48kg of fuel remaining when they landed again at Edwards after nine days of flight in December 1986, enough only for a couple of hours more). Three people were killed during a test of a rocket system in 2007 which was based on a nitrous oxide oxidiser, apparently a novel technology. OSHA investigated. An example of some public commentary is available from Knights Arrow. Scaled has been owned by Northrop Grumman since 2007 (before the rocket-fuel accident). And now a test pilot has lost his life and the craft by performing an action too early.

It may be more apt to note that, like many such analyses of complex systems with proprietary features, the HazAn for WK2/SS2 space operations is substantial intellectual property, whose value will increase thanks to the NTSB’s suggestions on how to improve it.

The purpose of the NTSB’s investigation is to look minutely at all the processes that enabled the accident and to suggest improvements that would increase the chances of a yes-yes pair of answers to the HazAn questions, as well as improvements to all other aspects of safety. They said the human factors HazAn could be improved. Since human error was presumed to be the single point of failure, that conclusion was all but inevitable. The NTSB also suggested continuity in FAA oversight – the FAA flight-readiness investigation was carried out by different people for each flight, so there was reduced organisational learning. It also made some other suggestions about improving the efficacy of oversight and of organisational learning, such as the mishap database. And the NTSB suggested proactive emergency readiness by ensuring a rescue craft is on active standby (it usually was, but this wasn’t the case for the accident flight).

One wonders what else in the HazAn isn’t quite right. There are plenty of places to look (witness the Knights Arrow report above on the fuel choice). It doesn’t mean the HazAn is bad. But it will be improved. And improved, all with the goal of getting to yes-yes.



Volvo Has An Accident

5 06 2015

……. but not the one you thought!

Jim Reisert reported in Risks 28.66 (Volvo horrible self-parking car accident) on a story in fusion.net on 2015-05-26 about a video of an accident with a Volvo car, apparently performing a demo in the Dominican Republic. The fusion.net story is by Kashmir Hill. Hill says “….[the video] is terrifying”. The video is linked/included in the piece.

The video shows a Volvo car in a wide garage-like area, slowly backing up, with people standing around, including in front of the vehicle. The car stops, begins to move forward in a straight line, accelerates, and hits people who did not attempt to move out of the way. Occupants are clearly visible in the car. The video is about half a minute long.

I didn’t find it terrifying at all. At first glance, I found it puzzling. Why didn’t people move out of the way? They had time.

Fusion reports comments from Volvo. I looked the story up using Google. Lots of articles, many of them derivative, and a reference to Andrew Pam’s corrective comment in Risks 28.67. From the better articles (in my judgement), one would crudely understand:

  • The car was being driven. What you see is not automatic.
  • It wasn’t a demo of self-parking. It was a purported demo of a collision-avoidance function.
  • The other-car collision-avoidance function is standard; the pedestrian-collision-avoidance function is an optional extra.
  • The demo car was not equipped with this optional function.

However, many of the articles still have “self-parking” in the headline or as part of the URL, and journalists asked why other-car collision-avoidance is standard, but pedestrian-collision-avoidance an optional extra. Surely, some journalists expect us to conclude, it would be more reasonable the other way around?

What Volvo actually said in response to journalists’ queries seems to be reasonable (see below). But they appear not to be controlling the narrative, and that is their accident. The narrative appears to be that they have a self-parking car which may instead accelerate into passers-by unless it is equipped with a $3,000 extra system to avoid doing so. And this is demonstrated on video. And this narrative is highly misleading.

Other-car/truck detection and avoidance is nowadays relatively straightforward. These objects are big and solid, have lots of metal and smooth plastic which reflects all kinds of electromagnetic and sound waves, and they behave in relatively physically-limited ways. People, on the other hand, are soft and largely non-metallic, with wave-absorbent outer gear, and indulge in, ahem, random walks. It’s a harder detection problem, and it is thereby much harder to do it reliably – you need absolutely no false negatives, and false positives are going to annoy driver and occupants. Such kit inevitably costs something.

But there is a laudable aspect to this commentary. Some, even many, journalists apparently think that pedestrian-collision avoidance systems should be standard, and are more important than other-car collision avoidance. I wish everybody thought like that!

Ten years ago, almost nobody did. I recall an invited talk by a senior staff member of a major car company at the SAFECOMP conference in Potsdam in 2004, about their collision-avoidance/detection/automatic-communication-and-negotiation systems and research. 45 minutes about how they were dealing with other vehicles. I asked what they were doing about pedestrians and bicycles. A 5-second reply: they were working on that too.

Pedestrians are what the OECD calls “vulnerable road users”. While accident rates and severities have been decreasing overall for some years, accident rates and severities for vulnerable road users have not – indeed, in some places they have been increasing. Here is a report from 17 years ago. The Apollo program, a joint effort of the WHO and the EU, published a policy briefing ten years later (2008).

I am mostly a “vulnerable road user”. I have no car. My personal road transport is a pedelec. Otherwise it’s bus or taxi. Bicycle and pedelec road users need constantly to be aware of other road users travelling too fast for the conditions and posted speed limits, too close to you, and about to cut you off when you have right of way. As well as occasional deliberately aggressive drivers. All of which is annoying when you’re sitting inside another well-padded and designedly-collapsible shell, but means serious injury or death if you’re not.

I am all for people thinking that vulnerable-road-user detection and avoidance systems should be standard equipment on automotive road vehicles.

There are similar reports to the one in Fusion elsewhere as well. I like Caldwell’s Slashgear article far more than the others.

Andrew Del-Colle deals out a lengthy corrective in both Road & Track and in Popular Mechanics.

Three Volvo spokespeople are quoted in these articles: Johan Larsson (Fusion, and derivatively The Independent), Stefan Elfstroem (Slashgear and Money) and Russell Datz (Daily Mail). Volvo’s comment is approximately:

  • The car was equipped with a system called “City Safe” which maintains distance from other cars.
  • City Safe also offers a pedestrian-detection system, which requires additional equipment and costs extra money
  • The car was not equipped with this additional system
  • The car appears to be performing a demo. It is being driven.
  • The demo appears to be that of City Safe, not of the self-parking function.
  • The car was apparently being driven in such a way that neither of these systems was operational: the human driver accelerates “heavily” forwards.
  • When an active driver accelerates forwards like this, the detection-and-braking functions are not active – they are “overridden” by the driver command to accelerate
  • Volvo recommends never to perform such tests on real humans

All very sensible.

One major problem which car manufacturers are going to have is that, with more and more protective systems on cars, there are going to be more and more people “trying them out” like this. Or following what John Adams calls “risk homeostasis”, by driving less carefully while relying on the protective functions to avoid damage to themselves and others. I am also sure all the manufacturers are quite aware of this.



Cybersecurity Vulnerabilities in Commercial Aviation

18 04 2015

The US Government Accountability Office has published a report into the US Federal Aviation Administration’s possible vulnerabilities to cyberattack. One of my respected colleagues, John Knight, was interviewed for it. (While I’m at it, let me recommend highly John’s inexpensive textbook Fundamentals of Dependable Computing for Software Engineers. It has been very well thought through and there is a lot of material which students will not find elsewhere.)

None of what’s in the report surprises me. There are three main points (in the executive summary).

First, the GAO suggests the FAA devise a threat model for its ground-based ATC/ATM systems. (And, I presume, that the FAA respond to the threat model it devises.) I am one of those people who consider it self-evident that threat models need to be formulated for all sorts of critical infrastructure. One of the first questions I ask concerning security is “what’s the threat model?”. If the answer is “there isn’t one” then can anybody be surprised that this is first on the list?

Lots of FAA ground-based systems aren’t geared to deal with cybersecurity threats – many of them are twenty or more years old and cybersecurity wasn’t an issue in the same way it is coming to be. Many systems communicate over their own dedicated networks, so that would involve a more or less standard physical-access threat model. But many of them don’t. Many critical inter-center communications are carried over public telephone lines and are therefore vulnerable to attacks through the public networks, say on the switches. Remember when an AT&T 4ESS switch went down in New York almost a quarter century ago? I can’t remember if it was that outage or another one during which the ATCOs called each other on their private mobiles to keep things working. A human attacker trying to do a DoS on communications would probably try to take out mobile communications also. (So there’s the first threat for the budding threat model – a DoS on communications.)

If the FAA don’t want to do a model themselves, couldn’t they just get one from a European ally and adapt it? The infrastructures aren’t that dissimilar on the high level and anything would be a help initially.

Second, when the FAA decided they were OK with the manufacturer putting avionics and passenger in-flight entertainment (IFE) data on the same databuses on the Boeing 787, many of us thought this premature and unwise and said so privately to colleagues (one of them even found the correspondence). We have recently had people claim to be able to access critical systems through the IFE (see below). I have reported on one previous credible claim on vulnerabilities in avionics equipment.

The GAO is suggesting that such configurations be thought through a little more thoroughly. The basic point remains: isn’t it abundantly clear that the very best way to ensure as much non-interference as possible is physical separation? Who on earth was thinking a decade ago that non-interference wouldn’t be that much of an issue? Certainly not me.

Third, the other matters the GAO addressed are organisational, which is important of course for the organisation but of little technical interest.

Concerning accessing critical avionics systems through the IFE, Fox News reports that cyber-security researcher Chris Roberts was pulled off a US commercial flight and interrogated by the FBI for a number of hours.

A colleague commented that “they are going after the messenger.” But let’s look at this a little more carefully.

Chris Roberts is CTO and founder of One World Labs in Denver. Staff at One World consist of a CEO who is a lawyer, a CFO and a VP of sales and marketing, and two technical employees, one of whom is Roberts, who is the company founder. The board appears to be well-balanced, with a former telecommunications-industry executive and a military SIGINT expert amongst others.

One World claims to have the “world’s largest index of dark content“, something called OWL Vision, to which they apparently sell access. One wonders how they manage to compile and sustain such a resource with only two technical people in the company, but, you know, kudos to them if it’s true.

According to the first line of his CV, Roberts is “Regarded as one of the world’s foremost experts on counter threat intelligence within the cyber security industry“. His CV consists of engagements as speaker, and press interviews – there is nothing which one might regard as traditional CV content (his One World colleagues provide more traditional info: degrees, previous work experience and so on). His notable CV achievements for 2015 are a couple of interviews with Fox.

Apparently he told Fox News in March, quoted in the article above, “We can still take planes out of the sky thanks to the flaws in the in-flight entertainment systems. Quite simply put, we can theorize on how to turn the engines off at 35,000 feet and not have any of those damn flashing lights go off in the cockpit…… If you don’t have people like me researching and blowing the whistle on system vulnerabilities, we will find out the hard way what those vulnerabilities are when an attack happens.”

Read that first sentence again. He can take planes out of the sky due to flaws in the IFE, he says. Does it surprise anybody that the FBI or Homeland Security would want to find out exactly what he means? Maybe before he gets on a flight, taking some computer equipment with him? It is surely the task of security services to ensure he is not a threat in any way. If you were a passenger on that airplane, wouldn’t you like at least to know that he is not suicidal/paranoid/psychotic? In fact, wouldn’t you rather he got on the plane with a nice book to read and sent his kit ahead, separately, by courier?

It has been no secret for fourteen years that if you are going to make public claims about your capabilities you can expect security agencies nowadays to take them at face value. Would we want it otherwise?

Let us also not ignore the business dynamics. You have read here about a small Denver company, its products and claimed capabilities. I am probably not the only commentator. All at the cost to a company employee of four hours’ interrogation and the temporary loss of one laptop. And without actually having to publish their work and have people like me analyse it.



Germanwings 9525 and a potential conflict of rights

11 04 2015

Work continues on the investigation into the crash of Germanwings Flight 9525. I note happily that news media are reverting to what I regard as more appropriate phraseology. Our local newspaper had on Friday 27th March the two-word major headline “Deadly Intention”, without quotation marks, and the BBC and the Economist were both reporting as though a First Officer (FO) intention to crash the plane was fact. Written media are now reverting to what most of us would consider the formally more accurate “suspected of” phraseology. (For example, see the German article below.)

Flight International / Flightglobal had as main editorial in the 31 March – 6 April edition a comment deploring the way matters concerning the Germanwings crash are being publicly aired.

I read Flight as suggesting the Marseille procureur was abrupt. Many of us thought so at the time. An article from this week’s Canard Enchaine shows that part of the French (anti-)establishment agrees with that assessment, but for different reasons, concerning some political manoeuvring.

But Flight gets the logic wrong. The procureur was not announcing his “conviction” that the FO was “guilty” of…. whatever; neither was the announcement “surreal” by virtue of the fact that the FO was dead.

  • The procureur was not announcing the degree of his belief. He was making an accusation, in the usual formal manner using the usual formal terminology;
  • He was not judging the FO as “guilty”; that’s neither his job nor his right and he is obviously clear about that. Only a court can pronounce guilt.
  • It is not surreal: as Flight should be aware, in France prosecutions are brought, and are sometimes successful, after accidents in which everyone on board died, viz. Air Inter and Concorde. There is a case to be made that people at the airline had overlooked medical information on the FO which (would have) rendered him formally unfit to fly. There is the further possibility that there existed medical information relevant to his fitness to command a commercial airliner which was not shared with the relevant parts of the airline and/or regulator.

There is also a procedural aspect to the formal announcement by the Marseille procureur on Thursday 26th March which the Flight editorial ignores. Everyone knows the importance of preserving and gathering evidence quickly, in this case evidence about the FO. Presumably everyone agrees that it is a good thing. In order to set that process in motion, there need to be formal legal actions undertaken. The crash event took place within the jurisdiction of Marseille. Formal proceedings therefore need to be opened in Marseille and German legal authorities informed and cooperating in those proceedings in order to gather and preserve evidence in Germany. Obviously this needs to be done ASAP, because who knows how other people with immediate access to such materials are going to react. The question is whether proceedings have to be opened at a florid press conference. In this case it might have been hard to avoid.

In its editorial, Flight suggests the BEA is in a more appropriate position to gather evidence than prosecutors, and that they should be allowed to get on with that job. The other industry stalwart, Aviation Week and Space Technology, also says in a recent editorial that “We find more objectivity in accident investigators’ reports than in prosecutors’ statements.” I disagree. State attorneys’ offices and police are far more experienced at securing the kind of evidence likely to be relevant to the main questions about this crash than are aircraft safety investigators.

It seems to be the case that medical information relevant to the FO’s fitness was not distributed appropriately. For example, information concerning a 2009 depressive episode. The airline knew about this episode, and subsequently flight doctors have regularly judged him fit to fly (he regularly obtained a Class 1 medical certificate according to the annual renewal schedule). However, in April 2013 Germany brought into law the EU regulation that the regulator (LBA) must be informed and also determine fitness when an applicant has exhibited certain medical conditions. The LBA has said that it wasn’t so informed of the 2009 episode. (Here is a German news article on that, short and factual. It also laudably uses the “suspected” terminology.) If so, that seems to be an operational error for which the FO was not at all responsible in any way.

It is exactly right that the Marseille procureur along with his German counterparts is looking at all that and it is also right that that was undertaken very quickly.

There is a wider question. The confidentiality of German medical information is all but sacrosanct. Its confidential status overrides many other possibly conflicting rights and responsibilities, and I understand this has been affirmed by the Constitutional Court. Pilots have an obligation to self-report, so medical confidentiality has not come into conflict with duty of care – yet. But what about a case when medical conditions indicating unfitness to fly are diagnosed, but the pilot-patient chooses not to self-report? The pilot flies for an airline; the airline has a duty of care. If something happens to a commercial flight which this pilot is conducting, which causes harm to the airline’s clients (passengers) and others (people and objects on the ground near a CFIT; relatives of passengers), then the airline has obviously not fulfilled its duty of care to those harmed: the pilot should not have been flying, but was. However, equally obviously, the airline was unable to fulfil its duty of care: it was deprived of pertinent knowledge.

Personality assessments are used by some employers in the US in evaluating employees. See, for example, the survey in the second, third and fourth paragraphs of Cari Adams, You’re Perfect for the Job: The Use and Validity of Pre-employment Personality Tests, Scholars journal 13; Summer 2009, along with the references cited in those paragraphs. It is not clear to me at this point whether it is legal in Germany to require potential employees to undergo such tests. (As I have indicated previously, I do think that some tests, such as MMPI, could identify extreme personality characteristics, which could be associated with future inappropriate behaviour when operating safety-critical systems, in some cases where these would not necessarily be picked up in the usual employee interviews.)

I suggest that this employee medical confidentiality/employer’s duty of care issue is a fundamental conflict of rights that won’t go away. It may be resolved but it cannot be solved. It may turn out that it is currently not so very well resolved in Germany. I would judge it a good thing if this one event opens a wider debate about the conflict.



Thoughts After 4U 9525 / GWI18G

4 04 2015

It is astonishing, maybe unique, about the Germanwings Flight 4U 9525 event how quickly it seems to have been explanatorily resolved. Egyptair Flight 990 (1999) took the “usual time” with the NTSB until it was resolved, and at the end certain participants in the investigation were still maintaining that technical problems with elevator/stabiliser had not been ruled out. Silk Air Flight 185 (1997) also took the “usual time” and the official conclusion was: inconclusive. (In both cases people I trust said there is no serious room for doubt.) There are still various views on MH 370, and I have expressed mine. However, it appears that the 4U 9525/GWI18G event has developed a non-contentious causal explanation in 11 days. (I speak of course of a causal explanation of the crash, not of an explanation of the FO’s behaviour. That will take a lot longer and will likely be contentious.)

A colleague noted that a major issue with cockpit door security is how to authenticate, to differentiate who is acting inappropriately (for medical, mental or purposeful reasons) from who isn’t. He makes the analogy with avionics, in which voting systems are often used.

That is worth running with. I think there is an abstract point here about critical-decision authority. Whether technical or human, there are well-rehearsed reasons for distributing such authority, namely to avoid a single point of decision-failure. But, as is also well-rehearsed, using a distributed procedure means more chance of encountering an anomaly which needs resolving.

What about a term for it? How about distributed decision authority, DDA. DDA is used in voted-automatics, such as air data systems. It is also implicit in Crew Resource Management, CRM, a staple of crew behavior in Western airlines for a long time. Its apparent lack has been noted in some crews involved in some accidents, cf. the Guam accident in 1997 or the recent Asiana Airlines crash in San Francisco in 2013. It’s implicitly there in the US requirement for multiple crew members at all times in the cockpit, although here the term “DDA” strains somewhat – a cabin crew member has no “decision authority” taken literally but rather just a potentially constraining role.
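To make the voting analogy concrete, here is a minimal sketch of mid-value selection over three redundant air-data sources, one common form such voted automatics take. It is an illustration of the principle only, not any manufacturer’s actual algorithm:

```python
# Mid-value selection over three redundant sensor readings: a simple form of
# distributed decision authority. Illustration only, not a real implementation.
def vote(a: float, b: float, c: float, tolerance: float = 5.0):
    """Return (selected_value, disagreement_flag) for three readings."""
    mid = sorted([a, b, c])[1]   # the median is immune to a single wild value
    disagree = any(abs(x - mid) > tolerance for x in (a, b, c))
    return mid, disagree

print(vote(250.0, 251.0, 249.5))  # (250.0, False): the sources agree
print(vote(250.0, 251.0, 180.0))  # (250.0, True): one source is suspect
```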

There are also issues with DDA. For example, Airbus FBW planes flew for twenty years with air data DDA algorithms without notable problems: just five ADs. Then in the last seven years, starting in 2008, there have been over twenty ADs. A number of them modify procedures away from DDA. They say roughly: identify one system (presumably the “best”) and turn the others off (implicitly, fly with just that one deemed “best”). So DDA is not a principle without exceptions.

A main question is what we need to do, if anything.

For example, consider the measures following 9/11. Did we need them and have they worked? Concerning need, I would say a cautious yes. (Although I note the inconvenience has led me to travel around Europe mainly by rail.) The world seems to contain more organisations with, to many of us, alien murderous ideologies. 9/11 was a series of low-technology, robust (multiple actors per incident) hijackings. Attempts have been made since to destroy airliners with moderate technology and solitary actors (shoe bomber, underpants bomber, printer cartridge bombs) but these have all failed. They are not as robust; in each case, there was just one agent, and moderate technology is nowhere near as reliable as low technology: bombs are more complex than knives. One of them could have worked, but on one day in 2001 three out of four worked. It seems to me that, in general, we are controlling hijackings and hostile deliberate destruction moderately well.

After 4U 9525 do we need to do something concerning rogue flight crew? Hard to say. With the intense interest in the Germanwings First Officer’s background it seems to me likely that there will be a rethink of initial screening and on-the-job crew monitoring. Talking about the pure numbers, seven incidents in 35 years is surely very low incidence per flight hour, but then it’s not clear that statistics are any kind of guide in extremely rare cases of predominantly purposeful behavior. For example, how do we know there won’t be a flurry of copycat incidents? (I suspect this might be a reason why some European carriers so quickly instituted a “two crew in cockpit at all times” rule.)

What about classifying airlines by safety-reliability? A cursory look suggests this might not help much. Three of the murder-suicide events, almost half, have been with carriers in Arnold Barnett’s high safety grade. Barnett has published statistical reviews of world airline safety from 1979 through recently (see his CV, on the page above, for a list of papers). His papers in 1979 and 1989 suggested that the world’s carriers divided into two general classes in terms of chances of dying per flight hour or per flight. Japan Air Lines, Silk Air (the “low-cost” subsidiary of Singapore Airlines) and Germanwings (the “low-cost” subsidiary of Lufthansa) are all in the higher class.

I consider it certain that DDA with flight crew will be discussed intensively, including cockpit-door technology. Also flight-crew screening and monitoring. What will come of the discussions I can’t guess at the moment.



Germanwings Flight 4U 9525

24 03 2015

19:15 CEST on Friday 3rd April

The BEA have recovered the Flight Data Recorder and read it. They issued a communiqué. Here is my translation of the pertinent paragraph:

At a first reading it appears that the pilot in the cockpit used the autopilot to command a descent to an altitude of 100 ft, then, numerous times during the descent, the pilot modified the autopilot setting to enhance the rate of descent.

So

  • There was an initial action to initiate a descent. This was surmised from the ADS-B readouts from Flightradar24, which showed an AP setting of FL 380, then an intermediate altitude setting of some 13,008 ft QNH, then an altitude setting of 96ft QNH (=100 ft, the lowest setting possible in the FCU). This very strongly suggested manual setting of the AP through the FCU, rather than, say, an automatic setting via the FMS. Apparently the manual setting action was heard on the CVR readout. The FDR confirms this manual setting action.
  • There followed multiple subsequent manual actions coherent with the first action, to enhance the rate of descent.
  • I shall interpolate the communiqué to infer that there were no manual actions inconsistent with this command to descend to 100 ft.

I don’t see how multiple coherent actions over a period of time are consistent with the kind of brain event which I was considering up to now as a possibility. Mild epileptic-type events or stroke do not lead to coherent apparently-purposeful action. The actions usually don’t cohere at all.

This leaves just one possibility from the six I listed. The very-much-most-likely possibility is deliberate action, namely murder-suicide.

I am at best an amateur at psychology, but I did look hard at the DSM-IIIR a quarter-century ago and like to think I have kept in touch. This deliberate action was either extremely aggressive or extremely unempathetic towards the 149 other people involved. That surely points towards personality disorders which amateurs like me imagine could have been picked up by something like the MMPI. ‘Nuff said from me; others might continue this line of thought.

This is the worst mass murder in recent German history by far, and the fourth worst in recent European history (Srebrenica 1995 is by far the worst, 8,000+ lives taken; then comes Lockerbie, 1988, 270 lives taken; Madrid train bombings, 2004, 191 lives taken). Note that those other three were intended or actual acts of war.

07:40 CET on Friday 27th March

Two important points today. First, investigators have detailed apparently-deliberate actions by the First Officer to initiate a descent and keep the Captain from reentering the cockpit. Colleagues with some experience have said that it is premature to rule out actions in course of experiencing a stroke (Schlaganfall in German). Second, the workings of the cockpit door locking mechanism, and the policies concerning a pilot leaving the cockpit have come into question. I explain the operation of the A320 cockpit locking below.

First, terminology. Everybody is writing “pilot”, “co-pilot”. The usual term is Captain (CAP) and First Officer (FO), referring to the command roles. The term “pilot” informally refers to the person flying the airplane at a given time, known as Pilot Flying (PF). The other cockpit-crew member is the Pilot Non-Flying (PNF). In this incident, the PF appears to have been the FO.

French investigators have said that the Captain left the cockpit, with the First Officer, the Pilot Flying, remaining at the controls, alone in the cockpit. They also said that shortly afterwards a descent was initiated by – I am here interpolating with some knowledge of the A320 – dialing an “open descent” into the FCU (the autopilot control unit just under the glare shield). An A320-rated colleague says you can set, say, a “100 ft” target altitude, activate it, and the aircraft will go into open descent with engines at flight idle at about 4,000 feet per minute right down to 100 ft altitude, i.e. the ground here. In other words, twist and pull one knob.

I would emphasise here that such autopilot systems are not unique to the Airbus A320 but are to be found on most commercial transport aircraft nowadays.

Now to the first major issue. Concerning stroke versus deliberate action, a colleague was present when someone 29 years old had a haemorrhagic stroke.

Inside 30 minutes he went from conversing like normal; to weirdly reticent and uncoordinated; to silently sitting on a bed, clutching an aspirin bottle like a crazy person, totally unresponsive to the world. And in that time he managed to open a laptop and hammer-out an email full of utter nonsense, all for reasons that are still totally lost on him.

During such an event, one may well continue “breathing normally”, as the French press conference is reported to have said the First Officer did.

So it seems to be possible that a confused FO in the course of experiencing a stroke dialed an open descent into the FCU, maybe imagining he had to land. I am a little surprised that medical experts have not yet pointed such phenomena out clearly. It does suggest that concluding murder-suicide is premature at this stage.

It has also been suggested that the FO secured the cockpit door against being opened from outside (that is, he activated the third function below). Evidence for this is that the emergency-entry thirty-second buzzer did not sound. Maybe. No one has yet said whether there is evidence that the Captain in fact tried to activate the emergency-entry function via PIN (the second function below). Apparently he knocked on the door and continued to knock; but nothing else has been said.

Second major issue: the cockpit door locking. The cockpit door on the A320 in normal flight is permanently locked. There are three technical functions. First function: on the central console between the pilots there is a toggle switch which opens the door when it is used: it must be held in “open” position and reverts to door-locked when released. I emphasise: a pilot must “hold the door switch open” for the door to open, and it locks again when he/she releases the switch. Second function: there is a keypad mounted in the cabin outside the cockpit by the cockpit door. Someone standing outside can “ring” (press a key) to activate a ringing tone in the cockpit. Or, of course, knock on the door. The Pilot Flying (or another person) can then use the first function to open the door, and the person outside can then enter. Suppose that does not happen for some reason. Then the person outside can enter a PIN code into the keypad (he/she must have knowledge of the PIN code). A warning sound activates in the cockpit for thirty seconds, at the end of which the door unlocks for five seconds, during which the waiting person can enter, and then reverts to locked. This second function addresses the issue of the incapacitation of the pilot or occupation with other urgent tasks. The third function is a deactivation: by using a switch in the cockpit, the second function can be deactivated for a preselected period of time (the Operating Manual says between five and twenty minutes; colleagues understand that on the Germanwings aircraft it was five minutes). That means that for this period of time, even use of the PIN code outside does not unlock the cockpit door for entry. The cockpit door can still be unlocked during this time by using the first function, the “unlock” toggle switch. This third function addresses the possibility that a hostile person could physically threaten someone outside the cockpit with knowledge of the PIN code (say CAP or FO who went to the toilet in the cabin) in order to gain entry via the second function.

This is the operation of the door locking/unlocking functions on the Airbus A320. We have not checked and compared with other aircraft.
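A toy model of the three functions as I have described them, in Python; the class structure and names are mine, and the timings are those given above:

```python
# Toy model of the three A320 cockpit-door functions described above.
# Names and structure are mine; timings as described in the text.
class CockpitDoor:
    def __init__(self, lockout_minutes: int = 5):
        self.lockout_minutes = lockout_minutes  # preselected deactivation period
        self.lockout_active = False             # third function engaged?

    def toggle_switch_held(self) -> str:
        # First function: the door is open only while the cockpit switch is held.
        return "door open while switch held; locks again on release"

    def pin_entry_request(self) -> str:
        # Second function: PIN entered on the cabin keypad outside the cockpit.
        if self.lockout_active:
            return "PIN ignored: emergency entry deactivated"
        return "30 s warning tone in cockpit, then door unlocks for 5 s"

    def deactivate_pin_entry(self) -> str:
        # Third function: suppress PIN entry for a preselected period.
        self.lockout_active = True
        return f"PIN entry deactivated for {self.lockout_minutes} minutes"

door = CockpitDoor()
print(door.pin_entry_request())     # normal emergency-entry behaviour
print(door.deactivate_pin_entry())  # third function engaged
print(door.pin_entry_request())     # the PIN no longer unlocks the door
```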

I am told there is a rule in the USA that there must be two crew members in the cockpit at all times. So if CAP or FO leaves, a cabin-crew member must enter and stay until the cockpit crew member returns. This is not necessarily so in European commercial flying. As far as I know it is consistent with Germanwings operating rules that the PNF can leave the cockpit briefly under certain conditions, leaving just the PF within. (I omit discussion here of the whys and wherefores.)

It seems almost certain that there will be considerable technical discussion of whether these cockpit-door-locking procedures and rules are appropriate or need to be modified. I observe that the BBC has listed three apparent-murder-suicide events in commercial flight in the last few decades (I do not know of more), and this might be a fourth (I emphasise again the word “might”). And in at least one of those incidents, the cockpit remained accessible to those outside. In contrast, on one day alone in 2001, four cockpit crews were overwhelmed by attackers from the cabin; since the door-locking rules have been in force, none has been. And before that day in 2001, there were many instances of hostile takeover of an aircraft (“hijacking”). So arguments for and against particular methods and procedures for locking cockpit doors in flight are not trivial.

Finally, there seems to be a mistake in one of my observations below. The flight path corresponds more nearly to a 6° descent angle. This is steep, but within the normal range. London City airport has an approach glide path of 6°, and A320-series aircraft fly out of there (although, I believe, not the A320 itself). (Calculation, for the nerds like me: 1 nautical mile = about 6,000 ft so 1 nm/hr = about 100 feet per minute (fpm). So 400 knots airspeed = about 40,000 fpm. Flying at 400 kts and descending at 4,000 fpm is a slope of 1 in 10, which corresponds roughly to one-tenth of a radian which is about 6°.)
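For anyone who wants to check that arithmetic by machine rather than by rule of thumb, here is the same estimate as a few lines of Python (a sketch; 400 kt and 4,000 fpm are the approximate figures used above):

    import math

    groundspeed_kt = 400     # approximate speed discussed above
    descent_fpm = 4000       # approximate descent rate discussed above

    # Rule of thumb above: 1 nautical mile is about 6,000 ft, so 1 knot is
    # about 100 ft per minute of forward travel.
    forward_fpm = groundspeed_kt * 6000 / 60     # about 40,000 fpm

    slope = descent_fpm / forward_fpm            # about 0.1, i.e. 1 in 10
    angle_deg = math.degrees(math.atan(slope))   # about 5.7 degrees

    print(round(slope, 2), "slope;", round(angle_deg, 1), "degrees")

(The exact figure, about 5.7°, rounds to the 6° quoted above.)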

07:27 CET on Thursday 26th March

John Downer suggested the possibility of

  • an inadvertent behavioral event that did not obviously fit into my classification below. He quotes a colleague on the regular occurrence of the highly unusual: “as Scott Sagan put it: stuff that’s never happened before happens all the time”.

Inadvertent behaviour would likely involve one pilot leaving the cockpit and the other suffering a medical event. I could then see two ways to achieve the observed regular flight path: engaging descent mode in the FCU at 4,000 fpm or a 3° descent profile (note, Friday 27th March: I think this should be 6°!), or retarding the throttles in speed-hold mode.

Since the throttles are forward at high cruise, I think that slumping on them would cause them to advance, if anything, not to retard. John informs me that, during a stroke, people can become very confused. So a confused pilot manipulating the FCU or retarding the throttles does not seem out of the question. Many thanks to John for pointing out this possibility, which didn’t fit into my classification below!

Karl Swarz made us aware of the NYT report Germanwings pilot was locked out of cockpit before crash by Nicola Clark and Dan Bilefsky. Karl had sent the note before my conversation with John, but I hadn’t yet read it. It seems this is a scoop – there is also a similar report today in The Guardian but it cites the NYT.

There is some preliminary unconfirmed information from the CVR read-out. One pilot did leave the cockpit and could not re-enter during the event. There is, as currently analysed, no indication of a reaction from the pilot flying. We may presume that the analysis will become much more precise. It seems the commentators cited by the NYT are ruling out cabin depressurisation; that eliminates one of the (now) six possibilities. It seems to me likely that many of the others will be quickly ruled out.

19:04 CET on Wednesday 25th March.

Update: there is no more information on the behavior of the flight than I reported yesterday (below).

There is discussion of possibilities, and whether my classification is right. It is appropriate and necessary that there should be such discussion. Here, in the next paragraph, is some.

A colleague has suggested that the crew could have been overcome by carbon monoxide in the bleed air from the engines (which is used to pressurise the aircraft). It has happened before that a crew has been overcome by something. In each such case, the flight continued as configured until fuel was exhausted, and then came down. So if this happened here, why did the flight not continue at FL380 until the fuel was exhausted? Another colleague has suggested that the descent rate almost exactly corresponds to a descent profile of 3°, which is a normal descent profile for (say) an ILS approach. OK, but why would a crew in cruise flight, continuing cruise enroute to Düsseldorf, change the autopilot setting to a descent profile?

Somebody said on Twitter this morning, in response to my interview with a radio station in Hessen, that enumerating possibilities is speculation and one should just let the investigators do their job (and presumably deliver results).

First, this misunderstands how things are investigated. Speculation is a major component of investigation – one supposes certain things, and tries to rule them out or keep them as active possibilities. And one carries on doing this until possibilities are reduced as far as possible, ideally down to one.

Second, each technology is constrained in behavior. Airplanes can’t suddenly turn left and crash into a lane separator. Cars can’t suddenly ascend at 4,000 feet per minute. Bicycles can’t stop responding to input and show you the blue screen of death. How each artefact can behave in given circumstances is constrained, and constrained even further when a partial behavioral profile is given. Why not attempt to write that down? If it’s wrong, someone will say so and it can be corrected.

Third, such a process obviously works most efficiently when experts with significant domain knowledge attempt to write it down and other such experts correct it, and most inefficiently when people with little domain knowledge write down what they are dreaming and attempt to argue with those who suggest their dreams are unrealistic. It’s a social process, which works better or worse, but I see no reason why it should generally be deemed inappropriate. Speculation is a necessary component of explanation.

18:42 CET on Tuesday 24th March.

Here is what I think I know at this point.

Germanwings Flight 4U 9525 has crashed against an almost vertical cliff in the Alps. The flight was enroute from Barcelona to Düsseldorf and took the route which had been flown the day before. At about 0931Z (=10:31CET) the aircraft was at FL380 in level flight and started a descent at a rate of about 4,000 feet per minute, which continued more or less constantly until about 7,000 ft altitude, when it levelled off. The descent lasted until 0941Z (=10:41CET).

It continued level for either 1 minute or 11 minutes. Contact was reported to have been lost at 0953Z. Such basic facts are often unclear in the first 24 hours, even though they appear to come from reliable sources.

I see five possible contributing events, not all mutually exclusive:

  • Loss of cabin pressure. A crew should react by starting a descent at about this rate, but the descent should have been stopped before reaching 7,000 ft altitude;
  • Fire. The crew would wish to descend and land as soon as possible. Emergency descents in excess of 4,000 feet per minute are possible, especially at higher altitudes, and a crew in a hurry to land, as in a case of fire on board, could have been expected to descend faster than this;
  • Dual engine problems, maybe flameout. The descent rate at best-glide speed, though, is, I have been informed, somewhere between 2,000 and 3,000 feet per minute. One would not wish to come down faster, since the more time one has to troubleshoot, and then to try to restart, the better;
  • An air data problem affecting the handling of the aircraft. Air data problems with these aircraft, as well as with A330 and A340 aircraft that have almost-identical air data sensorics, have occurred in cruise and other phases of flight since 2008, and there has been a series of Airworthiness Directives from EASA and the FAA in this period, including Emergency Airworthiness Directives within the last few months. However, one would not expect aircraft behavior associated with such a problem to last nine minutes at a constant, moderate rate of descent;
  • Hostile – and criminal – human action on board.

I’ve already given a TV interview in which I only mentioned four of these five. Such is life. Are there more?

In a number of these cases, one would expect a crew to turn towards a nearby adequate airport for landing, such as Marseille. One would certainly not expect them to continue flying towards high mountains! In particular, towards the Alps at 7,000 ft. So the question is raised whether the crew was or became incapacitated during the event.

I’ll update when I know more.

PBL 1800Z/1900CET



Fault, Failure, Reliability Definitions

4 03 2015

OK, the discussion on these basic concepts continues (see the threads “Paper on Software Reliability and the Urn Model”, “Practical Statistical Evaluation of Critical Software”, and “Fault, Failure and Reliability Again (short)” in the System Safety List archive).

This is a lengthy-ish note with a simple point: the notions of software failure, software fault, and software reliability are all well-defined, although it is open what a good measure of software reliability may be.

John Knight has noted privately that in his book he rigorously uses the Avizienis, Laprie, Randell, Landwehr IEEE DSC 2004 taxonomy (IEEE Transactions on Dependable and Secure Computing 1(1):1-23, 2004, henceforth ALRL taxonomy), brought to the List’s attention by Örjan Askerdal yesterday, precisely to be clear about all these potentially confusing matters. The ALRL taxonomy is not just the momentary opinion of four computer scientists. It is the update of a taxonomy on which the authors had been working along with other members of IFIP WG 10.4 for decades. There is good reason to take it very seriously indeed.

Let me first take the opportunity to recommend John’s book on the Fundamentals of Dependable Computing. I haven’t read it yet in detail, but I perused a copy at the 23rd Safety-Critical Systems Symposium in Bristol last month and would use it were I to teach a course on dependable computing. (My RVS group teaches computer networking fundamentals, accident analysis, risk analysis and applied logic, and runs student project classes on various topics.)

The fact that John used the ALRL taxonomy suggests that it is adequate to the task. Let me take John’s hint and run with it.

(One task before us, or, rather, before Chris Goeker, whose PhD topic is vocabulary analysis, is to see how the IEC definitions cohere with ALRL. I could also add my own partial set to such a comparison.)

Below is an excerpt from ALRL on failure, fault, error, reliability and so forth, under the usual fair use provisions.

It should be clear that a notion of software failure as a failure whose associated faults lie in the software logic is well defined, and that a notion of software reliability as some measure of proportion of correct to incorrect service is also possible. What the definitions don’t say is what such a measure should be.
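To make “some measure of proportion of correct to incorrect service” concrete, here is the simplest candidate I can think of, an observed proportion over demands. It is purely illustrative: the numbers are invented, and ALRL does not prescribe any such measure.

    # Illustration only: a naive on-demand reliability estimate.
    # The numbers of demands and failures below are invented.
    demands = 10000      # hypothetical number of demands observed in operation
    failures = 3         # hypothetical number of service failures among them

    observed_reliability = (demands - failures) / demands   # proportion of correct service
    estimated_pfd = failures / demands                      # crude probability of failure on demand

    print(observed_reliability, estimated_pfd)

Whether such a frequentist proportion, a rate per operating hour, or something Bayesian is the right measure is exactly the question the definitions leave open.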

The well-definedness of these notions contradicts Nick Tudor’s suggestion in a List contribution yesterday that “software does not fail ….. It therefore makes no sense to talk about reliability of software”. Nick has suggested, privately, that this is a common view in aerospace engineering. Another colleague has suggested that some areas of the nuclear power industry also adhere to a similar view. If so, I would respectfully suggest that these areas of engineering get themselves up to date on how the experts, the computer scientists, talk about these matters, for example ALRL. I think it’s simply a matter of engineering responsibility that they do so.

In principle you can use whatever words you want to talk about whatever you want. The main criteria are that such talk is coherent (doesn’t self-contradict) and that the phenomena you wish to address are describable. Subsidiary criteria are: such descriptions must be clear (select the phenomena well from amongst the alternatives) and as simple as possible.

I think ALRL fulfils these criteria well.

[begin quote ALRL]


The function of such a system is what the system is intended to do and is described by the functional specification in terms of functionality and performance. The behavior of a system is what the system does to implement its function and is described by a sequence of states. The total state of a given system is the set of the following states: computation, communication, stored information, interconnection, and physical condition. [Matter omitted.]

The service delivered by a system (in its role as a provider) is its behavior as it is perceived by its user(s); a user is another system that receives service from the provider. [Stuff about interfaces and internal/external states omitted.] A system generally implements more than one function, and delivers more than one service. Function and service can be thus seen as composed of function items and of service items.

Correct service is delivered when the service implements the system function. A service failure, often abbreviated here to failure, is an event that occurs when the delivered service deviates from correct service. A service fails either because it does not comply with the functional specification, or because this specification did not adequately describe the system function. A service failure is a transition from correct service to incorrect service, i.e., to not implementing the system function. …… The deviation from correct service may assume different forms that are called service failure modes and are ranked according to failure severities….

Since a service is a sequence of the system’s external states, a service failure means that at least one (or more) external state of the system deviates from the correct service state. The deviation is called an error. The adjudged or hypothesized cause of an error is called a fault. Faults can be internal or external of a system. ….. For this reason [omitted], the definition of an error is the part of the total state of the system that may lead to its subsequent service failure. It is important to note that many errors do not reach the system’s external state and cause a failure. A fault is active when it causes an error, otherwise it is dormant.

[Material omitted]

  • availability: readiness for correct service.
  • reliability: continuity of correct service.
  • safety: absence of catastrophic consequences on the
    user(s) and the environment.
  • integrity: absence of improper system alterations.
  • maintainability: ability to undergo modifications
    and repairs.

[end quote ALRL]
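To connect this vocabulary back to code: a fault in the software logic is dormant until an input activates it; the resulting error is a deviation in the internal state; and there is a service failure only if that error reaches the delivered service. A toy illustration (mine, not ALRL’s; the function names are invented):

    # Toy illustration of the ALRL chain: fault -> error -> service failure.

    def average(values):
        # Fault in the software logic: this is wrong for an empty list.
        # The fault is dormant until such an input arrives.
        return sum(values) / len(values)

    def report_average(values):
        # The delivered service: print an average, or "no data" when there is none.
        try:
            print("average:", average(values))
        except ZeroDivisionError:
            # The fault became active and caused an error (an exception in the
            # internal state), but the error is handled here and never reaches
            # the external state, so there is no service failure.
            print("no data")

    report_average([1, 2, 3])   # correct service
    report_average([])          # active fault and internal error, yet correct service

Delete the except clause and exactly the same fault and error would reach the delivered service, that is, produce a service failure. It is in this sense that “software failure”, and with it a proportion-based notion of software reliability, are well defined.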