SILs, the Safety-Related System Lifecycle and Security Level (Ingo Rolle)

26 04 2016

[Ingo Rolle is the Secretary of the German National Committee responsible for matters concerning IEC 61508 as well as the German National Committee responsible for matters concerning IEC 62443. This is an invited essay. PBL]

IEC 61508:2010 is the international standard for functional safety of electrical, electronic and programmable electronic systems. It applies to the digital subsystems of any industrial system, such as process plants, sensors in safety-critical environments, and even electric toasters. It has specialised derivatives for each branch of engineering, for example IEC 61511 for industrial process plants and EN 50128 for railway systems (for example, train control and signalling).

In recent years, penetration and subversion of computer systems have increased, as has the damage caused by such activities. Engineers are now well aware that their digital-electronic subsystems are open to penetration and subversion of their intended function by unauthorised persons, and that this applies even to industrial control systems with no network connection to the “outside world” (e.g., not connected to the Internet). One famous example from 2010 is the “Stuxnet” malware which affected the operations of an Iranian uranium processing plant, as well as other sites. “Security” is the general term for measures intended to protect against unauthorised penetration and subversion of function of systems; for digital systems this is often called “IT Security” or “cybersecurity”, to distinguish it from the physical security which has long been part of the design of industrial plants.

Cybersecurity for industrial control systems (control systems in industrial plant) is being standardised in the series IEC 62443. It defines a notion of Security Level (SL), to indicate the extent to which security analysis and measures must take place for a given subsystem with given function and with respect to one of seven Foundational Requirements (FR). SLs are defined as follows:

  • SL 1: There is to be protection against casual or coincidental violation of function
  • SL 2: … protection against intentional violation using simple means
  • SL 3: … protection against intentional violation using sophisticated means
  • SL 4: … protection against intentional violation using sophisticated means with extended resources

and the FRs are:

  • IAC Identification and authentication control
  • UC Use control
  • SI System integrity
  • DC Data confidentiality
  • RDF Restricted data flow
  • TRE Timely response to events
  • RA Resource availability

The notion of SL is further subdivided into:

  • SL Target: the target Security Level resulting from the risk analysis;
  • SL Achieved: what Security Level has actually been achieved in the real application;
  • SL Capability: what Security Level a specific device attains if properly installed and configured.

So, for example, Preliminary Hazard/Threat Analysis of system S may result in a risk assessment which assigns SL 2 to functional component C with respect to Foundational Requirement DC. Company XYZ has a device which can be used as component C, and they have developed and demonstrated it to SL 4 for FR DC. However, there is a specific “extended resource” R (say, a specific computer system powerful enough to break some sophisticated encryption) which may have access to component C through its environment in S, and it is not known whether the encryption of information in C is sufficient to resist persistent use of R in all cases. So in its use in S, the engineers assess component C as attaining SL 3, but not necessarily SL 4. Here, we have SL Target 2, SL Achieved 3, and SL Capability 4 for FR DC. We should mention that, to achieve SL 3 for FR DC, some specific capabilities of the organization which operates component C should be demonstrated.
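The example can be written down as a small record. The following is only a sketch in Python: the class name, field names and the two helper methods are hypothetical and do not come from IEC 62443; only the FR abbreviations, the range SL 1 to SL 4 and the values 2, 3 and 4 are taken from the text above.

```python
from dataclasses import dataclass

# The seven Foundational Requirements listed above.
FOUNDATIONAL_REQUIREMENTS = ("IAC", "UC", "SI", "DC", "RDF", "TRE", "RA")


@dataclass(frozen=True)
class SecurityLevelClaim:
    """Hypothetical record of the three SL notions for one component and one FR."""
    component: str
    fr: str
    target: int      # SL Target: result of the risk analysis
    achieved: int    # SL Achieved: assessed in the real application
    capability: int  # SL Capability: attained by the device if properly installed

    def __post_init__(self):
        if self.fr not in FOUNDATIONAL_REQUIREMENTS:
            raise ValueError(f"unknown Foundational Requirement: {self.fr}")
        for level in (self.target, self.achieved, self.capability):
            if not 1 <= level <= 4:
                raise ValueError("the Security Levels listed above run from 1 to 4")

    def meets_target(self) -> bool:
        """True if the level achieved in the application is at least the target."""
        return self.achieved >= self.target

    def margin(self) -> int:
        """Number of levels by which SL Achieved exceeds (or falls short of) SL Target."""
        return self.achieved - self.target


# Component C in system S, with respect to FR DC, as in the example above.
claim = SecurityLevelClaim("C", "DC", target=2, achieved=3, capability=4)
print(claim.meets_target(), claim.margin())
```

Keeping the three numbers as separate fields is the point of the sketch: a single "SL" field could not say whether it records what was required, what the device can do, or what was attained in this application.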

IEC 61508 has a notion of Safety Integrity Level (SIL). The SIL is applied to so-called “safety functions”, functions which are meant to keep operation of a system acceptably safe. It is a measure of the reliability of a safety function. If the safety function is implemented largely in software, the Software SIL translates into how carefully the system must be developed (and it is assumed that care in development correlates with trustworthiness of the resulting software).

We can compare the approach to Foundational Requirements roughly with the approach to Safety Functions. Whereas the FRs are defined in common by IEC 62443 for all systems, Safety Functions are determined individually for each system by means of the Preliminary Hazard Analysis.

The differentiation between “target”, “achieved” and “capability” for SILs is not present in IEC 61508. We suggest it is a useful distinction and may be interpreted thus. On the one hand, IEC 61508 states that a SIL is a measure of how much a safety function shall reduce a risk. The risk is assessed as one of the first steps in IEC 61508-conformant system development and the SIL assigned as a result of this analysis. This would resemble “target” SIL. On the other hand, SIL is literally defined as an integrity level, which would seem to be the reliability with which the safety function is actually performed in a running system. This would resemble “achieved” SIL. It is common for device manufacturers to state in a product data sheet that their device is appropriate for “SIL 4 applications”. This is something like a statement of SIL “capability”.

IEC 61508 (first version 1998) was written much earlier than IEC 62443 (ISA99), and this differentiation of different notions of SIL obviously did not occur to the original authors. Now, after decades of companies adapting to IEC 61508 as written, it seems to be difficult to introduce such notions into a new version. Nevertheless, the differentiation is obviously helpful, as indicated above. So it seems we may have to add a “virtual index” to each SIL statement in the manner in which IEC 62443 does.

Ingo Rolle

Power Plants and Cyberawareness

23 03 2016

There is a considerable challenge in raising the awareness of engineering-plant personnel about the criticality of the computer systems they might be using.

We addressed some electricity blackouts at the Safety-Critical Systems Symposium 2016. In the 2003 North American blackout, the malfunction of two computer systems on which operators and oversight personnel relied was causal to the outage (by the Counterfactual Test: had the computers functioned as expected, there is every reason to think the day would have been routine, as indicated in the reports). The systems had apparently not been identified as critical, however, and were operated accordingly. Neither of the two reports (from the North American Electric Reliability Council NERC, and from the Joint US-Canada Task Force) identified the computer systems as critical, let alone their malfunction as explicitly causal to the outage. Those reports, however, are over a decade old. Things might be different now.

Or not. More recent (October 2015) is an eye-opening report on Cyber Security at Civil Nuclear Facilities, prepared by the International Security Department at the UK Royal Institute of International Affairs (known as Chatham House, former home of the great 18th century British statesman William Pitt, 1st Earl of Chatham).

The report garnered significant press attention, such as from the Financial Times (note this link brings you to a paywall; if alternatively you conduct a Google search and follow the resulting link, you can read the original article), the MailOnline, and ComputerWeekly. However, it seems to have been comparatively overlooked by the computer security community (there has been no note in the RISKS Digest, for example).

Readers of the report will note a lot of low-hanging fruit. Systems are often thought to be “air gapped” from external computer networks and thus invulnerable to intrusion. Interviews for this report were conducted in 2014-2015, four to five years after Stuxnet hit the similarly air-gapped Iranian centrifuge facilities (2010). Besides reports in almost all serious newspapers, there is a wealth of public detail about Stuxnet. One would have thought it would have been taken to heart, nuclear-industry-wide. But maybe not yet.

One awareness vector is international standards. There is an IEC committee, SC 45A, tasked with standards in instrumentation, control, and electrical systems of nuclear facilities. There are recent IEC standards for computer-based system security, for example IEC 62645:2014 on requirements for security programmes for programmable digital systems, and IEC 62859:2015 on requirements for coordinating safety and cybersecurity (listed under the SC 45A link, or separately for purchase in the IEC shop).

However, the effectiveness of standards sometimes conflicts with the convenience of engineers. The functional safety standard for computer-based systems in industrial process plants, IEC 61511, has been around for two decades and in some countries is also regulatory law. IEC 61511 says that SW shall be developed according to the precepts of IEC 61508 Part 3, the part of the general functional safety standard which deals specifically with SW, which has also been around for a couple of decades. Readers might like to ask their local power supplier if the generation plants use control software developed according to IEC 61508 Part 3. Or, if not, then according to which standard? I think I know the answer in most cases. I think it might help if more people knew the answer as well. One could widen the inquiry to industrial control-system software in general. I think you should be able to ask any supplier about its adherence to standards, and get a straight, informative answer.


12 01 2016

There are a few different notions of risk used in dependability engineering.

One notion, used in finance and in engineering safety, is from De Moivre (1712, De Mensura Sortis in the Philosophical Transactions of the Royal Society) and is

(A) the expected value of loss (people in engineering say “combination of severity and likelihood”).

A second notion, used in quality-control and project management, is

(B) the chances that things will go wrong.

Suppose you have a €20 note in your pocket to buy stuff, and there is an evens chance that it will drop out of your pocket on the way to the store. Then according to (A) your risk is an expected loss of €10 (= €20 x 0.5) and according to (B) your risk is 0.5 (or 50%). Notice that your risk according to (A) has units which are the units of loss (often monetary units) whereas your risk according to (B) has no units, and is conventionally a number between 0 and 1 inclusive.
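The same arithmetic as a minimal Python sketch. The function and variable names are mine, chosen for illustration, and the loss is expressed as a positive amount.

```python
def expected_loss(amount: float, probability: float) -> float:
    """Notion (A): expected value of loss, in the units of the loss."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie between 0 and 1 inclusive")
    return amount * probability


def chance_of_loss(probability: float) -> float:
    """Notion (B): the chance that things go wrong, a unitless number in [0, 1]."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie between 0 and 1 inclusive")
    return probability


note_value_eur = 20.0  # the €20 note
p_drop = 0.5           # evens chance that it drops out of the pocket

risk_a = expected_loss(note_value_eur, p_drop)
risk_b = chance_of_loss(p_drop)

print(f"(A) expected loss: {risk_a:.2f} EUR")
print(f"(B) chance of loss: {risk_b:.0%}")
```

Note that (A) changes if the note is a €50 note and (B) does not, which is one way to see that the two notions measure different things.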

(A) and (B) are notions 2 and 3 in the Wikipedia article on Risk, for what it’s worth.

The International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC) put out guides to the inclusion of common aspects in international standards. One is on Safety Aspects (Guide 51, 2014 edition) and one is on Risk Management (Guide 73, 2009 edition). The Guide 51 definition of risk is the combination of probability of occurrence of harm and the severity of that harm, where harm is injury or damage to the health of people, or damage to property or the environment. The Guide 73 definition of risk used to be chance or probability of loss, i.e. (B), but has changed in the 2009 edition to the effect of uncertainty on objectives.

The 2013 edition of ISO/IEC 15026 Systems and Software Engineering – Systems and Software Assurance, Part 1: Concepts and Vocabulary (formally denoted ISO/IEC 15026-1:2013), defines risk to be the combination of the probability of an event and its consequence, so (A).

The IEEE-supported Software Engineering Body of Knowledge (SWEBOK) says, in Section 2.5 on Risk Management,

Risk identification and analysis (what can go wrong, how and why, and what are the likely consequences), critical risk assessment (which are the most significant risks in terms of exposure, which can we do something about in terms of leverage), risk mitigation and contingency planning (formulating a strategy to deal with risks and to manage the risk profile) are all undertaken. Risk assessment methods (for example, decision trees and process simulations) should be used in order to highlight and evaluate risks.

Notice that “what can go wrong” is hazard identification, and “how and why” is analysis; together with “what are the likely consequences”, which is severity assessment, this is all part of hazard analysis. What is missing here is an assessment of likelihood, which is common to both (A) and (B), that is, to the Guide 51 definition and the former Guide 73 definition.

ISO/IEC 24765:2010 Systems and Software Engineering – Vocabulary defines risk to be

1. an uncertain event or condition that, if it occurs, has a positive or negative effect on a project’s objectives. A Guide to the Project Management Body of Knowledge (PMBOK® Guide) — Fourth Edition.
2. the combination of the probability of an abnormal event or failure and the consequence(s) of that event or failure to a system’s components, operators, users, or environment. IEEE Std 829-2008 IEEE Standard for Software and System Test Documentation.3.1.30.
3. the combination of the probability of an event and its consequence. ISO/IEC 16085:2006 (IEEE Std 16085-2006), Systems and software engineering — Life cycle processes — Risk management.3.5; ISO/IEC 38500:2008, Corporate governance of information technology.1.6.14.
4. a measure that combines both the likelihood that a system hazard will cause an accident and the severity of that accident. IEEE Std 1228-1994 (R2002) IEEE Standard for Software Safety Plans.3.1.3.
5. a function of the probability of occurrence of a given threat and the potential adverse consequences of that threat’s occurrence. ISO/IEC 15026:1998, Information technology — System and software integrity levels.3.12.
6. the combination of the probability of occurrence and the consequences of a given future undesirable event. IEEE Std 829-2008 IEEE Standard for Software and System Test Documentation.3.1.30

ISO/IEC 24765 thus acknowledges that there are different notions doing the rounds.

The Systems Engineering Body of Knowledge (SEBoK) says in its Wiki page on Risk Management that

Risk is a measure of the potential inability to achieve overall program objectives within defined cost, schedule, and technical constraints. It has the following two components (DAU 2003a):

  • the probability (or likelihood) of failing to achieve a particular outcome
  • the consequences (or impact) of failing to achieve that outcome

which is a version of (A).

What are the subconcepts underlying (A) and (B), and other conceptions of risk?

(1) There is vulnerability. Vulnerability is the hazard, along with the damage that could result from it, and the extent of that damage; this is often called “severity”. So: hazard + hazard-severity. This is close to Definition 1 of ISO/IEC 24765.
(2) There is likelihood. This can be likelihood that the hazard is realised (assuming worst-case severity) or likelihood that a specific extent of damage will result. This is only meaningful when events have a stochastic character. This is (B), the former definition in ISO/IEC Guide 73, and item 3 in the Wikipedia list.

If you have (1) and (2), you have (A) and (B). If you have (A) and (B), you have (2) (=B) but you don’t have (1). But (1) is what you need to talk about security, because security incidents do not generally have a stochastic nature.

Terje Aven, in his book Misconceptions of Risk, argues (in Chapter 1) that even notion (A) is inadequate to capture essential aspects of risk. He attributes to Daniel Bernoulli the observation that utility is important: just knowing expected value of loss is insufficient to enable some pertinent decisions to be made about the particular risky situation one is in.
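A small numerical illustration of the point about utility, as a sketch only. The two situations, the starting wealth of 100 and the helper names are invented for the illustration; the logarithmic utility function is the kind Bernoulli proposed, and is used here as an assumption, not as Aven's own example.

```python
import math


def expected_loss(losses):
    """Notion (A) for a list of (probability, loss) pairs."""
    return sum(p * loss for p, loss in losses)


def expected_log_utility(wealth, losses):
    """Expected utility of final wealth under an assumed logarithmic utility."""
    return sum(p * math.log(wealth - loss) for p, loss in losses)


wealth = 100.0

# Two hypothetical situations with the same expected loss of 10.
certain_loss = [(1.0, 10.0)]             # lose 10 for sure
gamble = [(0.5, 20.0), (0.5, 0.0)]       # evens chance of losing 20

loss_certain = expected_loss(certain_loss)
loss_gamble = expected_loss(gamble)
utility_certain = expected_log_utility(wealth, certain_loss)
utility_gamble = expected_log_utility(wealth, gamble)

print(f"expected loss:    {loss_certain:.2f} vs {loss_gamble:.2f}")
print(f"expected utility: {utility_certain:.4f} vs {utility_gamble:.4f}")
```

Notion (A) cannot tell the two situations apart, whereas the expected utility is lower for the gamble. Someone deciding between them needs more than the expected value of loss.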

A third subconcept underlying risk is that of uncertainty. Aven has argued recently that uncertainty is an appropriate replacement for probability in the notion of risk. Uncertainty is related to what one knows, to knowledge, and of course the Bayesian concept of probability is based upon evaluating relative certainty/uncertainty.

It is worthwhile to think of characterising risk in terms of uncertainty where traditional probability is regarded as inappropriate. However, there are circumstances in which it can be argued that probabilities are objective features of the world; quantum-mechanical effects, for example. And if a system operates in an environment of which the parameters pertinent for system behavior have a stochastic nature, no matter how much of this is attributable to a lack of knowledge (a failure to observe or understand causal mechanisms, for example) and how much to objective variation, such probabilities surely must play a role as input to a risk assessment.

Water and Electricity

28 12 2015

We do know that they don’t mix well.

In an article in the Guardian about the floods in York, I read about the flood barrier on the River Foss that

Problems arose at the weekend at the Foss barrier and pumping station, which controls river levels by managing the interaction between the rivers Foss and Ouse. In a model that is commonplace around the country, pumps behind the barrier are supposed to pump the water clear. The station became inundated with floodwater after the volume exceeded the capacity of the pumps and flooded some of the electrics, according to an Environment Agency spokesperson, who said that a helicopter was due to airlift in parts to complete repairs on Monday.

It is particularly ironic that flood-control measures are rendered ineffective through flooding of their controls. But it’s not a one-off.

At the beginning of this month, December 6, much of the city of Lancaster (and reportedly 55,000 people) were left without power when an electricity substation in Caton Road was flooded in a previous storm.

Here is Roger Kemp’s take on the Lancaster substation affair.

In March 2011, when the tsunami resulting from the Tohoku earthquake flooded the Fukushima Daiichi nuclear power station, the electrics for the emergency backup generators were also awash. If I remember correctly, in the US some of the Mark I BWRs had been modified so that the electrics controlling the emergency power generation were installed higher up in the buildings than the basement, where they still were at Fukushima Daiichi. The bluff on which the power station was built had also been lowered by 15m during building to enable easier access from the seaward side.

I’ll leave it to readers to connect the dots. The question is whether the resources will be made available in the UK for review of the placement of critical electrics, and for prophylaxis. And also what we can do as members of professional societies for electrotechnology to encourage those resources to be mobilised, in the UK and elsewhere – big cities in Germany such as Hamburg and Dresden have been flooded in recent years.

I suspect that some measures would be relatively simple to implement, for example putting effectively sealed accessways on vulnerable substations and other critical installations. Maybe one could seal at ground-level permanently, install sealed doors two meters up, with steps on both sides? And install an effective pump for what might nevertheless leak through a seal, powered by a sealed battery and started by a float-activated switch. And so on.

As societies, we are becoming more dependent on electricity as an essential component of living, and there are plans to become even more so. This leads to vulnerabilities which I believe we haven’t yet thoroughly considered.

When I was a child, house heating came through burning coal, coke or occasionally wood. If there was an electricity cut, you could still heat your house. Nowadays, almost all building heating is electrically controlled. Even fancy wood-pellet-burning stoves, which may be connected to the circulating heating water. Take out the electricity, take out the heating too, nowadays.

According to EU statistics, in 2013 11.8% of inland energy consumption in the EU-28 was from renewable resources, and in the same year 25.4% of electricity was generated from renewable resources. Which suggests that less than half of energy consumption in the EU-28 is via electricity; much of the rest will be transportation, I suppose. Transportation’s use of energy from renewable resources was only 5.4% in 2013. There is scope for change – everyone seems to be thinking about electric road vehicles (ERVs).
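The inference can be made explicit. Renewable electricity is part of renewable energy, so it cannot exceed it; that bounds the share of electricity in total consumption from above. A minimal sketch in Python, under the assumption (mine) that the two percentages are measured on a comparable basis; the variable names are also mine.

```python
# Figures quoted above for the EU-28 in 2013.
renewable_share_of_energy = 0.118       # renewables / total energy consumption
renewable_share_of_electricity = 0.254  # renewable electricity / total electricity

# Renewable electricity cannot exceed renewable energy as a whole:
#   renewable_share_of_electricity * electricity <= renewable_share_of_energy * total
# Dividing through gives an upper bound on electricity / total.
max_electricity_share = renewable_share_of_energy / renewable_share_of_electricity

print(f"electricity is at most {max_electricity_share:.1%} of energy consumption")
```

The bound comes out at a little under 47%. It would be reached only if all renewable energy were consumed as electricity, so the actual share of electricity is lower still.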

I doubt whether the infrastructure exists to supply appropriate amounts of electricity for recharging ERVs if they constituted a large proportion of vehicle use, and I am not alone. The RAEng suggested in 2010 that current supply could be “overwhelmed”. (Roger Kemp chaired the committee which produced the report.)

Amongst the issues is the quality of electrical infrastructure. The German electrotechnical industry association ZVEI pointed out some years ago that 70% of building electrical installations have outlived their design lifetime of 30-35 years and are still in operation; also that 50 years ago there were typically 6-8 electrical devices in the average household, and now there are typically more than 70. In the presentation in which these figures appear, they were more worried about the functional safety of the installations, in particular fire risk. Malfunctioning electrics cause 15-20% of all building fires in Germany, they say. If I remember rightly, about ten times as many people die per year in building fires caused by electrical malfunction as die from electrocution: 200 as compared with 15-20. I don’t recall anything in the presentations I have seen on vulnerabilities to flooding.

When York, Lancaster and Leeds have streets lined with charging points for ERVs, I hope those points are adequately protected from floods. When half the cars along a street are electric, and the street floods to a depth of a meter, what is going to happen to and around those cars? Would you touch one after the floods recede? Recall there is enough stored energy in a fully-charged vehicle to power your average Western house for a few days.

I spent a couple of years on and around German standardisation committees on ERVs. In all the meetings, I don’t recall questions concerning effects of submersion ever arising. I think they should be considered.

Kissinger on SDI and the Soviet Collapse

13 11 2015

I’ve been reading Henry Kissinger’s “summation” of international relations, World Order, which is as interesting and insightful as people have said.

He says of SDI that

[Reagan] challenged the Soviet Union to a race in arms and technology that it could not win, based on programs long stymied in Congress. What came to be known as the Strategic Defence Initiative – a defensive shield against missile attack – was largely derided in Congress and the media when Reagan put it forward. Today it is widely credited with convincing the Soviet leadership of the futility of its arms race with the United States.

He says later,

…without Reagan’s idealism – bordering sometimes on a repudiation of history – the end of the Soviet challenge could not have occurred amidst such a global affirmation of a democratic future.

By “Reagan’s idealism”, Kissinger explicitly means the idea of the “shining city on a hill”, which he says “was not a metaphor for Reagan; it actually existed for him because he willed it to exist.”

Kissinger uses the “key people in positions of power” theory of the mechanisms of international relations while explaining the continuity of US foreign policy from Nixon through Ford, Carter and Reagan. Such an assertion of continuity might surprise those who were actually present during the period, but Kissinger’s argument for it is coherent, as one might expect.

Kissinger hedges his point about SDI by not actually appropriating it – he says “widely credited”, and that is correct, I think. But that doesn’t mean it’s fact.

Let me propose an alternative view, in which it was one of two major factors (amongst a plethora of others).

In 1947, George Kennan foresaw how things would progress. It might be said that his view, more widely spread, established the Cold War and predicted its denouement. It had been clear for a long time by the mid-1980s that US productivity, when channelled into military spending, could outrun that of the Soviet Union in the long term, but no one knew how long that term would be. I seem to recall some reports that the Soviets were putting 40% of their productivity into military kit, and for all anyone knew maybe they could raise that to 60%, because it could have been seen as more important than feeding people. Whereas there was no appetite in the US for even 20% spending on the military, after the Vietnam war.

SDI was in the first place an escalation of resource consumption. It wasn’t based on a Reagan decision alone; it was based more generally on fantasy in the US military, of which there was a plentiful supply. I remember an eminent colleague in the mid-80s recounting a meeting with a USAF general officer whose vision consisted of a helmet which could read and execute the thoughts of a fighter pilot: “fly there, do that, shoot that; I just THINK about it and it happens”. Thirty years later, bits of that have been implemented. Whereas the SDI vision is having trouble achieving even a 50% success rate in one-on-one anti-ICBM trials, according to the table in this 2014 article. Now, I suspect well-grounded Soviet military technologists knew as well as well-grounded US military technologists that SDI at that point in the 1980s was fantasy. The arguments are not hard; they were, as expressed by David Parnas, convincing, true and public. Some people in the Soviet Union surely must have known that SDI was bluff.

So what was SDI’s role in the Soviet collapse? I suggest it may have been half of it. The other half was Reagan suggesting directly to Gorbachev that both sides could just scrap their nuclear missiles, and meaning it. The Soviet leadership realised they were playing with someone who was far wealthier, who could more or less bet anything he pleased at any point in the game, at whim. If you’re on welfare, and you’re playing poker with a millionaire who has just spent €10,000 in front of you on a tie because he didn’t like the one he was wearing, and he’s offering at the same time to stop the game, it’s not clear what you should best do but stopping right now must seem an attractive option.

And, of course, if Kennan was right, which apparently everyone now thinks he was, then the collapse would have happened anyway, with or without SDI. But it might have taken a bit longer. Then of course there was that bit about taking down a wall in Berlin that might have had something to do with it.

The Accident to SpaceShip Two

3 08 2015

Alister Macintyre noted in the Risks Forum 28.83 that the

US National Transportation Safety Board (NTSB) released results of their investigation into the October 31, 2014 crash of SpaceShipTwo near Mojave, California.

The NTSB has released a preliminary summary, findings and safety recommendations for the purpose of holding the public hearing on July 28, 2015. All those may be modified as a result of matters arising at the hearing. This is standard procedure for the Board.

Their summary of why the accident happened is

[SpaceShip2 (SS2)] was equipped with a feather system that rotated a feather flap assembly with twin tailbooms upward from the vehicle’s normal configuration (0°) to 60° to stabilize SS2’s attitude and increase drag during reentry into earth’s atmosphere. The feather system included actuators to extend and retract the feather and locks to keep the feather in the retracted position when not in use.

After release from WK2 at an altitude of about 46,400 ft, SS2 entered the boost phase of flight. During this phase, SS2’s rocket motor propels the vehicle from a gliding flight attitude to an almost-vertical attitude, and the vehicle accelerates from subsonic speeds, through the transonic region (0.9 to 1.1 Mach), to supersonic speeds. ….. the copilot was to unlock the feather during the boost phase when SS2 reached a speed of 1.4 Mach. …. However, …. the copilot unlocked the feather just after SS2 passed through a speed of 0.8 Mach. Afterward, the aerodynamic and inertial loads imposed on the feather flap assembly were sufficient to overcome the feather actuators, which were not designed to hold the feather in the retracted position during the transonic region. As a result, the feather extended uncommanded, causing the catastrophic structural failure.

This, the Board notes, represents a single point of catastrophic failure which could be, and in this case was, instigated by a single human error.

A hazard analysis (HazAn) is required by the FAA for all aerospace operations it certifies. It classifies effects into catastrophic, hazardous, major, minor and “no”, and certification (administrative law) requires that the probability of events in certain classes is ensured to be sufficiently low, through avoidance or mitigation of identified hazards.

HazAn is a matter of anticipating deleterious events in advance. The eternal questions for HazAn are:

  • Question 1. Did you think of everything? (Completeness)
  • Question 2. Does your mitigation/avoidance really work as you intend?

These questions are very, very hard to answer confidently. Imperfect HazAns are almost inevitable in novel operations. In aviation, sufficient experience has accumulated over the decades to ensure that the HazAn process fits the standard kinds of kit and operations and the answers to the questions are to a close approximation yes-yes. In areas in which there is no experience, for example use of lithium-ion batteries for main and auxiliary electric-power storage in the Boeing 787, answers appeared to be no-no. In commercial manned spaceflight, there is comparatively a tiny amount of experience. Certification of a new commercial transport airplane takes thousands of hours. Problems are found and usually fixed. SS1 and SS2 have just a few hours in powered spaceflight so far.

As soon as the accident happened it was almost inevitable that the answer to either Question 1 or Question 2 was “no”. The NTSB summary doesn’t actually tell us whether it was known that unlocking the booms too early would overstress the kit, but given Scaled Composites’ deserved reputation, as well as the strong hint from the NTSB that human factors were not sufficiently analysed, I would guess that it was known, so that the answer to Question 1 is yes; and the answer to Question 2 is partially no: the mitigation works unless the pilot makes an error under the “high workload” (performing many critical tasks under physical and cognitive stress) of transonic flight.

I emphatically don’t buy Macintyre’s suggestion that anyone “cut corners” on test pilot training and HazAn.

These are brand-new operations with which there is very little experience and (contrary to marketing) are inevitably performed at higher risk than operations with thousands or millions of hours of accumulated experience. Nobody, in particular no one at Scaled, messes around in such circumstances. Scaled has a well-deserved reputation over three decades for designing radically new aerial vehicles to enviably high standards of safety. But things do sometimes go wrong. Voyager scraped a wingtip on takeoff and nearly didn’t make it around the world (they had 48kg of fuel remaining when they landed again at Edwards after nine days of flight in December 1986, enough only for a couple of hours more). Three people were killed during a test of a rocket system in 2007 which was based on a nitrous oxide oxidiser, apparently a novel technology. OSHA investigated. An example of some public commentary is available from Knights Arrow. Scaled has been owned by Northrop Grumman since 2007 (before the rocket-fuel accident). And now a test pilot has lost his life and the craft by performing an action too early.

It may be more apt to note that, like many such analyses of complex systems with proprietary features, the HazAn for WK2/SS2 space operations is substantial intellectual property, whose value will increase thanks to the NTSB’s suggestions on how to improve it.

The purpose of the NTSB’s investigation is to look minutely at all the processes that enabled the accident and to suggest improvements that would increase the chances of a yes-yes pair of answers to the HazAn questions as well as all other aspects of safety. They said the human factors HazAn could be improved. Since human error was presumed to be the single point of failure, that conclusion was all but inevitable. The NTSB also suggested continuity in FAA oversight – the FAA flight-readiness investigation was carried out by different people for each flight so there was reduced organisational learning. It also made some other suggestions about how to improve the efficacy of oversight and of organisational learning, such as the mishap database. And the NTSB suggested proactive emergency readiness by ensuring a rescue craft is on active standby (it usually was, but this wasn’t the case for the accident flight).

One wonders what else in the HazAn isn’t quite right. There are plenty of places to look (witness the Knights Arrow report above on the fuel choice). It doesn’t mean the HazAn is bad. But it will be improved. And improved, all with the goal of getting to yes-yes.

Volvo Has An Accident

5 06 2015

……. but not the one you thought!

Jim Reisert reported in Risks 28.66 (Volvo horrible self-parking car accident) on a story in Fusion on 2015-05-26 about a video of an accident with a Volvo car, apparently performing a demo in the Dominican Republic. The story is by Kashmir Hill. Hill says “….[the video] is terrifying”. The video is linked/included in the piece.

The video shows a Volvo car in a wide garage-like area, slowly backing up, with people standing around, including in front of the vehicle. The car stops, begins to move forward in a straight line, accelerates, and hits people who did not attempt to move out of the way. Occupants are clearly visible in the car. The video is about half a minute long.

I didn’t find it terrifying at all. At first glance, I found it puzzling. Why didn’t people move out of the way? They had time.

Fusion reports comments from Volvo. I looked the story up using Google. Lots of articles, many of them derivative, and a reference to Andrew Pam’s corrective comment in Risks 28.67. From the better articles (in my judgement), one would crudely understand:

  • The car was being driven. What you see is not automatic.
  • It wasn’t a demo of self-parking. It was a purported demo of a collision-avoidance function.
  • The other-car collision-avoidance function is standard; the pedestrian-collision-avoidance function is an optional extra.
  • The demo car was not equipped with this optional function.

However, many of the articles still have “self-parking” in the headline or as part of the URL, and journalists asked why other-car collision-avoidance is standard, but pedestrian-collision-avoidance an optional extra. Surely, some journalists expect us to conclude, it would be more reasonable the other way around?

What Volvo actually said in response to journalists’ queries seems to be reasonable (see below). But they appear not to be controlling the narrative, and that is their accident. The narrative appears to be that they have a self-parking car which may instead accelerate into passers-by unless it is equipped with a $3,000 extra system to avoid doing so. And this is demonstrated on video. And this narrative is highly misleading.

Other-car/truck detection and avoidance is nowadays relatively straightforward. These objects are big and solid, have lots of metal and smooth plastic which reflects all kinds of electromagnetic and sound waves, and they behave in relatively physically-limited ways. People, on the other hand, are soft and largely non-metallic, with wave-absorbent outer gear, and indulge in, ahem, random walks. It’s a harder detection problem, and it is thereby much harder to do it reliably – you need absolutely no false negatives, and false positives are going to annoy driver and occupants. Such kit inevitably costs something.

But there is a laudable aspect to this commentary. Some, even many, journalists apparently think that pedestrian-collision avoidance systems should be standard, and are more important than other-car collision avoidance. I wish everybody thought like that!

Ten years ago, almost nobody did. I recall an invited talk by a senior staff member of a major car company at the SAFECOMP conference in Potsdam in 2004, about their collision-avoidance/detection/automatic-communication-and-negotiation systems and research. 45 minutes about how they were dealing with other vehicles. I asked what they were doing about pedestrians and bicycles. A 5-second reply: they were working on that too.

Pedestrians are what the OECD calls “vulnerable road users”. While accident rates and severities have been decreasing overall for some years, accident rates and severities for vulnerable road users have not – indeed, in some places they have been increasing. Here is a report from 17 years ago. The Apollo program, which is joint between the WHO and the EU, has a policy briefing ten years later (2008).

I am mostly a “vulnerable road user”. I have no car. My personal road transport is a pedelec. Otherwise it’s bus or taxi. Bicycle and pedelec road users need constantly to be aware of other road users travelling too fast for the conditions and posted speed limits, too close to you, and about to cut you off when you have right of way. As well as occasional deliberately aggressive drivers. All of which is annoying when you’re sitting inside another well-padded and designedly-collapsible shell, but means serious injury or death if you’re not.

I am all for people thinking that vulnerable-road-user detection and avoidance systems should be standard equipment on automotive road vehicles.

There are similar reports to that in Fusion also in:

as well as elsewhere. I like Caldwell’s Slashgear article far more than the others.

Andrew Del-Colle deals out a lengthy corrective in both Road & Track and in Popular Mechanics.

Three Volvo spokespeople are quoted in these articles: Johan Larsson (Fusion, and derivatively The Independent), Stefan Elfstroem (Slashgear and Money) and Russell Datz (Daily Mail). Volvo’s comment is approximately:

  • The car was equipped with a system called “City Safe” which maintains distance from other cars.
  • City Safe also offers a pedestrian-detection system, which requires additional equipment and costs extra money
  • The car was not equipped with this additional system
  • The car appears to be performing a demo. It is being driven.
  • The demo appears to be that of City Safe, not of the self-parking function.
  • The car was apparently being driven in such a way that neither of these systems was operational: the human driver accelerates “heavily” forwards.
  • When an active driver accelerates forwards like this, the detection-and-braking functions are not active – they are “overridden” by the driver command to accelerate
  • Volvo recommends never to perform such tests on real humans

All very sensible.
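The override behaviour in Volvo's comment can be pictured as a simple arbitration rule. The following is a minimal sketch only: the function name, the accelerator scale from 0.0 to 1.0 and the threshold value of 0.8 are all assumptions made for illustration, and nothing here is taken from Volvo's actual City Safe logic.

```python
def autobrake_active(obstacle_detected: bool,
                     accelerator_fraction: float,
                     override_threshold: float = 0.8) -> bool:
    """Decide whether a (hypothetical) detection-and-braking function intervenes.

    accelerator_fraction: pedal position, 0.0 (released) to 1.0 (floored).
    override_threshold: assumed pedal position at or above which the
        driver's command to accelerate overrides the protective function.
    """
    if not 0.0 <= accelerator_fraction <= 1.0:
        raise ValueError("accelerator_fraction must lie between 0.0 and 1.0")
    # A "heavy" acceleration command from an active driver takes priority.
    driver_overrides = accelerator_fraction >= override_threshold
    return obstacle_detected and not driver_overrides


# Gentle driving with an obstacle ahead: the function would intervene.
print(autobrake_active(True, 0.2))
# Heavy acceleration with an obstacle ahead: the driver's command wins.
print(autobrake_active(True, 0.95))
```

On this picture, fitting the optional pedestrian-detection equipment would change only whether an obstacle is detected. It would not change the outcome when the driver accelerates heavily.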

One major problem which car manufacturers are going to have is that, with more and more protective systems on cars, there are going to be more and more people “trying them out” like this. Or following what John Adams calls “risk homeostasis”, in driving less carefully while relying on the protective functions to avoid damage to themselves and others. I am also sure all the manufacturers are quite aware of this.

Cybersecurity Vulnerabilities in Commercial Aviation

18 04 2015

The US Government Accountability Office has published a report into the US Federal Aviation Administration’s possible vulnerabilities to cyberattack. One of my respected colleagues, John Knight, was interviewed for it. (While I’m at it, let me recommend highly John’s inexpensive textbook Fundamentals of Dependable Computing for Software Engineers. It has been very well thought through and there is a lot of material which students will not find elsewhere.)

None of what’s in the report surprises me. There are three main points (in the executive summary).

First, the GAO suggests the FAA devise a threat model for its ground-based ATC/ATM systems. (And, I presume, that the FAA respond to the threat model it devises.) I am one of those people who consider it self-evident that threat models need to be formulated for all sorts of critical infrastructure. One of the first questions I ask concerning security is “what’s the threat model?”. If the answer is “there isn’t one” then can anybody be surprised that this is first on the list?

Lots of FAA ground-based systems aren’t geared to deal with cybersecurity threats – many of them are twenty or more years old and cybersecurity wasn’t an issue in the same way it is coming to be. Many systems communicate over their own dedicated networks, so that would involve a more or less standard physical-access threat model. But many of them don’t. Many critical inter-center communications are carried over public telephone lines and are therefore vulnerable to attacks through the public networks, say on the switches. Remember when an AT&T 4ESS switch went down in New York almost a quarter century ago? I can’t remember if it was that outage or another one during which the ATCOs called each other on their private mobiles to keep things working. A human attacker trying to do a DoS on communications would probably try to take out mobile communications also. (So there’s the first threat for the budding threat model – a DoS on communications)

If the FAA don’t want to do a model themselves, couldn’t they just get one from a European ally and adapt it? The infrastructures aren’t that dissimilar on the high level and anything would be a help initially.

Second, when the FAA decided they were OK with the manufacturer putting avionics and passenger in-flight entertainment (IFE) data on the same databuses on the Boeing 787, many of us thought this premature and unwise and said so privately to colleagues (one of them even found the correspondence). We have recently had people claim to be able to access critical systems through the IFE (see below). I have reported on one previous credible claim on vulnerabilities in avionics equipment.

The GAO is suggesting that such configurations be thought through a little more thoroughly. The basic point remains: isn’t it abundantly clear that the very best way to ensure as much non-interference as possible is physical separation? Who on earth was thinking a decade ago that non-interference wouldn’t be that much of an issue? Certainly not me.

Third, the other matters the GAO addressed are organisational, which is important of course for the organisation but of little technical interest.

Concerning accessing critical avionics systems through the IFE, Fox News reports that Cyber security researcher Chris Roberts was pulled off a US commercial flight and interrogated by the FBI for a number of hours.

A colleague commented that “they are going after the messenger.” But let’s look at this a little more carefully.

Chris Roberts is CTO and founder of One World Labs in Denver. Staff at One World consist of a CEO who is a lawyer, a CFO and a VP of sales and marketing, and two technical employees, one of whom is Roberts, who is the company founder. The board appears to be well-balanced, with a former telecommunications-industry executive and a military SIGINT expert amongst others.

One World claims to have the “world’s largest index of dark content”, something called OWL Vision, to which they apparently sell access. One wonders how they manage to compile and sustain such a resource with only two technical people in the company, but, you know, kudos to them if it’s true.

According to the first line of his CV, Roberts is “Regarded as one of the world’s foremost experts on counter threat intelligence within the cyber security industry”. His CV consists of engagements as speaker, and press interviews – there is nothing which one might regard as traditional CV content (his One World colleagues provide more traditional info: degrees, previous work experience and so on). His notable CV achievements for 2015 are a couple of interviews with Fox.

Apparently he told Fox News in March, quoted in the article above, “We can still take planes out of the sky thanks to the flaws in the in-flight entertainment systems. Quite simply put, we can theorize on how to turn the engines off at 35,000 feet and not have any of those damn flashing lights go off in the cockpit…… If you don’t have people like me researching and blowing the whistle on system vulnerabilities, we will find out the hard way what those vulnerabilities are when an attack happens.”

Read that first sentence again. He can take planes out of the sky due to flaws in the IFE, he says. Does it surprise anybody that the FBI or Homeland Security would want to find out exactly what he means? Maybe before he gets on a flight, taking some computer equipment with him? It is surely the task of security services to ensure he is not a threat in any way. If you were a passenger on that airplane, wouldn’t you like at least to know that he is not suicidal/paranoid/psychotic? In fact, wouldn’t you rather he got on the plane with a nice book to read and sent his kit ahead, separately, by courier?

It has been no secret for fourteen years that if you are going to make public claims about your capabilities you can expect security agencies nowadays to take them at face value. Would we want it otherwise?

Let us also not ignore the business dynamics. You have read here about a small Denver company, its products and claimed capabilities. I am probably not the only commentator. All at the cost to a company employee of four hours’ interrogation and the temporary loss of one laptop. And without actually having to publish their work and have people like me analyse it.

Germanwings 9525 and a potential conflict of rights

11 04 2015

Work continues on the investigation into the crash of Germanwings Flight 9525. I note happily that news media are reverting to what I regard as more appropriate phraseology. Our local newspaper had on Friday 27th March a two-word major headline “Deadly Intention”, without quotation marks, and the BBC and Economist were both reporting as though a First Officer (FO) intention to crash the plane was fact. Written media are now reverting to what most of us would consider the formally more accurate “suspected of” phraseology. (For example, see the German article below.)

Flight International / Flightglobal had as main editorial in the 31 March – 6 April edition a comment deploring the way matters concerning the Germanwings crash are being publicly aired.

I read Flight as suggesting the Marseille procureur was abrupt. Many of us thought so at the time. An article from this week’s Canard Enchaine shows that part of the French (anti-)establishment agrees with that assessment, but for different reasons, concerning some political manoeuvring.

But Flight gets the logic wrong. The procureur was not announcing his “conviction” that the FO was “guilty” of…. whatever; neither was the announcement “surreal” by virtue of the fact that the FO was dead.

  • The procureur was not announcing the degree of his belief. He was making an accusation, in the usual formal manner using the usual formal terminology;
  • He was not judging the FO as “guilty”; that’s neither his job nor his right and he is obviously clear about that. Only a court can pronounce guilt.
  • It is not surreal: as Flight should be aware, in France prosecutions are brought, and are sometimes successful, after accidents in which everyone on board died, viz. Air Inter and Concorde. There is a case to be made that people at the airline had overlooked medical information on the FO which (would have) rendered him formally unfit to fly. There is the further possibility that there existed medical information relevant to his fitness to command a commercial airliner which was not shared with the relevant parts of the airline and/or regulator.

There is also a procedural aspect to the formal announcement by the Marseille procureur on Thursday 26th March which the Flight editorial ignores. Everyone knows the importance of preserving and gathering evidence quickly, in this case evidence about the FO. Presumably everyone agrees that it is a good thing. In order to set that process in motion, there need to be formal legal actions undertaken. The crash event took place within the jurisdiction of Marseille. Formal proceedings therefore need to be opened in Marseille and German legal authorities informed and cooperating in those proceedings in order to gather and preserve evidence in Germany. Obviously this needs to be done ASAP, because who knows how other people with immediate access to such materials are going to react. The question is whether proceedings have to be opened at a florid press conference. In this case it might have been hard to avoid.

In its editorial, Flight suggests the BEA is in a more appropriate position to gather evidence than prosecutors, and that they should be allowed to get on with that job. The other industry stalwart, Aviation Week and Space Technology, also says in a recent editorial that “We find more objectivity in accident investigators’ reports than in prosecutors’ statements.” I disagree. State attorneys’ offices and police are far more experienced at securing the kind of evidence likely to be relevant to the main questions about this crash than are aircraft safety investigators.

It seems to be the case that medical information relevant to the FO’s fitness was not distributed appropriately. For example, information concerning a 2009 depressive episode. The airline knew about this episode, and subsequently flight doctors have regularly judged him fit to fly (he regularly obtained a Class 1 medical certificate according to the annual renewal schedule). However, in April 2013 Germany brought into law the EU regulation that the regulator (LBA) must be informed and also determine fitness when an applicant has exhibited certain medical conditions. The LBA has said that it wasn’t so informed of the 2009 episode. (Here is a German news article on that, short and factual. It also laudably uses the “suspected” terminology.) If so, that seems to be an operational error for which the FO was not at all responsible in any way.

It is exactly right that the Marseille procureur along with his German counterparts is looking at all that and it is also right that that was undertaken very quickly.

There is a wider question. The confidentiality of German medical information is all but sacrosanct. Its confidential status overrides many other possibly conflicting rights and responsibilities, and I understand this has been affirmed by the Constitutional Court. Pilots have an obligation to self-report, so medical confidentiality has not come into conflict with duty of care – yet. But what about a case when medical conditions indicating unfitness to fly are diagnosed, but the pilot-patient chooses not to self-report? The pilot flies for an airline; the airline has a duty of care. If something happens to a commercial flight which this pilot is conducting, which causes harm to the airline’s clients (passengers) and others (people and objects on the ground near a CFIT; relatives of passengers), then the airline has obviously not fulfilled its duty of care to those harmed: the pilot should not have been flying, but was. However, equally obviously, the airline was unable to fulfil its duty of care: it was deprived of pertinent knowledge.

Personality assessments are used by some employers in the US in evaluating employees. See, for example, the survey in the second, third and fourth paragraphs of Cari Adams, You’re Perfect for the Job: The Use and Validity of Pre-employment Personality Tests, Scholars journal 13; Summer 2009, along with the references cited in those paragraphs. It is not clear to me at this point whether it is legal in Germany to require potential employees to undergo such tests. (As I have indicated previously, I do think that some tests, such as MMPI, could identify extreme personality characteristics, which could be associated with future inappropriate behaviour when operating safety-critical systems, in some cases where these would not necessarily be picked up in the usual employee interviews.)

I suggest that this employee medical confidentiality/employer’s duty of care issue is a fundamental conflict of rights that won’t go away. It may be resolved but it cannot be solved. It may turn out that it is currently not so very well resolved in Germany. I would judge it a good thing if this one event opens a wider debate about the conflict.

Thoughts After 4U 9525 / GWI18G

4 04 2015

What is astonishing, maybe unique, about the Germanwings Flight 4U 9525 event is how quickly it seems to have been explanatorily resolved. Egyptair Flight 990 (1999) took the “usual time” with the NTSB until it was resolved, and at the end certain participants in the investigation were still maintaining that technical problems with elevator/stabiliser had not been ruled out. Silk Air Flight 185 (1997) also took the “usual time” and the official conclusion was: inconclusive. (In both cases people I trust said there is no serious room for doubt.) There are still various views on MH 370, and I have expressed mine. However, it appears that the 4U 9525/GWI18G event has developed a non-contentious causal explanation in 11 days. (I speak of course of a causal explanation of the crash, not of an explanation of the FO’s behaviour. That will take a lot longer and will likely be contentious.)

A colleague noted that a major issue with cockpit door security is how to authenticate, to differentiate who is acting inappropriately (for medical, mental or purposeful reasons) from who isn’t. He makes the analogy with avionics, in which voting systems are often used.

That is worth running with. I think there is an abstract point here about critical-decision authority. Whether technical or human, there are well-rehearsed reasons for distributing such authority, namely to avoid a single point of decision-failure. But, as is also well-rehearsed, using a distributed procedure means more chance of encountering an anomaly which needs resolving.

What about a term for it? How about distributed decision authority, DDA? DDA is used in voted-automatics, such as air data systems. It is also implicit in Crew Resource Management, CRM, a staple of crew behavior in Western airlines for a long time. Its apparent lack has been noted in some crew involved in some accidents, cf. the Guam accident in 1997 or the recent Asiana Airlines crash in San Francisco in 2013. It’s implicitly there in the US requirement for multiple crew members at all times in the cockpit, although here the term “DDA” strains somewhat – a cabin crew member has no “decision authority” taken literally but rather just a potentially constraining role.

There are also issues with DDA. For example, Airbus FBW planes flew for twenty years with air data DDA algorithms without notable problems: just five ADs. Then in the last seven years, starting in 2008, there have been over twenty ADs. A number of them modify procedures away from DDA. They say roughly: identify one system (presumably the “best”) and turn the others off (implicitly, fly with just that one deemed “best”). So DDA is not a principle without exceptions.
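To make the voting idea concrete, here is a minimal sketch of a mid-value voter over three redundant channels. It is an illustration only and is not the Airbus air data algorithm: the function name, the tolerance and the sample readings are all invented for the example.

```python
from statistics import median


def vote(readings, tolerance):
    """Mid-value select over redundant channels (illustrative only).

    Returns the voted value and the indices of channels that disagree
    with it by more than `tolerance`.
    """
    if len(readings) < 3:
        raise ValueError("voting needs at least three channels")
    voted = median(readings)
    outliers = [i for i, r in enumerate(readings) if abs(r - voted) > tolerance]
    return voted, outliers


# One channel fails: the two healthy channels outvote it.
print(vote([251.0, 250.0, 180.0], tolerance=5.0))

# Two channels fail in the same way: the voter rejects the healthy one.
print(vote([180.0, 181.0, 250.0], tolerance=5.0))
```

The second call shows the kind of anomaly that distributing authority can produce. A majority that is wrong in a consistent way outvotes the one correct channel, and this is the sort of situation in which a procedure that picks one system and turns the others off makes sense.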

A main question is what we need to do, if anything.

For example, consider the measures following 9/11. Did we need them and have they worked? Concerning need; I would say a cautious yes. (Although I note the inconvenience has led me to travel around Europe mainly by rail.) The world seems to contain more organisations with, to many of us, alien murderous ideologies. 9/11 was a series of low-technology, robust (multiple actors per incident) hijackings. Attempts have been made since to destroy airliners with moderate-technology and solitary actors (shoe bomber, underpants bomber, printer cartridge bombs) but these have all failed. They are not as robust; in each case, there was just one agent, and moderate-technology is nowhere near as reliable as low-technology: bombs are more complex than knives. One of them could have worked, but on one day in 2001 three out of four worked. It seems to me that, in general, we are controlling hijackings and hostile deliberate destruction moderately well.

After 4U 9525 do we need to do something concerning rogue flight crew? Hard to say. With the intense interest in the Germanwings First Officer’s background it seems to me likely that there will be a rethink of initial screening and on-the-job crew monitoring. Talking about the pure numbers, seven incidents in 35 years is surely very low incidence per flight hour, but then it’s not clear that statistics are any kind of guide in extremely rare cases of predominantly purposeful behavior. For example, how do we know there won’t be a flurry of copycat incidents? (I suspect this might be a reason why some European carriers so quickly instituted a “two crew in cockpit at all times” rule.)

What about classifying airlines by safety-reliability? A cursory look suggests this might not help much. Three, almost half, of murder-suicide events have been with carriers in Arnold Barnett’s high safety grade. Barnett has published statistical reviews of world airline safety from 1979 through recently (see his CV, on the page above, for a list of papers). His papers in 1979 and 1989 suggested that the world’s carriers divided into two general classes in terms of chances of dying per flight hour or per flight. Japan Air Lines, Silk Air (the “low-cost” subsidiary of Singapore Airlines) and Germanwings (the “low-cost” subsidiary of Lufthansa) are all in the higher class.

I consider it certain that DDA with flight crew will be discussed intensively, including cockpit-door technology. Also flight-crew screening and monitoring. What will come of the discussions I can’t guess at the moment.