An Ethical Statement on Incidents

27 06 2009

Donald Gotterbarn and Keith W. Miller wrote on a Software Engineering Code of Ethics in the June 2009 edition of IEEE Computer magazine. They illustrate the application of their principles with some case studies, including Case Study 2: Who Is In Control?

They consider first the October 2008 Qantas accident, concerning which an interim factual report is available from the ATSB. Gotterbarn and Miller say

[begin quote]

“ The software on this Airbus 330-303 implemented a decision to give instant control to the plane’s flight control system when the autopilot shut off because of computer system failures. The resulting nosedive suggests that this decision was not in the best interest of the public, especially members of the public in or below this airplane.

There are good reasons to have the flight control system protect the jet from dangerous conditions. But this incident illustrates that the decision to turn over control to the flight control system should take into account the current state of inputs into that system. The flight control system should have been more sensitive to the quality of its inputs and to the possibility of disastrous consequences for instantly reacting to apparent conditions that were based on erroneous inputs.”

[end quote]

This is philosophy without understanding. I am not even sure the authors know what a flight control system is.

They then go on to consider the infamous incident in Russia, during which a pilot let his kids into the cockpit and gave them a hand in flying the airplane. There was an upset and the airplane crashed. Gotterbarn and Miller say

[begin quote]

“After such a disaster, we would expect the developers of subsequent Airbus autopilot software to be particularly sensitive to issues of control transfer between pilots and autopilots.

In the Aeroflot crash, much of the publicity focused on the judgement of the pilot in inviting his children into the cockpit. While that appears to have been a contributing factor in the tragedy, the autopilot design was at least as significant.”

[end quote]

What an extraordinary comment! I think no more needs to be said about that than what I say, below, in my letter to the Editor-in-Chief of IEEE Computer:

[begin letter]

Dear Professor Carver,

Professors Gotterbarn and Miller (The Public is the Priority, IEEE Computer, June 2009:66-73) omit one important ethical principle favored by those of us who analyse incidents: refrain from making imposing public statements on technical matters about which you know little.

The authors illustrate well the reasons for this principle through their Case Study 2. They introduce the 2008 Qantas accident and suggest that a

“…decision to give …. control to the … flight control system when the autopilot shut off because of … system failures….. was not in the best interest of the public.”

Autoflight systems have been doing exactly this since they were invented over half a century ago, and no pilot or engineer I know would have it otherwise.

The ATSB preliminary analysis hints rather at an obscure bug with the Flight Control Primary Computer, as well as a yet-undiagnosed fault in one of the air data subsystems. Let us hope that our colleagues at the companies concerned are able to discover what and how and devise remedies.

One can indeed hold moral views arising from this and other incidents, such as that critical software and interfaces need to be rigorously proven free from every possible source of error, but most software engineers would agree that best practice is still some way from that ideal, and back when flight control systems were cables and pulleys we were not close to it either.

Concerning the Aeroflot upset, I feel strongly that children should not be placed at the controls of commercial passenger jets in flight, and that it is silly to suggest that the system design should accommodate such an eventuality.

Sincerely,

Peter Bernard Ladkin

[end letter]



Formal Methods in Modern Critical-Software Development

22 06 2009

with Martyn Thomas, co-author.

[A couple of weeks ago, Martyn Thomas and I were contacted by a journalist for the German weekly Der Spiegel. He asked me a question which I found hard to answer for non-specialists: what are "formal methods?" Here is the answer which Martyn and I supplied.]

There has long been a view held by prominent computer scientists, such as Turing-Award winner Professor Sir Tony Hoare, that software, a computer program running on a physical computer, is predominantly mathematics. This view opens up the possibility of proving mathematically that a program is free from errors.

For many years, this view has contrasted with the need of computer-programming companies to employ increasing numbers of people with an aptitude for writing “high-level” computer programs to get their large systems written and sold. This traditional method can be summarised as “write-what-you-think-will-work, test, and fix”. Its use often results in systems which exhibit unresolvable problems and are deployed very, very much later than planned, if at all – some estimates suggest that over 50% of software which involves more than a few million lines of “high-level” computer source code is never put into service. That seems to many software engineers to be a terrible waste of human effort. The traditional method has no way of showing that a program is free from errors; indeed, it is expected that errors will be found, and then fixed, until no more errors show up. But it was estimated by work at IBM two decades ago, which looked at a specific highly-used complex system, that about a third of the errors that manifest themselves would only show up less frequently than once every billion hours or so. That means you see them once and they don’t turn up again – but they are nevertheless a third of all the errors you see, so even if you carefully eliminate all errors you see, including the more frequent ones, that third stubbornly remains.

General estimates of software quality suggest that one can expect about ten errors per one thousand lines of “high-level” computer code. For safety-critical systems, industry-insider estimates suggest a level of between one and three errors per one thousand lines of code, which is much better, but it is still not good enough for systems whose software could possibly induce dangerous behavior, because once a computer encounters and follows a software error, there are no good ways to predict how that computer will behave. This radical unpredictability following a fault is one of the major ways in which software-induced behavior differs from that of purely physical systems, which generally behave in more predictable ways when they fail.

The mathematics of statistical testing show that one can improve confidence that a program really does what is wanted – and, more importantly for safety, doesn’t do what is not wanted – with fault rates up to about one in one hundred thousand hours of operation. But that would be too high a fault rate for many critical systems – consider that the flight control system of the Airbus A320 aircraft, for example, has flown some 50-60 million operational hours, without any accident attributable to a fault in the control system. Given a desired ultra-low fault rate, say of that order, a testing regime to increase confidence that such a fault rate has been attained would have to test for at least the entire planned life of the system. That means, in practical terms, that the system’s operation and deployment would just be one big test – and what happens if it fails? We need to have the required confidence that a system is of a certain quality before it is deployed. The mathematics says we can’t get that through testing if we want ultradependable systems, as most safety-critical systems are expected to be.
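
The scale of the testing problem can be sketched with a short calculation. This is a minimal illustration, not a full statistical treatment: it assumes failures arrive at a constant rate (an exponential model), that the test runs without a single failure, and that test conditions match operational use. The function name and the 99% confidence level are my own choices for the example.

```python
import math

def failure_free_hours_needed(target_rate_per_hour, confidence):
    """Hours of failure-free testing needed to support a failure-rate claim.

    Assumption: constant failure rate, so the chance of surviving t hours
    without failure at rate r is exp(-r * t). To reject "the true rate is
    at least r" with the given confidence we need exp(-r * t) <= 1 - confidence.
    """
    if target_rate_per_hour <= 0:
        raise ValueError("target rate must be positive")
    if not 0 < confidence < 1:
        raise ValueError("confidence must lie strictly between 0 and 1")
    return -math.log(1.0 - confidence) / target_rate_per_hour

HOURS_PER_YEAR = 24 * 365

# One failure in one hundred thousand hours, the rate mentioned in the text.
modest = failure_free_hours_needed(1e-5, 0.99)
# One failure in fifty million hours, the order of the A320 fleet experience.
ultra = failure_free_hours_needed(1 / 50e6, 0.99)

print(f"1e-5 per hour:  {modest:,.0f} hours (~{modest / HOURS_PER_YEAR:,.0f} years on one test rig)")
print(f"2e-8 per hour:  {ultra:,.0f} hours (~{ultra / HOURS_PER_YEAR:,.0f} years on one test rig)")
```

Under these assumptions the first claim needs roughly 460,000 failure-free hours, which is demanding but conceivable with many test rigs in parallel. The second needs several times the whole accumulated operating experience quoted above, which is the sense in which deployment itself becomes the test.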

People who accept the view that computer programs are mathematics have been working on that mathematics, and developing tools to aid with it. This effort has been going on seriously, in avionics for example, for over 30 years – ever since the Californian consulting company SRI International was awarded the first contract by NASA to prove using mathematical logic that the operating system for the first digital-computer flight control system, SIFT, did exactly and only what its careful technical specification said it should do. That effort failed, but the lessons learned have informed progress in the mathematical view of computer software for three decades. Two decades ago, the traditionalist head of the U.S. Department of Defence Advanced Research Projects Agency’s Information Science and Technology Office – the same organisation responsible for the idea and technologies of the Internet – was reputed to have said, of the mathematical approach to software, “formal methods don’t work”. In 1994, a bug was found in Intel’s Pentium processor, which affected the correctness of certain division operations. The analysis that followed, the most significant of it pro bono work by academics and engineers, resulted in Intel replacing the chips, at a reputed cost of some $400 million or more. The company decided they needed better quality assurance in their designs, and started a formal methods group. Today, no complex chip may be effectively designed without using such methods.

The logic of chips is simpler than that of general software, and so it is easier to develop effective formal methods for chip design than it is for general software. However, the approach is the same. We write down in mathematics what we want the program or the chip to do – its requirements specification, as we call it – and if we design the program using design languages that are also a form of mathematics, then we have the possibility to prove that the final program does exactly what was required, no more and no less, and if we make any mistakes the mathematics will quickly point them out. Formal methods consists largely of the science and use of these mathematical languages for specifying, designing and analysing computer programs, and the various methods for proving that a program is free from various kinds of errors, or, with sufficient effort, that it is free from all. Nowadays, companies that use formal methods are able to produce very much higher quality software – software with far fewer, even no, errors in it – than is produced by the majority of companies. Data maintained by companies in the forefront of the use of formal methods show that they can supply code to quality levels of better than one error per twenty-five thousand lines of code, some twenty-five to seventy-five times better quality than by using traditional methods. And, crucially, formal methods can provide assurance that a certain quality of code has been achieved. Traditional methods cannot provide such assurance.

Consider, for example, the 100,000 lines of program code which controls the engines on the Boeing 777 aircraft that crashed at London Heathrow airport in January 2008, when the engines did not respond to commands to increase thrust shortly before the aircraft was due to land on the runway. If that code had been developed using careful, traditional methods of critical-software development, it seems we could expect about 100 errors in it. And what of the millions of lines of code in the flight control systems of a modern Boeing or Airbus computer-controlled aircraft? Would we be content flying with the thousands of errors that we might expect from traditional software code development? It is no wonder that these companies are steadily increasing use of formal methods in their systems development.
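
The arithmetic behind these figures is simple enough to write out. The sketch below uses only the densities quoted in this note (one to three errors per thousand lines for careful traditional development, one error per twenty-five thousand lines for the leading formal-methods users); the function and constant names are mine, and the densities are industry estimates, not measurements of any particular aircraft's code.

```python
def expected_errors(lines_of_code, errors_per_thousand_lines):
    """Expected residual errors for a program of the given size."""
    return lines_of_code / 1000 * errors_per_thousand_lines

ENGINE_CONTROL_LINES = 100_000           # size quoted in the text
TRADITIONAL_CRITICAL = (1.0, 3.0)        # errors per 1000 lines, low and high estimate
FORMAL_METHODS = 1000 / 25_000           # one error per 25,000 lines = 0.04 per 1000

low, high = (expected_errors(ENGINE_CONTROL_LINES, d) for d in TRADITIONAL_CRITICAL)
formal = expected_errors(ENGINE_CONTROL_LINES, FORMAL_METHODS)

print(f"traditional critical-software methods: {low:.0f} to {high:.0f} expected errors")
print(f"formal methods:                        {formal:.0f} expected errors")
print(f"improvement factor: {TRADITIONAL_CRITICAL[0] / FORMAL_METHODS:.0f}x "
      f"to {TRADITIONAL_CRITICAL[1] / FORMAL_METHODS:.0f}x")
```

The "about 100 errors" in the text corresponds to the optimistic end of the traditional range; the pessimistic end gives about 300, against about 4 at the formal-methods density.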

Indeed, nobody we know is content with levels of error of one in a thousand lines in critical software, not even those who develop it using traditional methods. If we want to improve the quality of such software, and to be assured of that quality, to the point at which there are no errors, then there is no alternative to using formal methods. These days, formal methods are well supported with computer-based tools that do much of the checking and proving automatically, which means that software written using formal methods can also be much cheaper to produce, as well as much higher quality, than software using traditional methods, because most of the cost of writing software using traditional methods comes from finding and fixing the mistakes that programmers make.



AF 447 ACARS: A Mistake with a Life of its Own

14 06 2009

Here is yet another indication of how things can get a life of their own:-

Soon after the France 2 program showing the ACARS transcript messages on 4 June, someone on the pilots’ forum PPRuNe typed them up and posted them to imageshack. Now they have apparently made it onto eurocockpit.com. The New York Times’s Matt Wald, a reliable commentator, commented on the ACARS messages yesterday, June 13, and there is a graphic on the NYT WWW site explaining the messages.

Wald said “its authenticity has been confirmed by industry officials”.

Except there is a typographical error, since corrected by the original transcriber. The ISIS message was a 3422 message, according to the original transcript (DG and Ind), but it was shown as a 3412 (OAT and Ind./Sensor) code message on the original image, and it is so shown now on the NYT WWW site.

This can be clearly seen in the screen grabs from the TV program from Danny Fyne, the PPRuNe originator:
http://www.pprune.org/rumours-news/376433-af447-2.html#post4975127
and in the higher-resolution screen grabs from contributor Machaca
http://www.pprune.org/rumours-news/376433-af447-2.html#post4975217

The list was typed up from the screen grabs by contributor selfin:
http://www.pprune.org/rumours-news/376433-af447-3.html#post4975386
The original version contained the erroneous 3412 transcription for 3422, and has since been edited and corrected by selfin, as noted on the post itself. Here is the message, from contributor Captain-Crunch, in which the typo was first noted (with the *original* images to show it):
http://www.pprune.org/rumours-news/376433-af447-4.html#post4975726



AF 447 ACARS Messages: Reading Tea Leaves

11 06 2009

A list of the 24 ACARS messages listed by Air France that were sent from AF 447 between 0210Z and 0214Z on 1 June, 2009, the last information received from the aircraft, was shown on the France 2 TV channel on Thursday June 4. This list, in which incomplete information was shown, was typed up and distributed on the Internet (one must beware of typographic errors in the various versions which I have seen). Thus people started to interpret the messages and inquire about their significance.

I take it that people know what “reading tea leaves” means? Fortune tellers would look at the pattern of leaves left in the cup after the tea had been drunk, and wonder what they say about the future. Similarly, people (including myself, here) have been looking at the (partial) ACARS messages shown on the TV, and have been wondering what they say about the past. I adduce the comparison to propose a healthy dose of scepticism about what one can validly conclude from the currently publicly-available information.

The messages were listed in the following order (omitting messages which consist of maintenance warnings). The four-digit numbers are the Joint Aircraft System/Component (JASC) code, which I interpret from the FAA JASC Table and Definitions Document from February 11, 2002, which is on-line.

* at 3.5 hours before the main events, a 3831 event. Something concerning waste disposal (38 is water and waste, and 3830 is the waste disposal system)

* at 0210, a 2210 event: AP off (22 is Auto Flight and 2210 is the Autopilot system)

* at 0210, a 2262 event (22 is Auto Flight; I have no code 2260)

* at 0210, a 2791 event, flight control switch to alternate law (27 is flight controls; I have no code 2790 or 2791)

* at 0210, two 2283 events, flags raised on CAP and FO Primary Flight Displays (PFD) (22 is Auto Flight, I have no code 2283)

* at 0210, a 2230 event, autothrust off (2230 is the auto throttle system)

* at 0210, a 3443 event, a TCAS problem (34 is navigation; 3443 is the Doppler system. The Doppler system here is used to measure relative motion of another body, in this case another aircraft, for TCAS).

* at 0210, two more 2283 PFD flags

* at 0210, a 2723 rudder travel limiter fault (27 is flight controls, 2720 is the rudder control system). At higher airspeeds, the rudder travel is limited by the Rudder Travel Limiter; far less movement is allowed than at lower airspeeds.

* at 0210, a 3411 event with EFCS 2, reported by EFCS1 (3411 is the pitot/static system. I understand that on these airplanes, the system is divided into the pitot subsystem and the static subsystem).

* at 0210, a 2793 event involving EFCS 1. (27 is flight controls. I understand from colleagues that, on the A330, 2793 is the Flight Control Primary Computer, FCPC, also designated PRIM)

* at 0211, a couple more 2283 PFD flags

* at 0212, a 3410 event. A disagreement between the air data units, the AD part of the ADIRU (34 is navigation; 3410 is flight environment data). An “ADR disagree” can only occur when one of the three ADIRUs has already been designated as faulty by the FCPC, and the two remaining ADIRUs yield discrepant readings (this information from the Aircraft Operating Manual of the A330)

* at 0212, a 3422 event in the standby flight instruments (ISIS) (34 is navigation, 3422 is directional gyro and indicators)

* at 0212, a 3412 event involving IR2, the inertial reference part of ADIRU2 (34 is navigation; 3412 is the outside air temperature sensor and indicator). Reported by IR1 and IR3 and EFCS1.

* at 0213, two 2790 (EFCS) events, FCPC 1 and Secondary FCC (FCSC) 1 faults (27 is flight control; I don’t have the 2790 designator)

* at 0213, a 2283 event, reported by FMGKC1 (22 is autoflight, I understand from colleagues that 2283 is the Flight Management and Guidance Computer, FMGC)

* at 0214 a 2131 event (21 is the air conditioning, 2131 is the cabin pressure controller).
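
The way I have been reading the codes in the list above can be sketched as a small lookup. The table contains only the entries I have quoted in this note, not the full FAA JASC table, and the fallback from a code such as 3831 to its parent 3830 is my own reading habit, not a rule stated in the FAA document. The function name is made up for the illustration.

```python
# Two-digit chapters and four-digit systems, limited to those quoted in this note.
JASC_CHAPTERS = {
    "21": "air conditioning",
    "22": "auto flight",
    "27": "flight controls",
    "34": "navigation",
    "38": "water and waste",
}
JASC_SYSTEMS = {
    "2131": "cabin pressure controller",
    "2210": "autopilot system",
    "2230": "auto throttle system",
    "2720": "rudder control system",
    "3410": "flight environment data",
    "3411": "pitot/static system",
    "3412": "outside air temperature sensor and indicator",
    "3422": "directional gyro and indicators",
    "3443": "Doppler system",
    "3830": "waste disposal system",
}

def decode_jasc(code):
    """Return (chapter, system) for a four-digit code; system is None if not in my table."""
    if len(code) != 4 or not code.isdigit():
        raise ValueError(f"expected a four-digit code, got {code!r}")
    chapter = JASC_CHAPTERS.get(code[:2])
    if code in JASC_SYSTEMS:
        return chapter, JASC_SYSTEMS[code]
    # Assumption: fall back to the parent entry ending in 0, e.g. 2723 -> 2720.
    parent = code[:3] + "0"
    if parent in JASC_SYSTEMS:
        return chapter, f"{JASC_SYSTEMS[parent]} (parent code {parent})"
    return chapter, None

for code in ["2210", "2723", "3422", "3412", "2283"]:
    print(code, decode_jasc(code))
```

A result of `None` for the system corresponds to the places above where I say "I have no code" for an entry. The lookup also shows how much hangs on a single digit: 3422 and 3412 decode to entirely different systems.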

What about the ordering of these messages? First of all, they are time-stamped by the minute, so that orders them into five groups (the 0210 messages, respectively 0211, 0212, 0213, 0214). What about a finer ordering? That is going to be much harder. We don’t know whether this listed order is the order in which the messages were received (but Air France can probably tell us that). We don’t know whether the order in which the messages were received was the order in which they were transmitted (but maybe there is something in the code that can tell us that). We don’t know whether the order in which they were transmitted is the order in which they were generated (maybe Airbus can say something about that, but there might also be some indeterminacy). And, finally, we don’t know whether the order in which they were generated is the order in which the events occurred (that may be hard even for the manufacturer to say, because the rates at which values are sampled are very different, depending on the system).
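
To give a feeling for how little the minute timestamps constrain the sequence, here is a rough count of the orderings they leave open. The per-minute tallies are my own count of the list above (reading "a couple" as two and leaving out the message from 3.5 hours earlier), and each message is treated as a distinct event, so identical PFD flags are counted separately. Both are simplifying assumptions.

```python
import math
from collections import Counter

def count_consistent_orderings(timestamps):
    """Number of total orderings compatible with a coarse timestamp on each event.

    Events with different timestamps are ordered by them; events sharing a
    timestamp may have occurred in any order, giving n! possibilities per group.
    """
    per_group = Counter(timestamps)
    return math.prod(math.factorial(n) for n in per_group.values())

# My tally of the messages listed above, by minute.
LISTED_TIMESTAMPS = (
    ["0210"] * 12 + ["0211"] * 2 + ["0212"] * 3 + ["0213"] * 3 + ["0214"] * 1
)

print(f"{count_consistent_orderings(LISTED_TIMESTAMPS):,} orderings fit the timestamps")
```

On these assumptions there are some 34 billion candidate sequences, almost all of the freedom coming from the twelve messages stamped 0210. Any story about what caused what within that minute is choosing one of them.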

For the purposes of a speculative interpretation, let me assume here that the events occurred in the order listed above. I do caution that this is quite a significant, and not necessarily correct, assumption. Let me further assume that the messages are veridical. For example, that the “ADR disagree” message really does indicate that the FCPC has ignored air data input from one ADIRU and is judging that the air data input from the other two are not consistent with each other. How significant this assumption is depends on whether one is a sceptic or an optimist about the reliability of these highly complex programmable-electronic systems and one’s trust in their design.
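
The "ADR disagree" condition as I have described it (one of three sources already rejected, the remaining two discrepant) can be illustrated with a toy example. This is emphatically not the Airbus monitoring logic, which I do not know in detail: the median-based rejection rule, the function name and the tolerance value are all invented for the illustration.

```python
from statistics import median

def classify_air_data(readings, tolerance):
    """Toy three-source monitor. Returns (status, index_of_rejected_source or None).

    Hypothetical rule: reject the one source furthest from the median if it is
    out of tolerance, then compare the two that remain with each other.
    """
    if len(readings) != 3:
        raise ValueError("this sketch handles exactly three sources")
    mid = median(readings)
    deviations = [abs(value - mid) for value in readings]
    worst = max(range(3), key=lambda i: deviations[i])
    if deviations[worst] <= tolerance:
        return "all agree", None
    remaining = [value for i, value in enumerate(readings) if i != worst]
    if abs(remaining[0] - remaining[1]) > tolerance:
        # One source already rejected and the other two discrepant.
        return "ADR disagree", worst
    return "one source rejected", worst

TOLERANCE_KT = 10  # hypothetical tolerance, in knots of airspeed

print(classify_air_data([270, 272, 271], TOLERANCE_KT))  # three consistent sources
print(classify_air_data([270, 272, 150], TOLERANCE_KT))  # one outlier, two agree
print(classify_air_data([270, 200, 150], TOLERANCE_KT))  # no two sources agree
```

The point of the sketch is only that such a message, taken as veridical, says something quite specific: no two of the three sources are telling the same story, so the system has no majority to follow.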

So here goes. The AP went off and flight control went to alternate law. Flags pop up. Autothrust disconnects, something with TCAS and then two more flags. Rudder travel limiter has a problem and then something with the pitot-static system that the EFCS’s have problems with. Sometime over a minute later we are told that the air data from one ADIRU has been designated unreliable by the FCPC and the air data from the other two disagree. Then the laser ring gyro in the ISIS complains, as do the primary and secondary flight computers (these systems are duplicated: it is the number 1 units of each that are complaining), something happens with the FMGC, and then there is a cabin pressure warning.

Why might AP go off and flight control go to alternate law? The possibilities are that (1) you’re being severely shaken around, (2) for some reason the AP couldn’t maintain altitude, or (3) there was a system problem. Then the autothrust (AT) goes off. That would happen if, for example, the auto flight systems cannot maintain stable air speed (AS) and altitude. I don’t know what the TCAS notification would signify. Then there is a rudder travel limiter fault. That device has AS as input, so maybe there is an issue with AS sensing. Then EFCS1 thinks EFCS2 has problems with pitot-static sensing. The pitot system colludes with the static system to measure AS, and the static system is also used to measure altitude. Then EFCS 1 complains about FCPC (I take it that would be FCPC 1, also known as PRIM 1). Then the two remaining of the three air data units disagree and can’t reconcile (we don’t know when the first was voted out by the FCPC 1). At a similar time, the DG in the stand-by flight instrument system complains. At a similar time, the inertial reference part of ADIRU 2 is faulted by the other two. Then unspecified faults with FCPC1 and FCSC 1, but it’s not clear which system component is reporting those faults. Then another flight control issue, and finally the cabin pressure controller squeaks.

There are some patterns here. One pattern is there is a lot of stuff involved with AS and altitude, and at least one with the outside-air-temperature sensors. The commonality here is the pitot and static systems and their interaction. Then later comes the DG in ISIS, followed by IR2 being voted out and then FCPC and FCSC faults and cabin pressure.

What could be up with the P-S systems? One possibility is that they are getting all iced up. That would be why AP and AT think they can’t maintain altitude. That might also explain the outside-air-temperature probe complaint, if it were being iced also. But manufacturers and regulators know about ice; it must have been extraordinarily severe to overwhelm the sensor heating systems.

Another possibility that some have mooted on the internet is that the aircraft was being blown around a lot in severe to extreme turbulence, but I don’t see how thereby one would get discrepant readings: rather, all probes would vary wildly, but coordinated, as individual gusts hit all three at more or less the same time. So I really don’t see that as a plausible reason for the P-S system issues.

The IR units are self-contained: they are calibrated sometime way back when and that’s it for the remainder of the flight. So when they start complaining, it is either a system fault or you are already out of control and moving them around more than they judge appropriate.

Severe icing alone overwhelming the sensor systems, though, does not by itself lead to an accident. The AC could be controlled with pitch and power, and the Aircraft Operating Manual explains exactly what pitch and what power setting in some detail, if one has an “ADR disagree” warning.

Severe turbulence, though, could cause a control problem if there are shears of more than 50-60 kts differential, because that is approximately the width of the speed band for that flight at its cleared flight level – this has been verified, using a conservative estimate of the aircraft’s weight at the time, by experienced A330 pilots (by “speed band”, I mean the difference between “maximum Mach operating” speed and stall speed). However, turbulence of that sort, while supposedly possible, is very, very unusual.

How do you get that severe icing overwhelming the PS systems? Temperature at that altitude is well below the freezing point for water, so clouds are generally formed from ice crystals. The properties of these are well known, and the air data systems and their certification are intended to cope with them, unless there is an entirely new phenomenon manifesting itself here. Ice crystals don’t show up on weather radar, so even with careful use of weather radar one might not fathom the presence of a storm whose water content is crystalline ice, no matter how violent that storm is.

The behavior of supercooled water droplets doesn’t seem to be as well understood. Water can become supercooled, even as low as -40°C (which would be a typical temperature for the flight level at which AF 447 was flying), especially in strong convective atmospheric currents. Water requires a certain amount of energy to crystallise, and if the air is cooling fast, adiabatically, that energy just might not be there. And if there is enough water, at -40°C, colliding with your sensors and freezing on impact, it may overwhelm the sensor heating and cause air data problems. However, supercooled drops are water and would show up on weather radar. One would expect a crew to avoid such an area being “painted” on their radar, especially in the Intertropical Convergence Zone (ITCZ) in which such storms are frequent, indeed expected. It is common for pilots to deviate many tens of miles from the planned track to avoid such storms, for avoiding the storm is the main priority, and use of the oceanic tracks is designed to accommodate such deviations.

So the severe-icing root-cause hypothesis is not puzzle-free.

What about some sudden, catastrophic structural-failure event such as the sudden in-flight break-up of TWA 800 in 1996? Any such hypothesis must accommodate the fact that parts of the electronics were muttering to themselves in a fairly orderly fashion, and transmitting those mutterings over a SATCOM link, for some four minutes. I don’t see how. (It is obvious that structural failure occurred – the aircraft’s vertical stabiliser has been found separated – but, one would conclude, later in the accident sequence.)

That is enough tea-leaf reading for one note. We might hope that the BEA will explain the exact meaning of the ACARS messages, and its conclusions about their true ordering, in the interim report which, by ICAO rules, must appear within 30 days of the accident (so, by 1 July 2009).

If anyone has more detail on the exact JASC codes used by the airline and (very important!) can demonstrate to me that that information is reliable, I would be very glad to hear from you.



The Crash of Air France flight 447 on 1 June 2009: introduction

10 06 2009

On the morning of June 1, 2009, Air France Flight 447 from Rio de Janeiro in Brazil to Paris failed to make any contact with Air Traffic Control after about 0200Z (“Zulu” time is UTC, so two hours behind Paris time). The aircraft had been flying in the region of a series of significant convective storms in the Intertropical Convergence Zone (ITCZ), southwest of the Cape Verde islands.

The airline said it had received 24 ACARS messages announcing various faults with the avionics, timestamped between 0210Z and 0214Z. This was the last known communication with the aircraft. ACARS is a digital data service, in which alphanumeric data is passed between airline personnel or electronics on the ground and the aircraft avionics and crew. These messages would have been transmitted by SATCOM, satellite communications, since the aircraft was presumed out of range of VHF radio communication at the time it was lost.

The ACARS messages between these two times were typical of failure and warning messages that would be displayed to the crew on the display used for that purpose (acronym is ECAM) and logged by the Central Maintenance Computer and available to maintenance personnel and Flight Operations Quality Assurance (FOQA) personnel after the flight has landed. There was another ACARS message, presumed to have been initiated by the pilots, of some significant turbulence timestamped about ten minutes before this sequence. It is common for pilots to send reports on actual weather conditions (called PIREPS) so that, amongst other things, following flights can be informed of those conditions.

There was no other information about AF447 at all, until wreckage and some bodies were found a week after the disappearance.

The French state television channel France 2 ran a program on Thursday 4 June, in which it displayed a print-out of the ACARS messages in cryptic terminology and explained to viewers to some degree what people thought they meant. The program was posted on the TV station WWW site for a while.

This event started a flurry of interpretation on, for example, the professional pilots’ forum PPRuNe. By the weekend, Sunday 7th June, there seemed to be consensus amongst aviation experts that, if the ACARS messages were veridical and their interpretation correct, there had been some issues with the air data systems, specifically the pitot systems which measure ram-air pressure and compare it with the static air pressure measured by the static systems, in order to determine the “indicated air speed”, IAS, which is displayed to the pilots (IAS is not the same as true air speed, for it is dependent upon the density of the air through which one is flying, and it is lower than true air speed when the density is low. It is, however, that indication of speed which is directly correlated with the aerodynamics of the airplane. For example, the speed at which the airplane will stall in level flight in a particular configuration is constant when expressed as IAS, although varying with altitude when expressed as true air speed). It was suggested that maybe the aircraft had encountered icing conditions which had overwhelmed the pitot heating and iced up the pitot tubes, and that maybe this had happened while the aircraft was in severe turbulence. This could have happened had the aircraft flown into a powerful convective storm cell, and indeed there were such cells, towering up to 50,000 ft, penetrating into the stratosphere. Had the crew indeed lost reliable air speed indications, which the ACARS messages hinted at, then they would have been trying to fly the aircraft on “pitch and power”: the engine thrust is set to a given level, and the pilot flying tries to keep the nose of the aircraft pointed at a particular angle to the horizon, in this case 5° up. It can be very hard to maintain such an aircraft attitude, especially in severe turbulence, and it was supposed that the crew had finally lost control of the aircraft.
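
The relation between indicated and true air speed mentioned above can be made concrete with a simplified calculation. It assumes the International Standard Atmosphere below 11 km, and it ignores compressibility and instrument error, so strictly it relates equivalent air speed to true air speed; at airliner cruise speeds the neglected corrections are not small. The example speed and altitude are arbitrary illustrations, not AF 447's actual values, and the function names are mine.

```python
import math

# International Standard Atmosphere constants (troposphere, up to 11,000 m).
SEA_LEVEL_TEMP_K = 288.15
SEA_LEVEL_PRESSURE_PA = 101325.0
LAPSE_RATE_K_PER_M = 0.0065
GAS_CONSTANT_AIR = 287.05287   # J/(kg K)
GRAVITY = 9.80665              # m/s^2

def isa_density(altitude_m):
    """Air density in kg/m^3 in the standard atmosphere, valid from 0 to 11,000 m."""
    if not 0 <= altitude_m <= 11000:
        raise ValueError("this sketch covers the troposphere only (0 to 11,000 m)")
    temp = SEA_LEVEL_TEMP_K - LAPSE_RATE_K_PER_M * altitude_m
    exponent = GRAVITY / (LAPSE_RATE_K_PER_M * GAS_CONSTANT_AIR)
    pressure = SEA_LEVEL_PRESSURE_PA * (temp / SEA_LEVEL_TEMP_K) ** exponent
    return pressure / (GAS_CONSTANT_AIR * temp)

def true_air_speed(equivalent_air_speed, altitude_m):
    """True air speed for a given equivalent air speed (same units in and out)."""
    return equivalent_air_speed * math.sqrt(isa_density(0) / isa_density(altitude_m))

# A fixed indicated-type speed corresponds to a rising true air speed with altitude.
for altitude in (0, 5000, 10000):
    print(f"{altitude:>6} m: 200 kt -> {true_air_speed(200, altitude):.0f} kt true")
```

The same needle position thus corresponds to a much higher speed through the air at cruise altitude than near the ground, which is why a stall speed that is constant as an indicated figure varies with altitude when expressed as true air speed.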

A news conference had been convened on Saturday 6 June, at which I believe the French Minister of Transport, the director of the French air accident investigation agency BEA, and a spokesperson for Air France were all present, and the information they gave substantiated this interpretation. The BEA director, M. Arslanian, did indicate that without the flight data recorder and cockpit voice recorders (FDR, CVR, the so-called “black boxes” although they are not black but, rather, dayglo orange with stripes) he was pessimistic about establishing facts definitively.

By Sunday, 7th June, there seemed to be a remarkable consensus on what was likely to have happened. The French government, BEA, Air France, and numerous professional pilots on pilot forums, all seemed to agree. And all without a shred of physical evidence. Indeed, inferring what had happened from just 24 electronic messages, whose partial interpretation had been made quasi-public. And all this within a week. Usually, even in accidents in which there is a plethora of information, such as the crash-landing of British Airways Flight 38 just short of the runway at London Heathrow airport in January 2008, arguments and discussions and differing views abound for weeks and months and sometimes years about what happened and why. Indeed, it has been nearly one and a half years since BA038 flopped onto the grass and it is still not known why. It is not unusual to wait two to four years for an accident report to be finalised.

In contrast, there seem only to be two big questions remaining about AF447 after a week: is this consensus interpretation anywhere close to the truth, and how could we possibly tell?