#4 – QUANTIFYING SOFTWARE FAILURE AND DISASTERS – (C) CAPERS JONES – SOFTWARE@RIS

Let’s visit twenty-one interesting historical software failures.  The idea is to analyze each failure and consider what lessons it taught, and which forms of defect prevention or defect removal might have prevented the problems.  Because the failures in this section are famous and information has been published about them, they are a useful set of historical data points for retrospective quality analysis.

Among the many forms of defect prevention and removal methods are the following in alphabetical order:

  1. Acceptance testing
  2. Automated code static analysis for common languages
  3. Automated text static analysis for requirements and design
  4. Beta testing with clients
  5. Code inspections
  6. Component testing
  7. Debugging tools
  8. Design inspections
  9. Function testing
  10. Mathematical test case design based on design of experiments
  11. Pair programming
  12. Peer reviews
  13. Performance testing
  14. Proofs of correctness
  15. Quality function deployment (QFD)
  16. Regression testing
  17. Requirements inspections
  18. Requirements modeling
  19. Risk-based testing
  20. Security testing
  21. Subroutine testing
  22. Supply-chain testing
  23. System testing
  24. Unit testing
  25. Usability testing

It is an interesting phenomenon that all of the problems in this section had gone through several kinds of testing, yet the problems still occurred. A synergistic combination of pre-test inspections, pre-test static analysis, formal mathematical testing, and risk-based testing with certified test personnel could probably have eliminated almost all of the failures discussed here.

Note that the failures and problems discussed here are only the tip of the iceberg.  There are thousands of similar problems and they occur almost every day.  Some forms of failure appear to be increasing in number and perhaps in severity.  For example, automotive recalls due to software problems are now frequent for every major manufacturer.  There are also recalls for many other kinds of equipment with computer controls.

1962: Failure of the Mariner 1 Navigation Software

The Mariner 1 probe for Venus went off course 293 seconds after liftoff.  The apparent reason was that a superscript bar (overbar) was missing from one line of code, which caused excessive deviations in control patterns.

Lessons learned:  The primary lesson from this failure is that a single character in a single line of code can cause serious problems with software.

Problem avoidance:  The problem might have been found via pair programming, code inspections, requirements modeling, or by static analysis. Neither requirements modeling nor static analysis existed in 1962 but in today’s world either method would almost certainly have found such an obvious syntactical error.

Finding the problem via testing should have occurred but obviously did not.  A test sequence that included control responses to inputs should have done the job.

1978:  Hartford Coliseum Collapse

The Hartford Coliseum was designed using a CAD software package.  The designer assumed only vertical stress on the support columns.  When one column collapsed from the weight of snow, lateral forces were applied to surrounding columns which had not been designed to take lateral stress.

Lessons learned:  The lesson from this failure is more about the human mind than about software per se.  The assumption of pure vertical compression was faulty, and that was a human error.

Problem avoidance:  This problem could have been found via inspections and probably by requirements modeling.  The problem is unlikely to have been found via static analysis since it was a problem of logic and design and not of syntax.  Since the problem was a design problem pair programming might not have worked.   Ideally having someone on the inspection or modeling team with experience in structures designed for heavy snow might have broadened the assumptions.

Finding the problem by testing should have occurred but there is a caveat.  If the same designer with the faulty assumption wrote the test cases he or she would not have included tests for lateral stress.  A certified professional tester might have found this, but perhaps not.  Risk-based testing in today’s world might have found the problem.

1983:  Soviet Early-Warning System

In 1983 the Soviet early-warning system falsely identified five incoming missiles, which were assumed to have been launched by the United States.  Rules of engagement called for a reprisal launch of missiles against the United States, which could have led to World War III.

Fortunately the Soviet duty officer was an intelligent person, who reasoned that the U.S. would never attack with only five missiles, so it was a false alarm.  Apparently the early-warning system was confused by sunlight reflected from clouds.

Lessons learned:  The lesson learned from this problem is that complex problems with many facets are hard to embody in software without leaving something out.  A second lesson is that bugs in major military applications can have vast unintended consequences that could possibly cause the deaths of millions.

Problem avoidance:  This problem might have been found by inspections with experienced military personnel as part of the inspection team.   The problem might also have been found by requirements modeling.

It is unlikely that static analysis would have found the problem because it was one of logic and not a problem of syntax.  Pair programming probably would not have worked either because the problem originated in requirements and design.

Finding the problem via testing obviously did not occur and it is uncertain if testing was the best solution.  The problem seemed to be that there was insufficient attention paid to false positives.

1986:  Therac 25 Radiation Poisoning

Between 1985 and 1987 a number of patients treated with the Therac 25 radiation therapy device received doses much higher than prescribed:  some were 100 times larger.

There were two radiation levels with this machine: high power and low power.  Older machines by the same company had hardware interlocks that prevented the high power mode from being turned on by accident.  In the Therac 25 the hardware interlocks had been removed and replaced by software interlocks, which failed to operate under some conditions.
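
Published analyses of the Therac 25 (notably the Leveson and Turner study) attribute the interlock failures partly to race conditions between operator editing tasks and the beam control task.  The sketch below illustrates only that general class of mistake, a check-then-act race on shared state, using invented names rather than anything from the device’s actual code; a hardware interlock, by contrast, physically blocks the unsafe configuration no matter what the software believes.

    import threading
    import time

    class TherapyMachine:
        """Illustrative sketch only: a check-then-act race in a software interlock."""

        def __init__(self):
            self.high_power = False
            self.shield_in_place = True   # must be in place whenever high power is used

        def fast_reedit(self):
            # An operator edit switches modes while the hardware is still moving.
            self.high_power = True
            time.sleep(0.2)               # turntable still rotating
            self.shield_in_place = False  # machine state catches up late

        def fire(self):
            # Software interlock: check the state, then act a moment later.
            if self.high_power and not self.shield_in_place:
                raise RuntimeError("interlock: unsafe configuration")
            time.sleep(0.3)               # gap between the check and the action
            print("beam fired:", "HIGH" if self.high_power else "low",
                  "shield in place:", self.shield_in_place)

    machine = TherapyMachine()
    threading.Thread(target=machine.fast_reedit).start()
    machine.fire()   # the check can pass before the edit finishes, so the beam
                     # may fire at high power with the shield already withdrawn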

Worse, apparently the operating console did not inform operators when high power was in use.  There was an error message and the machine stopped, but it only said “malfunction” and did not state what the problem was.  Operators could then push a button to continue administering the radiation.

Because of serious injury to patients the Therac 25 problems were extensively studied by several government agencies.  Readers who want a more complete discussion should do a Google search on “Therac 25” to get detailed analyses.

Lessons learned:  The lessons learned from this problem are that medical devices that can kill or harm patients need state of the art quality control.  The Therac 25 development effort apparently had inept quality control, and the government regulatory agencies did not provide adequate oversight.

Problem avoidance:  The Therac 25 problems could probably have been found by any combination of inspections, static analysis, and risk-based testing.  Later investigations by government agencies found laxness in all forms of quality control.  Apparently there were no formal inspections, no static analysis, no risk analysis, and testing far less rigorous than needed.  Pair programming would not have worked because the problem spanned the physical operating console and inadequate training of personnel as well as software problems.

1987:  Wall Street Crash

On Monday, October 19, 1987 the Dow Jones average dropped 508 points, the greatest one-day percentage loss in its history.

The causes are somewhat murky, but apparently the long-running bull market had been shaken by various SEC investigations and other events that eroded confidence.  As live human investors began to sell stocks, programmed trading software that followed patterns generated so many sell orders that various stock trading systems crashed and millions of shares were put up for sale, which deepened the panic.

Lessons learned:  The key lesson from this problem is that in today’s world software controls so many critical financial and government operations that bugs or errors can have vast consequences and cause problems almost instantly.

Problem avoidance:  This problem might have been found by thoughtful inspections that included limits analysis.  Requirements modeling might also have found the problem.

Static analysis probably would not have found the problem because it was a problem of logic and trends rather than a syntactic issue.

Testing might have found the problem but did not.  Some of the modern forms of testing such as risk-based testing might have found this problem.

1990:  AT&T Telephone Lines Shut Down

In 1990 a widespread shutdown of AT&T telephone lines lasted for about nine hours and caused major disruption of telephone traffic.  Many airline reservations could not be processed, and millions of calls, including some emergency calls, could not be completed.

What seems to have happened is that one of AT&T’s 114 switching centers had a minor mechanical problem (not software) and shut down briefly.  When this center came back up, it sent a software-generated message to all of the other centers, which caused all of them to shut down.  Apparently a bug in a single line of code caused the cascading shutdown.

Lessons learned:  The lesson from this failure is that large interconnected systems governed by software are intrinsically risky and need elaborate buffers and error-correction protocols.

Problem avoidance:  This error could have been caught either by code inspections or by static analysis tools.  The error appears to be one of syntax rather than one of logic.  Requirements modeling would not have found the error, but might have led to more robust error-checking protocols.  Whether pair programming would have helped is uncertain.

It is an interesting question why this problem was not found by testing, as it should have been.  Messages between switching centers are an obvious topic for testing.  Risk-based testing with certified professional testers might have found the problem.

1991:  Patriot Missile Target Error

In spite of many successes during the first Gulf War, in 1991 a Patriot missile failed to stop an inbound Scud, which struck a U.S. base, killing 28 military personnel and injuring about 100.

The Patriot’s navigation and targeting routines apparently contained a rounding error that threw off the timing and caused the miss.
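
The published analysis of this incident (a U.S. GAO report) attributes the miss to an accumulated truncation error: the system counted time in tenths of a second and multiplied by a 24-bit binary approximation of 0.1, so a tiny per-tick error grew with uptime.  The calculation below follows that public account and is a sketch, not the flight code.

    import math

    # 0.1 has no exact binary representation; chopping it after 23 fractional
    # bits (per the published analysis) leaves a small error on every tick.
    approx_tenth = math.floor(0.1 * 2**23) / 2**23
    error_per_tick = 0.1 - approx_tenth          # roughly 9.5e-8 seconds

    uptime_hours = 100                           # reported continuous uptime
    ticks = uptime_hours * 3600 * 10             # one tick per tenth of a second
    clock_drift = ticks * error_per_tick         # about 0.34 seconds

    scud_speed_m_per_s = 1676                    # approximate Scud velocity
    tracking_error_m = clock_drift * scud_speed_m_per_s   # roughly 570 meters

    print(f"clock drift: {clock_drift:.2f} s -> tracking error: {tracking_error_m:.0f} m")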

Lessons learned:  The lesson learned from this problem is that every detail needs to be examined in mission-critical controls.

Problem avoidance:  This problem would certainly have been found by code inspections.  It might have been avoided by requirements modeling.  Pair programming might have found this problem too, unless the rounding error was introduced using borrowed or reusable code falsely assumed to be correct.

It is uncertain whether static analysis would have found this error, because it was not an error of syntax.

Testing should also have found this error, but did not.  Modern risk-based testing might have identified the problem.

1993: Intel Pentium Chip Division Problem

The Intel Pentium chip, released in 1993, was discovered after release to have a bug when dividing with floating point numbers.  The error was small, only a fraction of a percent, and it did not actually affect very many users.
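
A widely circulated arithmetic check for this flaw divides two specific integers whose quotient exercised the faulty divider lookup table.  On a correct floating-point unit the expression below is exactly zero; on flawed Pentiums it was reported to return 256.  This reproduces the public test, not Intel’s internal validation.

    # On a correct FPU, x - (x / y) * y is exactly 0 for these values.
    # Flawed Pentium chips were reported to compute the quotient as roughly
    # 1.33374 instead of 1.33382, so the expression came out near 256.
    x = 4195835.0
    y = 3145727.0
    print(x - (x / y) * y)    # 0.0 on correct hardware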

However the error was located in about 5,000,000 chips already installed and in daily use.  Intel’s first response was unwise:  they wanted users to prove that they needed better accuracy than the chip provided in order to get a replacement.  Needless to say this caused a public relations flap of serious magnitude.  Intel relented and provided new chips to anyone who asked for one, assuming the customers had purchased a computer with the erroneous chip.

Lessons learned:  There are two lessons from this problem.  One is obvious: be sure that all mathematical operations work as they should prior to release.

The second lesson is that when a vendor makes an error, don’t put the burden of proof on the consumer if you value your reputation and want to be considered an ethical company.

Problem avoidance:  This problem would have been found by inspections.  It is not likely to have been found by static analysis or requirements modeling.  Pair programming probably would not have been used, nor would it have found the problem.

Since the problem was not found by testing, why not?  Possibly a reason for this problem being missed by testing is combinatorial complexity.  A chip with as many circuits, transistors, and features as the Pentium might require close to an infinite number of test cases to find everything.

1993: Denver Airport Delays

As originally planned the new Denver Airport was supposed to have a state-of-the-art luggage handling system that would be almost fully computerized and directed by software.  What actually happened has become a classic story of a major software failure with huge costs and delays.

Overall the software and hardware problems with the luggage system delayed the opening of the airport by about 16 months and cost about $560,000,000.  This is one of the few software topics that became a feature article in Scientific American magazine.

A cut-down version of the original luggage handling design was finally made operational and ran for about five years.  However the costs exceeded the value, and eventually it was replaced by a conventional luggage system.

The problems associated with the Denver Airport luggage system are a litany of 10 common problems found on large software applications:  1) Optimistic cost and schedule estimates; 2) underestimating bugs and defects; 3) no formal risk analysis; 4) excessive and poorly handled requirements changes; 5) inadequate quality control lacking inspections and static analysis; 6) poorly designed test cases; 7) serious gaps in testing; 8) progress reports that concealed major problems; 9) failure to have any effective back-up plans in place; 10) failure to listen to expert advice.

The Denver Airport is one of the most widely studied software problems in history.  In spite of numerous articles and retroactive reports, it is interesting that similar problems surfaced with luggage handling at Heathrow Terminal 5.

Lessons learned:  The primary lesson learned from the Denver fiasco is that optimistic estimates, poor quality control, and poor change control will inevitably lead to schedule delays, cost overruns, possible termination of the project, and almost certain litigation.

Problem avoidance:  There were so many different forms of problems at Denver that no single method could have found them all.  A synergistic combination of requirements modeling,  pre-test requirements and design inspections, static analysis of text, static analysis of all code, formal testing based on formal test plans, and certified test personnel would probably have reduced the defects to tolerable levels.  Pair programming in such a complex architecture that involved both hardware and software would have probably added confusion with little value.

1996: Ariane 5 Rocket Explosion

On its maiden flight in 1996 the Ariane 5 rocket and four on-board satellites being carried for deployment were destroyed, at a cost of perhaps $500,000,000.

The apparent reason for the problem was an attempt to convert velocity data from a 64-bit format to a 16-bit format.  There was not enough space and an overflow condition occurred, thus shutting down navigation.  The flight lasted just over 36 seconds.
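
A minimal sketch of the conversion that failed, using Python’s struct module and an invented variable name: a 64-bit floating-point quantity is forced into a signed 16-bit field that cannot hold it.  The guard shown afterward is the kind of range check (or saturation) that would have prevented the overflow; it is illustrative only, not the Ariane code.

    import struct

    horizontal_velocity = 50000.0    # hypothetical value outside the signed 16-bit range

    # Unguarded conversion: forcing the value into a 16-bit slot fails.
    # Here Python raises; in the flight software the overflow raised an
    # exception that shut down navigation.
    try:
        struct.pack(">h", int(horizontal_velocity))
    except struct.error as err:
        print("conversion overflow:", err)

    # Guarded conversion: check or clamp the value before narrowing it.
    INT16_MIN, INT16_MAX = -32768, 32767
    clamped = max(INT16_MIN, min(INT16_MAX, int(horizontal_velocity)))
    packed = struct.pack(">h", clamped)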

Lessons learned:  The lesson from this problem is that all mathematical operations in navigation systems need to be verified before actually launching a vehicle.

Problem avoidance:  This problem would certainly have been found by code inspections, perhaps in just a few minutes.  It might also have been found by static analysis and also by requirements modeling or pair programming.

The problem was obviously not found by testing, as it should have been.  In this case, whoever wrote the test scripts and test cases probably assumed that overflow could not occur.

1999:  Mars Climate Orbiter Crash

After successfully journeying for 286 days from Earth to Mars, the climate orbiter fired its rockets to shift into an orbit around Mars.  The algorithms for these adjustments had been based on Imperial units (pounds of force) rather than on the metric units (newtons) specified in the NASA requirements.  This error caused the orbiter to drop about 100 kilometers lower than planned, so it encountered atmospheric stresses that caused overheating and system shutdowns and led to the loss of the spacecraft.
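
In miniature, the failure was a unit mismatch at a software interface: one side produced thruster impulse in pound-force seconds while the other expected newton-seconds, a factor of about 4.45.  The function names below are invented for illustration; they are not from the mission software.

    LBF_S_TO_N_S = 4.448222          # 1 pound-force second = 4.448222 newton-seconds

    def ground_software_impulse():
        """Hypothetical ground-software output, in pound-force seconds."""
        return 100.0

    def navigation_update(impulse_n_s):
        """Hypothetical on-board consumer, which expects newton-seconds."""
        print(f"impulse applied in navigation model: {impulse_n_s:.1f} N*s")

    # The failure mode: the raw value is passed through, low by a factor of ~4.45,
    # so the modeled trajectory drifts away from the real one.
    navigation_update(ground_software_impulse())

    # The intended behavior: convert units explicitly at the interface.
    navigation_update(ground_software_impulse() * LBF_S_TO_N_S)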

Lessons learned:  The key lesson here is that requirements need to be checked and understood to be sure they find their way into the code.

Problem avoidance:  If inspections were used they would have found the problem almost instantly.  Both requirements modeling and static analysis might also have found this problem.  Pair programming might have, but if the error occurred upstream in design then pair programming might not have found it.

The reason why this problem was not found by testing may be nothing more than carelessness.  Attempting to transmit data from a subroutine using Imperial units to another subroutine using metric units is about as obvious a problem as is likely to occur.

1999:  Failure of the British Passport System

In 1999 the U.K. attempted to deploy a new automated passport system that had not been fully tested when it went operational.  The staff using the new system had not been fully trained.  Adding to the confusion was a new law which required all travelers under age 16 to have passports.  This law caused a huge bubble in new passport applications at the same time that the new system was deployed.

Roughly half a million passports were delayed, sometimes for weeks.  This threw off travel plans for many families.  In addition, the U.K. passport agency faced millions of pounds of extra costs in the form of overtime and additional personnel, plus some liability payments to travelers whose passports were late.

Lessons learned:  The obvious lesson from this problem is never to go live with a major new system without fully training the staff and running the new system in parallel with the older system to be sure the new system works.  Also, when a new law is passed that adds a huge bubble of new clients, be sure you have the staffing and equipment to handle the situation.

Problem avoidance:  The problems with the passport system appear to be a combination of performance issues with the software and logistical problems with the passport agency itself.  Putting in a new system without training the personnel in how to use it is a major management error, not a technical problem.

Neither inspections nor static analysis nor requirements modeling would have found the logistical and staffing problems, although no doubt participants in the inspections would have warned management to be careful.

Performance and load testing should have found the performance problems with the new system, but apparently were either not performed or not performed with realistic work loads.

2000: The Y2K Problem

The famous Y2K problem is a classic example of short-sightedness.  When computer hardware was expensive and memory space limited, it seemed like a good idea to store dates in two-digit format rather than four-digit format.  Thus the year “1999” would be stored as “99.”  This compression of dates started in the 1950s and caused no problems for many years.

However since dates are often sorted in ascending or descending order, a serious problem would occur at the turn of the century.  Obviously the year “2000” in two-digit form of “00” is a lower mathematical value than “99”.
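
The sorting problem can be shown in a couple of lines of Python: with two-digit years, the year 2000 compares as earlier than 1999.

    # Two-digit years: "00" (meaning 2000) sorts before "99" (meaning 1999).
    records = ["99-12-31", "00-01-01"]            # YY-MM-DD
    print(sorted(records))                        # ['00-01-01', '99-12-31'] -- wrong order

    # Four-digit years restore the intended chronological ordering.
    records = ["1999-12-31", "2000-01-01"]
    print(sorted(records))                        # ['1999-12-31', '2000-01-01']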

Millions of software applications in every country used the two-digit date format, and sometimes used it in new applications as late as 1995 when it was clearly obvious that time was running out.

Starting in about 1995 thousands of programmers began the labor-intensive work of converting two-digit dates into four-digit dates.  Fortunately the web and the internet were in full swing, which allowed Y2K personnel to share information and even reusable code for affected applications.  The fact that Y2K problems were not as severe as anticipated is due to the communications power of the web.

Note that Y2K was not a pure programming problem.  The two-digit date fields started as an explicit customer requirement, often in the face of warnings from software engineers that the dates would cause trouble.

Lessons learned:  The lessons from this problem are not yet fully understood even in 2012.  For example, in the year 2038 the 32-bit Unix time counter will overflow, and this will trigger another set of mass updates.  Fairly soon digits will be added to telephone numbers.  At some point digits will be added to social security numbers.  Field-length problems are endemic in software, and always seem to escape notice until just before they actually happen.

Problem avoidance:  The Y2K problem could have been found by almost any method including inspections, static analysis, pair programming, and testing except that two-digit dates were considered to be valid. For more than 30 years the two-digit dates were not regarded as erroneous so nobody wrote test cases to find them.

Starting in about 1995 this situation changed and not only did testing begin to look for short dates, but a number of specialized Y2K tools were built to ferret them out in legacy applications.

Although Y2K itself is now behind us, the problem of not having enough spaces for numeric information is one of the most common problems in the history of software.

2004: Shut Down of Los Angeles Airport (LAX) Air Traffic Controls

On Tuesday, September 14, 2004, near 5 pm, the air-traffic controllers at LAX lost voice contact with about 400 in-flight planes.  Radar screens also stopped working.  A total of about 800 flights were affected and had to be diverted.  Needless to say this was a very serious problem.  A back-up system failed about one minute after being activated.  The system was out of service for around three and a half hours.

The apparent cause of this problem was an internal counter that counts down from about 4 billion and then needs to be reset.  The counter was used to send messages to system components at fixed intervals.  It normally takes about 50 days for the counter to reach zero.  The counter was supposed to be reset every 30 days, but apparently that did not happen.  The servers in use were from Microsoft.  Apparently a scheduled reset was missed by an employee who was not fully trained.
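
The “about 50 days” follows directly from the arithmetic of a counter of roughly 4 billion decremented once per millisecond, which is the commonly reported explanation; the calculation below is a sketch of that account.

    milliseconds = 2 ** 32                              # about 4.29 billion
    days_to_zero = milliseconds / (1000 * 60 * 60 * 24)
    print(f"time to count down to zero: {days_to_zero:.1f} days")   # ~49.7 days

    # With a manual reset scheduled every 30 days, one missed reset leaves
    # roughly 20 days before the counter reaches zero and messaging stops.
    print(f"margin after a missed reset: {days_to_zero - 30:.1f} days")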

Lessons learned:  The obvious lesson is that complex systems which require human intervention to keep running will eventually fail.  Several kinds of automated resets could have been designed, or control could have been passed to backup servers with different reset intervals.

Problem avoidance:  This was a combination of human error and a questionable design in the servers that required manual resets.  Quality function deployment (QFD) might have prevented the problem.  Design inspections would certainly have found the problem.  Neither pair programming nor static analysis would have identified this because of the mix of human and software factors.

2005: Cancellation of the FBI Trilogy Project

In or about 2000 the FBI started a major effort to improve case files and allow sharing of information.  The project was called “Trilogy” and involved both hardware and software components.  In 2005 the project was terminated with losses estimated at perhaps $170,000,000.  One of the purposes was to move data from dozens of fragmented file systems into a unified Oracle database.  The problems with this FBI system have been widely cited in the literature.

Although not specified, the probable size of the full Trilogy application would have been in the 100,000 function point size range.  Failures and delays at this size level are endemic and approach 80%.

For big systems such as this requirements creep runs about 2% per calendar month during design and coding, and the total development schedules run about five years.  Total scope creep can approximate a 35% increase in required functions.

Defect potentials average close to 6.0 per function point, combined with cumulative defect removal efficiency levels below 85%.  Make no mistake:  these big systems are VERY risky and require state of the art methods to have any chance of success.
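
The scale of the quality challenge follows directly from those figures.  A short worked example using the text’s own numbers (100,000 function points, 6.0 defects per function point, 85% cumulative defect removal efficiency, 2% monthly requirements creep):

    function_points = 100_000
    defect_potential = 6.0 * function_points            # 600,000 total defects created
    removal_efficiency = 0.85
    delivered_defects = defect_potential * (1 - removal_efficiency)
    # With 85% removal, 90,000 defects remain at delivery; since removal was
    # below 85%, the real number would be even higher.
    print(f"delivered defects at release: {delivered_defects:,.0f}")

    # Requirements creep of 2% per month compounds quickly; roughly 15 months
    # of design and coding at that rate yields the ~35% growth cited above.
    print(f"scope growth after 15 months: {1.02 ** 15:.2f}x")        # ~1.35x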

Lessons learned:  This system’s failure is a textbook example of the problems of large monolithic software applications.  They have rapidly changing requirements and they need careful architecture and design prior to coding.  They also need a full suite of pre-test quality steps before testing even begins.

Problem avoidance:  Big systems such as this need formal architecture and design phases combined with a full suite of pre-test inspections of requirements, design, and code.  In fact these systems need most of the methods listed at the start of this section:  formal inspections, code and text static analysis, mathematical test case design and certified test personnel.  Pair programming would be cumbersome on applications with 500 or so programmers because of the expense and the training needs.  Inspections are a better choice.  Proofs of correctness are probably not possible due to the need for thousands of proofs.

2005: Sony Copy Protection Bug

In 2005 Sony BMG secretly placed copy protection software on 52 music CDs.  Customers who played those CDs on their computers had the protection software installed on their equipment without their knowledge or consent.

The copy protection software used a rootkit that interfered with Windows and created new security vulnerabilities on affected computers, which damaged many customers.  The copy protection also slowed down computers whether or not they were playing CDs.

When the problem was broadcast on the web, Sony BMG was sued by many indignant customers.  Worse, it turned out that Sony had violated the GNU license in creating the copy protection scheme.  At first Sony denied that any harm had been done, but that claim was quickly disproved.

Sony’s next response was to issue a pseudo-removal tool that made the problems worse and caused new problems.  Here too the power of the web broadcast the failures of the first attempt and caused more embarrassment to Sony BMG.

Eventually in November of 2005 Sony BMG issued a removal tool that seemed to work.  The offending CDs with the copy protection were recalled and taken off the market, although some copies were found still available in stores weeks after the nominal recall.

Lessons learned:  The lesson from this problem is that unless they are caught, some vendors think they can do anything they want to protect profits.

Problem avoidance:  Since Sony deliberately and secretly put the flawed copy protection software into the hands of the public, it was outside of the scope of normal inspections, static analysis, requirements modeling, and every other quality control approach.

The second part of this issue is the fact that the Sony copy software was buggy and damaged host computers.  This could have been found by formal design and code inspections.  Neither static analysis nor pair programming would have found the upstream design issues.

What finally eliminated this problem was a combination of skilled software engineers finding out about the problem and using the web to broadcast this information to millions of others.   Expertise on the part of sophisticated customers combined with the social pressure of the web was able to cause Sony to withdraw the offending copy protection scheme.

It is interesting that this problem was finally picked up by the Attorneys General of New York, Massachusetts, Texas, California, and some other states.  The Federal Trade Commission (FTC) was also involved and filed a complaint.

Finally as a result of class-action lawsuits Sony paid damages to affected customers.  This is not a good way for Sony to do business in a world with sophisticated clients with instant access to the web.

2006: Airbus A380 Wiring Problem

The Airbus A380 is a giant passenger plane designed to compete with Boeing 747s on long-distance routes.  The Airbus was delayed by more than a year due to software problems relating to the on-board wiring harness.

Modern aircraft including the A380 are highly computerized, and most controls and navigation are handled with software assistance.  As a result there are miles of electrical wires and thousands of connectors.  The A380 has about 550 kilometers (roughly 340 miles) of on-board wiring.

The CAD design software for the A380 was a commercial package.  The German and Spanish design teams were using version 4 of the CAD package while the British and French design teams were using version 5.  This caused configuration control problems.

Worse, the design teams had the CAD package set up for copper wires, but aluminum wires were used in the wiring harness for the wings.  The difference between aluminum and copper caused other problems because the diameters of the wires were not the same, nor was the elasticity.  It is harder to bend aluminum wires than copper wires.

Lessons learned:  The primary lesson learned from this problem is that multiple design teams in multiple countries should all be using the same versions of CAD packages and any other complex technical tools.  A second lesson is that when you use software to model physical equipment such as wire diameters and elasticity, be sure to have the software match the physical components.

Problem avoidance:  The differences between copper and aluminum wiring could easily have been found by design and code inspections prior to final approval of the design.  They might also have been found by requirements modeling.  This is not the kind of problem that code static analysis would have found, but perhaps a text static analysis tool could have identified it before serious harm was done.  The damage was done by using the wrong settings on a CAD tool, so pair programming would not have found the issue.

The use of testing for this problem was not really in the mix because the problem manifested itself in physical problems noted during construction of the plane.  Basically the aluminum wires were too big for some holes and too stiff to bend around obstructions.

2010: McAfee Anti-Virus Bug Shuts Down Computers

In 2010 the well-known McAfee anti-virus package had a new update.  A bug in this update caused the McAfee software to identify part of the Windows XP operating system as a malicious file, which had the effect of shutting down thousands of computers that were running XP at the time.

(This bug was front page news in the author’s state of Rhode Island because it caused the suspension of surgical procedures in a number of Rhode Island hospitals.  Schedules and contact information for physicians and nurses became unavailable when the computers stopped working.)

Lessons learned:  The lesson from this problem is to be sure that all releases of software are properly regression tested prior to release.

Problem avoidance:  The bug would certainly have been found by means of code inspections.  It might have slipped through static analysis tools because it was a logical error rather than a syntactic error.  Requirements modeling might have found the problem, but was not used.

Clearly testing should have found the problem but obviously did not.  The probable reason is informal test case design rather than rigorous mathematically-based test case design.
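
One hedged sketch of what a more rigorous release gate could look like: scan a corpus of known-good operating system files with the candidate definition update, and block the release if anything in that corpus is flagged.  The scanner interface and file list below are invented for illustration; they do not describe McAfee’s actual process.

    # Hypothetical pre-release regression gate for an anti-virus definition update.
    KNOWN_GOOD_FILES = [
        r"C:\Windows\System32\svchost.exe",
        r"C:\Windows\System32\winlogon.exe",
        # ... a much larger corpus drawn from every supported OS version ...
    ]

    def scan_with_new_definitions(path):
        """Placeholder for the vendor's scanner running the candidate definitions.
        Always returns "clean" here; a real gate would invoke the actual engine."""
        return "clean"

    def release_gate():
        false_positives = [f for f in KNOWN_GOOD_FILES
                           if scan_with_new_definitions(f) == "malicious"]
        if false_positives:
            raise RuntimeError(f"blocked release: flagged clean files {false_positives}")
        print("definition update passed the known-good regression corpus")

    release_gate()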

2012:  Bankruptcy of 38 Studios in Rhode Island

In 2010 the Economic Development Corporation (EDC) of the State of Rhode Island agreed to loan $75,000,000 to the 38 Studios game company owned by former Red Sox pitcher Curt Schilling.  In return the company moved to Providence and began operations with about 250 employees.

As is common with software applications in the $75,000,000 cost range, the main product of 38 Studios ran behind schedule.  As is also common with start-ups, the company itself ran low on funds and fell behind on its payments to the State.

In the absence of fresh capital from either film credits, external investors, the State, or other sources, the company missed payrolls, ran out of funds, laid off the entire staff, and then declared bankruptcy.

Looking at the history of what happened prior to the bankruptcy, there was no due diligence or risk analysis by the state prior to the loan.  Once the loan was given, there was no effective governance.  Both should have been done.  It is easy to generate a risk analysis for software packages that cost about $75,000,000 to develop.  That is a very risky region with many failures.

Based on industry defaults, projects with about 250 people and a development schedule of 42 months are almost always late and over budget.  There were no contingency plans for this.  A retrospective risk analysis by the author after the bankruptcy showed the following:

Risk of cancellation of the project          36.76%
Risk of negative return on investment        46.56%
Risk of schedule delays                      49.01%
Risk of cost overruns                        41.66%
Risk of unhappy customers                    56.37%
Risk of litigation                           17.15%
Average of all risks                         41.25%
Financial risks                              88.22%

These risks are so high that it was folly to invest more than $75,000,000 without seeing a very detailed risk abatement plan provided by 38 Studios.  Software start-ups of this size are among the riskiest ventures of the modern era.

Lessons learned:  The main lesson from the 38 Studios failure is that governments have no business trying to operate as venture capitalists in an industry where they have no experience or expertise.

Problem avoidance:  Large software applications with teams of 250 people routinely run late by 6 to 12 months.  These delays might have been reduced or minimized by better quality control up front, such as inspections and static analysis.  Over and above normal delays, this project had no effective back up or contingency plans on how to get additional funds once the initial loan ran out.

Normal venture investments are preceded by a careful due diligence process that examines risks and benefits.  Apparently Rhode Island ignored due diligence and risks and was blinded by potential benefits.

2012:  Knight Capital Stock Trading Software Problems

On Wednesday, August 1, 2012 a software bug in the Knight Capital stock trading software triggered a massive problem that involved 140 stocks fluctuating wildly.  One of the additional problems was that the stock trading software had no “off switch” and could not easily be shut down.

This is a cautionary tale about how software bugs can potentially damage national and global economies.  The problem was almost immediately recognized as a software bug, but in an industry where millions of dollars of stocks change hands every minute, it took more than 30 minutes to stop programmed trading with the software.  Apparently the fault was in a new update installed that same day, clearly without adequate testing or validation.

Rogue trading software or major bugs in trading software have the theoretical potential of damaging the entire world’s financial systems.  Knight Capital’s own stock declined by about 77% due to this software problem (serves them right).  There may also be future litigation from stock purchasers or companies who feel that they were damaged by the event.  The Securities and Exchange Commission (SEC) called for a meeting to examine the problem.

Lessons learned:  The major lesson from the Knight Capital software bug is that financial software in the United States needs much stronger governance than it gets today.  Financial applications should have, but do not have, the same kinds of certification that are required of medical applications by the FDA and avionics applications by the FAA.  In all three cases bugs or errors can cause enormous and totally unpredictable damages.  In the case of medical and avionics software, deaths can occur.  In the case of financial software, national or even global malfunctions of the economy might occur.

Problem avoidance:  In thinking about the Knight Capital software problems, formal inspections and requirements modeling are the two methods with the highest probability of finding the problems.  Static analysis would probably have missed it since the issue was a logical omission rather than a syntactic problem.  Pair programming might not have worked because the problem seems to originate upstream in requirements and design.

Deeper analysis is needed to find out why testing did not identify the problem, but the obvious reasons are casual test case design, lack of risk based testing, and probably testing by amateurs instead of certified test personnel.

2012:  Automotive Safety Recalls Due to Software

The original intent of this discussion was to show the specific software recalls for a single automobile line such as Toyota.  However the web has so many stories of software recalls involving so many automobiles that it is becoming an automotive industry scandal.  Software now controls fuel injection, brakes, automobile engines, navigation packages, and other systems.   Any or all of these software controlled features can malfunction.

Within the past few years numerous recalls for software bugs have occurred in automobiles by Cadillac, Ford, General Motors, Honda, Jaguar, Lexus, Nissan, Pontiac, Toyota and others.  Some of these involve the same components but others are unique.  Here are a few samples of very troubling automotive recalls in alphabetic order:

  • Buick recalled LaCrosse automobiles due to a software problem controlling the brakes.  A separate recall for the same model was due to software handling the climate control, which could affect visibility.
  • Cadillac recalled SRX automobiles due to a software problem with air bags.
  • Daimler recalled delivery trucks in 2011 due to a software problem that caused the outside turn and indicator lights to stop working after perhaps 10 minutes of operation.
  • Ford recalled several 2011 truck models due to software problems with an integrated diagnostic system (IDS) module.
  • Honda CR-Z hybrids were recalled in 2011 because the electric motor could reverse itself and turn in the opposite direction from the transmission.
  • Jaguar recalled some of their diesel models made between 2006 and 2010 because a software bug prevented cruise control from being turned off.  The engine had to be stopped to turn off the cruise control.
  • Honda’s four-cylinder Accords were recalled due to software problems controlling their automatic transmissions.
  • Nissan recalled some of the electric Leaf models due to software problems with air conditioning.
  • Toyota Prius models were recalled due to a software problem that caused the gas engines to stall.  The electric motor could be used to pull off the highway or go short distances.  This problem affected 2004 and 2005 Prius models.  In states with “lemon laws” some owners were entitled to replacement vehicles.
  • Toyota Prius and some Lexus hybrids were recalled in 2010 due to a software problem that caused a delay between pressing the brake pedal and the brakes actually working.
  • (Steve Wozniak, the Apple co-founder, owned a Prius and asserted that the dangerous acceleration problem was due to software rather than a mechanical problem.  Toyota disputed the claim, but Steve Wozniak probably knows more about software than most people.)
  • Volvo recalled a number of 2012 S60 sedans due to software problems with the fuel pumps.

Lessons learned:  Automobiles are now sophisticated devices with a number of on-board computers and many systems either directly controlled by software or assisted by software.  Therefore automobile manufacturers should adopt a full suite of modern defect prevention and defect removal steps.

Problem avoidance:  Because so many automotive features and controls are now affected by software, many software quality control methods are needed.  These include quality function deployment (QFD), pre-test requirements, design and code inspections, static analysis of text, and static analysis of code.  Testing should be formal with mathematically designed test cases, and performed by certified test personnel.

Over the past 10 years about 10,000,000 automobiles have been recalled due to software-related problems.  One warranty company reported that about 27% of repairs are related to computer and software malfunctions.  More analysis and better data across all automobile manufacturers are needed.

Summary and Conclusions about Software Problems

In the modern world computers and software are the critical operating components of aircraft, medical devices, stock trading, banking, business, and government.  Since software controls so many critical activities, it should be obvious that quality control is a key topic that needs to be fully understood and supported by state-of-the-art methods.

But in the problems shown here, and the thousands of similar problems that occur with other systems, quality control was primitive and inept.  The executives of the companies that produce bad software need to realize that quality problems are serious enough that litigation and damages could cause bankruptcy even for major corporations.

A simplistic reliance on testing and failure to perform pre-test inspections or use static analysis is not an adequate response and does not lead to effective quality control.  The industry needs to deploy a full suite of synergistic quality methods that include pre-test inspections, pre-test static analysis of text and code, and formal mathematically based testing by certified test personnel.  Anything less for mission-critical software applications can lead to the same kinds of problems discussed in this section.

 
