One area that does not receive much emphasis in risk management is the human factor. Risk assessments usually focus on risk events, likelihoods, consequences, and vulnerabilities. People are viewed as ‘weak links’ in risk prevention, but what about risk mitigation? Your risk planning depends on people to respond when an event occurs. How good is their risk decision-making under stress? There lies the weak link.
In 1996, IEEE Press published “Probabilistic Risk Assessment and Management for Engineers and Scientists” by Kumamoto and Henley[1]. The book is about using probability to assess reliability and safety risks in an industrial environment, and it introduces some interesting concepts, such as risk perception and ‘Human Reliability’.
PERCEPTION OF RISK
One factor they emphasize in reliability and error rates is the perception of risk. Perception influences risk attitude. Regarding risk aversion, they state that people are ambivalent towards risk consequences: events that are scattered and occur over time are ignored, while a single event with a large impact creates a significant outcry and response[2]. The perception of risk is much greater when a major event occurs, such as a school shooting or the bombing of a crowd at a sporting event.
After a significant risk event occurs, people become sensitized to that risk. Their perception of risk is altered, which can lead to a mis-assessment: the frequency of occurrence, the severity of impact, and the likelihood that the risk will recur can all become overestimated. This often causes a loss of public confidence in the organization responsible for managing the risk. Think of some relatively recent events in the news: Toyota’s unintended-acceleration problem, the BP oil well disaster, the fertilizer plant explosion in Texas, the oversized truck that took down a bridge in Washington state. The public’s perception of these risks changed as a result.
HUMAN RELIABILITY
So how does Human Reliability fit into this?
The types of tasks involved can affect how you deal with risk. In their book, Kumamoto and Henley note that a routine task “is a sequence of unit activities such as selection, reading, interpretation, or manipulation. There are at least two types of human error: omission and commission. In a commission, a person performs an activity incorrectly. Omission of an activity, of course, is an omission error. An incorrectly timed sequence of activities is also a commission error.”[3]
They go on to describe how people affect the risk management process: omission or commission errors made in routine operational tests or maintenance, or in following procedures, can precipitate a risk event.
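To make the omission/commission distinction concrete, here is a minimal sketch in the spirit of classical human reliability analysis, where a routine task is modeled as a series of unit activities and each activity can fail by omission or commission. The activity names, the error probabilities, and the independence and series assumptions are all illustrative; they are not values or a method taken from the book.

```python
# Minimal human-reliability sketch: a routine task is a sequence of unit
# activities (selection, reading, interpretation, manipulation), each of
# which can fail by omission (skipped) or commission (done incorrectly).
# Assumes independent activities and illustrative, made-up probabilities.

from dataclasses import dataclass

@dataclass
class UnitActivity:
    name: str
    p_omission: float    # probability the activity is skipped
    p_commission: float  # probability the activity is performed incorrectly

    def p_error(self) -> float:
        # Either error mode fails the activity; assume the modes are independent.
        return 1.0 - (1.0 - self.p_omission) * (1.0 - self.p_commission)

def task_failure_probability(activities: list[UnitActivity]) -> float:
    """Series model: the task succeeds only if every activity succeeds."""
    p_success = 1.0
    for act in activities:
        p_success *= 1.0 - act.p_error()
    return 1.0 - p_success

# Hypothetical maintenance procedure with illustrative error rates.
procedure = [
    UnitActivity("select correct valve", 0.001, 0.003),
    UnitActivity("read gauge",           0.002, 0.005),
    UnitActivity("interpret reading",    0.001, 0.010),
    UnitActivity("close valve",          0.003, 0.002),
]

# Roughly 0.027 with these numbers: small per-step rates compound quickly.
print(f"P(task failure) = {task_failure_probability(procedure):.4f}")
```

Real HRA methods layer event trees and performance shaping factors on top of this kind of arithmetic; the point here is only that small per-step error rates compound across a procedure.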
This issue of human error is not restricted to industrial operations; it is also a large factor in information systems reliability. According to the IT Process Institute, misconfigured settings and changes account for 80% of unplanned outages, and the Enterprise Management Association reports that 60% of availability and performance errors are the result of misconfigurations. [5]
Gartner has projected that 80% of outages impacting mission-critical services will be caused by people and process issues, and that more than 50% of those outages will be caused by change/configuration/release integration and hand-off issues. [6]
RISK RESPONSE
During a risk event, human intervention may be required for several reasons: detecting/confirming that an event is occurring, diagnosing the situation, and determining the proper response.
Humans are notorious for screwing up. Detecting and diagnosing require, first, that someone is paying attention to what is going on and, second, that they have the presence of mind to analyze the situation and not panic. Stress level, experience, and skill are all factors, as is whether the required actions are based on skills, procedural rules, or tribal knowledge. Trying to remember or look up procedures and accessing tribal knowledge slow the response.
Kumamoto and Henley cite several failure modes[4]: the “garden path” approach, where a single response is pursued despite contrary evidence; the “vagabond” approach, where minimal analysis leads to one explanation after another being adopted as the solution; and the “hamlet” approach, where every possible explanation is analyzed, resulting in a slow response to the situation.
Further, human responses to a risk event can fail in several ways: a non-response (lack of detection or improper diagnosis), forgetting to execute a step in the response (an omission error), a misdiagnosis (where the wrong response makes the event worse), or executing the wrong action in the response (a commission error). Any of these may worsen the risk event instead of mitigating it.
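Those failure modes can be read as breaks in a simple chain: detect, diagnose, respond. As a back-of-the-envelope illustration (the stage probabilities below are invented for the example, not taken from any cited source), multiplying the stage probabilities shows how quickly stress-degraded performance erodes the chance of a successful mitigation:

```python
# Sketch of the response chain from the text: a mitigation succeeds only if
# the event is detected, correctly diagnosed, and the response is executed
# without omission or commission errors. All probabilities are illustrative
# assumptions; stress, skill, and procedure access would shift them in practice.

def p_successful_mitigation(p_detect: float,
                            p_diagnose: float,
                            p_execute: float) -> float:
    """Chain model: each stage is conditional on the previous one succeeding."""
    return p_detect * p_diagnose * p_execute

# A calm, well-drilled operator vs. a stressed operator relying on tribal knowledge.
calm     = p_successful_mitigation(p_detect=0.99, p_diagnose=0.95, p_execute=0.97)
stressed = p_successful_mitigation(p_detect=0.90, p_diagnose=0.70, p_execute=0.80)

print(f"calm operator:     {calm:.2%}")      # ~91%
print(f"stressed operator: {stressed:.2%}")  # ~50%
```

In this toy example, degrading each stage only modestly cuts the end-to-end chance of a successful mitigation from about 91% to about 50%, which is the quantitative argument for drills and written procedures.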
So when you make your risk management and mitigation plan, remember the rest of the story: look at the human risks both in normal operations and in how people might react when a risk event occurs.
References
[1] Hiromitsu Kumamoto and Ernest J. Henley, “Probabilistic Risk Assessment and Management for Engineers and Scientists”, 2nd ed., IEEE Press, 1996.
[2] Ibid., ch. 1, sec. 1.4.1.
[3] Ibid., ch. 10, sec. 10.5.1.
[4] Ibid., ch. 10, sec. 10.2.
[5] “Downtime, Outages and Failures – Understanding Their True Costs”, Evolven, http://www.evolven.com/blog/downtime-outages-and-failures-understanding-their-true-costs.html
[6] Ronni J. Colville and George Spafford, “Configuration Management for Virtual and Cloud Infrastructures”, Gartner, http://www.rbiassets.com/getfile.ashx/42112626510