To Err is Human: The Complexity of Security Failures (Keynote)

A presentation at Hacktivity in October 2019 in Budapest, Hungary by Kelly Shortridge

Slide 1

TO ERR IS HUMAN: The Complexity of Security Failures
Kelly Shortridge (@swagitda_)
Hacktivity 2019

Slide 2

Hi, I’m Kelly @swagitda_

Slide 3

“To err is human; to forgive, divine.” – Alexander Pope

Slide 4

Humans make mistakes. It’s part of our nature (it’s mostly a feature, not a bug)

Slide 5

Infosec’s mistake: operating as if you can force humans to never err

Slide 6

This forces us into a futile war against nature. We cannot bend it to our will.

Slide 7

To build secure systems, we must work with nature, rather than against it.

Slide 8

Clearing the Err · Hindsight & Outcome Bias · Unhealthy Coping Mechanisms · Making Failure Epic

Slide 9

Clearing the Err

Slide 10

Error: an action that leads to failure or that deviates from expected behavior

Slide 11

Security failure: the breakdown in our security coping mechanisms

Slide 12

“Human error” involves subjective expectations, including in infosec

Slide 13

Understanding why incidents happened is essential, but blame doesn’t help

Slide 14

Aviation, manufacturing, & healthcare are already undergoing this revolution

Slide 15

Slips (unintended actions) occur far more often than mistakes (inappropriate intentions)

Slide 16

The term “human error” is less grounded in reality than we believe…

Slide 17

Hindsight & Outcome Bias

Slide 18

Cognitive biases are mental shortcuts that were evolutionarily advantageous

Slide 19

We learn from the past to progress, but our “lizard brain” can take things too far

Slide 20

Hindsight bias: the “I knew it all along” effect, aka the “curse of knowledge”

Slide 21

People overestimate how predictable events were before the outcome was known

Slide 22

e.g. skepticism of the N.K. attribution for the Sony Pictures leak; now it is “obvious”

Slide 23

Outcome bias: judging a decision based on its eventual outcome

Slide 24

Instead, evaluate decisions based on what was known at the time

Slide 25

All decisions involve some level of risk. Outcomes are largely based on chance.

Slide 26

We unfairly hold people accountable for events beyond their control

Slide 27

e.g. Capital One – did the breach really represent a failure in their strategy? (No.)

Slide 28

These biases change how we cope with failure…

Slide 29

Unhealthy Coping Mechanisms

Slide 30

Unhealthy coping mechanism #1: Blaming “human error”

Slide 31

Infosec’s fav hobbies: PICNIC & PEBKAC (“Problem In Chair, Not In Computer” and “Problem Exists Between Keyboard And Chair”)

Slide 32

This isn’t about removing accountability – malicious individuals certainly exist

Slide 33

Fundamental attribution error: your actions reflect innate traits; mine don’t

Slide 34

“You are inattentive, sloppy, & naïve for clicking a link. I was just super busy.”

Slide 35

An error represents the starting point for an investigation, not a conclusion

Slide 36

“Why did they click the link?” “Why did clicking a link lead to pwnage?”

Slide 37

These questions go unanswered if we accept the “human error” explanation

Slide 38

e.g. training devs to “care about security” completely misses the underlying issue

Slide 39

A “5 Whys” approach is a healthy start
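
As an illustrative aside (not part of the deck): a 5 Whys chain for the link-click example a few slides back might be recorded roughly as in the minimal Python sketch below. Every question and answer here is hypothetical, shown only to make the technique concrete.

# A minimal 5 Whys sketch; all content is invented for illustration.
whys = [
    ("Why did the employee click the link?",
     "The email mimicked a routine invoice during end-of-quarter crunch."),
    ("Why did clicking a link lead to compromise?",
     "The endpoint ran an unpatched, vulnerable plugin."),
    ("Why was the plugin unpatched?",
     "Patching required downtime the team was pressured to avoid."),
    ("Why was downtime so costly?",
     "There was no redundancy for that workflow."),
    ("Why was there no redundancy?",
     "Budget planning prioritized new features over resilience."),
]

def print_chain(chain):
    """Walk the chain so the review ends at systemic causes, not at a person."""
    for depth, (question, answer) in enumerate(chain, start=1):
        print(f"{depth}. {question}\n   -> {answer}")

print_chain(whys)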

Slide 40

Equifax’s ex-CEO blamed “human error” for the breach. He was wrong.

Slide 41

What about frictional workflows, legacy dependence, org pressures for uptime?

Slide 42

90% of breaches cite “human error” as the cause. That stat is basically useless.

Slide 43

Bad theory: if humans are removed from the equation, error can’t occur

Slide 44

Unhealthy coping mechanism #2: Behavioral control

Slide 45

“An approach aimed at the individual is the equivalent of swatting individual mosquitoes rather than draining the swamp to address the source of the problem.” – Henriksen, et al.

Slide 46

“Policy violation” is a sneaky way to still rely on “human error” as an answer

Slide 47

The cornucopia of security awareness hullabaloo is a direct result of this

Slide 48

Solely restricting human behavior will never improve security outcomes.

Slide 49

We focus on forcing humans to fit our ideal mold vs. re-designing our systems

Slide 50

Formal policies are rarely written by those in the flow of work being policed

Slide 51

Infosec is mostly at the “blunt” end of systems; operators are at the “sharp” end

Slide 52

People tend to blame whoever resides closest to the error

Slide 53

Operator actions “add a final garnish to a lethal brew whose ingredients have already been long in the cooking.” – James Reason

Slide 54

e.g. Equifax’s 48-hour patching policy that was very obviously not followed

Slide 55

Creating words on a piece of paper & expecting results is… ambitious

Slide 56

Discipline doesn’t actually fix the “policy violation” cause (but it does scapegoat)

Slide 57

Case study: SS&C & business email compromise (BEC)

Slide 58

Solely implementing controls to regulate human behavior doesn’t beget resilience

Slide 59

Post-WWII analysis: improved design of cockpit controls won out over more pilot training

Slide 60

Communicate expert guidance, but tether it to reality

Slide 61

Checklists can be valuable aids if they’re based on knowledge of real workflows

Slide 62

Policies must encourage safer contexts, not lord over behavior with an iron fist.

Slide 63

Unhealthy coping mechanism #3: The just-world hypothesis

Slide 64

Attempting to find the ultimate causal seed of failure helps us cope with fear

Slide 65

The just-world hypothesis: humans like believing the world is orderly & fair

Slide 66

The fact that the same things can lead to both success & failure isn’t a “just world”

Slide 67

Case Study: The Chernobyl disaster

Slide 68

Errors are really symptoms of pursuing goals while under resource constraints

Slide 69

How can security teams more productively deal with security failures?

Slide 70

Making Failure Epic

Slide 71

Infosec will progress when we ensure the easy way is the secure way

Slide 72

System perspective · Security UX · Chaos security engineering · Blameless culture

Slide 73

System perspective

Slide 74

Security failure is never the result of one factor, one vuln, or one dismissed alert

Slide 75

Security teams must expand their focus to look at relationships between components

Slide 76

A system is “a set of interdependent components interacting to achieve a common specified goal.”

Slide 77

“A narrow focus on operator actions, physical component failures, and technology may lead to ignoring some of the most important factors in terms of preventing future accidents” – Nancy Leveson

Slide 78

The way humans use tech involves economic & social factors, too

Slide 79

Economic factors: revenue & profit goals, compensation schemes, budgeting, etc.

Slide 80

Social factors: KPIs, expectations, what behavior is rewarded or punished, etc.

Slide 81

Pressure to do more work, faster, is a vulnerability. So is a political culture.

Slide 82

Non-software vulns don’t appear in our threat models, but they also erode resilience

Slide 83

We treat colleagues like Schrödinger’s attacker vs. dissecting org-level factors

Slide 84

Security is something a system does, not something a system has.

Slide 85

Think of it as helping our systems operate safely vs. “adding security”

Slide 86

Health & “security vanity” metrics don’t say whether systems are doing security

Slide 87

Number of vulns found matters less than their severity & how quickly they’re fixed
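
As an illustrative aside (not part of the deck): one way to act on this is to track remediation speed by severity rather than raw finding counts. A minimal Python sketch, with entirely hypothetical records standing in for whatever your vulnerability tracker exports:

from collections import defaultdict
from datetime import date
from statistics import median

# Hypothetical findings; in practice these would come from your own tracker.
findings = [
    {"id": "VULN-1", "severity": "critical", "found": date(2019, 9, 1), "fixed": date(2019, 9, 3)},
    {"id": "VULN-2", "severity": "high", "found": date(2019, 9, 2), "fixed": date(2019, 9, 20)},
    {"id": "VULN-3", "severity": "low", "found": date(2019, 9, 5), "fixed": date(2019, 10, 30)},
]

# Group the days-to-fix by severity, then report the median for each group.
days_to_fix = defaultdict(list)
for f in findings:
    days_to_fix[f["severity"]].append((f["fixed"] - f["found"]).days)

for severity, days in days_to_fix.items():
    print(f"{severity}: median {median(days)} days to remediate across {len(days)} finding(s)")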

Slide 88

Infosec should analyze the mismatch between self-perception & reality

Slide 89

Alternative analysis for defenders is basically just user research…

Slide 90

Security UX

Slide 91

The pressure to meet competing goals is a strong source of security failure

Slide 92

What drives your users’ promotion or firing? What are their performance goals?

Slide 93

Human attention is a finite & precious resource, so you must compete for it

Slide 94

User research can help you determine how to draw attention towards security

Slide 95

[image-only slide]

Slide 96

WARNING: CYBER ANOMALY (thanks Raytheon)

Slide 97

Choice architecture: organizing the context in which people make decisions

Slide 98

Place secure behavior on the path of least resistance by using defaults

Slide 99

e.g. Requiring 2FA to create an account, security tests in CI/CD pipelines
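
As an illustrative aside (not part of the deck): a default-deny CI step is one way to put secure behavior on the path of least resistance. A minimal Python sketch follows; pip-audit is a real dependency-audit tool, but treat the exact invocation and its use as the pipeline gate here as an assumption to adapt to your own setup.

import subprocess
import sys

def run_dependency_audit() -> int:
    """Run a dependency audit; a non-zero exit code fails the build by default."""
    result = subprocess.run(["pip-audit"], capture_output=True, text=True)
    print(result.stdout)
    if result.returncode != 0:
        print("Known-vulnerable dependencies found; failing the pipeline.", file=sys.stderr)
    return result.returncode

if __name__ == "__main__":
    sys.exit(run_dependency_audit())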

Slide 100

Slips require changes to the design of systems with which humans interact

Slide 101

Checklists, defaults, eliminating distractions, removing complexity…

Slide 102

Strong security design anticipates user workarounds & safely supports them

Slide 103

e.g. Self-service app approvals with a Slackbot to confirm the run request
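
As an illustrative aside (not part of the deck): the confirmation step could be as small as posting the run request to a security channel through a Slack incoming webhook. A minimal Python sketch; the webhook URL, names, and approval flow are hypothetical placeholders.

import json
import urllib.request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/EXAMPLE/EXAMPLE/EXAMPLE"  # placeholder

def request_app_approval(user: str, app: str) -> None:
    """Post the run request for confirmation instead of silently blocking the user."""
    message = {"text": f"{user} requests approval to run {app}. Please confirm or deny."}
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:  # raises on a non-2xx response
        print(f"Slack responded with HTTP {resp.status}")

request_app_approval("jane.doe", "new-profiling-tool")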

Slide 104

Think in terms of acceptable tradeoffs – create secure alternatives, not loopholes

Slide 105

How else can you better understand your systems & the context they create?

Slide 106

Chaos Security Engineering

Slide 107

We will never be able to eliminate the potential for error.

Slide 108

We must seek feedback on what creates success & failure in our systems

Slide 109

“Enhancing error tolerance, error detection, and error recovery together produce safety.” – Woods, et al.

Slide 110

Error tolerance: the ability to not get totally pwned when compromise occurs

Slide 111

Error detection: the ability to spot unwanted activity

Slide 112

Error recovery: the ability to restore systems to their intended functionality

Slide 113

Highest ROI: anticipating how the potential for failure evolves

Slide 114

Chaos eng: continual experimentation to evaluate response to unexpected failure

Slide 115

e.g. Retrograding: inject old versions of libs, containers, etc. into your systems
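
As an illustrative aside (not part of the deck): a retrograde experiment might launch an intentionally outdated container tag in a staging environment and then check whether anything noticed. A minimal Python sketch; the image name, tag, and the alert check are hypothetical placeholders to wire up to your own tooling.

import subprocess
import time

OLD_IMAGE = "registry.example.com/payments-api:2017-legacy"  # hypothetical outdated tag

def inject_retrograde_container() -> str:
    """Start the outdated image in staging and return its container id."""
    result = subprocess.run(
        ["docker", "run", "--rm", "-d", "--label", "chaos-experiment=retrograde", OLD_IMAGE],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

def alert_fired_for(container_id: str) -> bool:
    """Placeholder: query your monitoring or detection pipeline for a finding here."""
    return False  # assume nothing fired until this is wired to real telemetry

container_id = inject_retrograde_container()
time.sleep(300)  # give detection pipelines time to react
print("detected" if alert_fired_for(container_id) else "blind spot: outdated component went unnoticed")
subprocess.run(["docker", "stop", container_id], check=False)  # clean up the experiment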

Slide 116

Chaos engineering assumes existing knowledge hangs in a delicate balance

Slide 117

The potential for hazard is constantly changing, creating new blind spots

Slide 118

If you don’t understand your systems, you can’t ever hope to protect them

Slide 119

Chaos security engineering requires a blameless culture…

Slide 120

Blameless Culture

Slide 121

A blameless culture balances safety and accountability – not absolution

Slide 122

Supports a perpetual state of learning, in which critical info isn’t suppressed

Slide 123

Asking the right questions is the first step towards a blameless culture

Slide 124

Neutral questions prevent bias from seeping into our incident review

Slide 125

Ask other practitioners what they would do in the same original context

Slide 126

Case study: the stressed accountant

Slide 127

“Human error” becomes a reasonable action given the human’s circumstances

Slide 128

Your security program is set up to fail if it blames humans for reasonable actions

Slide 129

Neutral practitioner questions help sketch a portrait of local rationality

Slide 130

“Irrational behavior” is only irrational when considered without local context

Slide 131

Our goal is to change the context of decision-making to promote security

Slide 132

If you’re using an ad hominem attack in incident review, you’ve veered astray

Slide 133

In Conclusion

Slide 134

Discard the crutch of “human error” so you can learn from failure

Slide 135

Always consider the messiness of systems, organizations, and minds

Slide 136

You aren’t exempt – your own emotions play a part in these systems

Slide 137

Work with human nature rather than against it, and think in terms of systems

Slide 138

Leverage UX & chaos eng to improve the context your systems engender

Slide 139

Ask neutral questions & ensure your teams feel safe enough to discuss errors

Slide 140

Infosec is erring. But we still have the chance to become divine.

Slide 141

“We may encounter many defeats, but we must not be defeated. It may even be necessary to encounter the defeat, so that we can know who we are. So that we can see, oh, that happened, and I rose.” – Maya Angelou

Slide 142

@swagitda_ /in/kellyshortridge kelly@greywire.net

Slide 143

Suggested Reading

• “The evolution of error: Error management, cognitive constraints, and adaptive decision-making biases.” Johnson, D., et al.
• “Hindsight bias impedes learning.” Mahdavi, S., & Rahimian, M. A.
• “Outcome bias in decision evaluation.” Baron, J., & Hershey, J. C.
• “Human error.” Reason, J.
• “Behind human error.” Woods, D., et al.
• “People or systems? To blame is human. The fix is to engineer.” Holden, R. J.
• “Understanding adverse events: a human factors framework.” Henriksen, K., et al.
• “Engineering a safer world: Systems thinking applied to safety.” Leveson, N.
• “‘Going solid’: a model of system dynamics and consequences for patient safety.” Cook, R., & Rasmussen, J.
• “Choice Architecture.” Thaler, R. H., Sunstein, C. R., & Balz, J. P.
• “Blameless PostMortems and a Just Culture.” Allspaw, J.