Don’t Panic! Effective Incident Response

A presentation at Conf42: SRE in September 2021 by Quintessence Anx

Slide 1

Don’t Panic! Effective Incident Response
Presented by @QuintessenceAnx, DevOps Advocate
2021 Proprietary & Confidential

Slide 2

What We’ll Be Learning Today
After completing this training, you will be able to:
• Build a foundation for an effective incident response process in your organization
• Understand suggested practices needed for successful incident response
• Identify practices that limit damage and reduce recovery time and costs

Slide 3

An incident is any unplanned disruption or event that requires immediate attention or action

Slide 4

Replace chaos with calm

Slide 5

Incident Response is an organized approach to addressing and managing an incident

Slide 6

The goal of Incident Response is to handle the situation in a way that limits damage and reduces recovery time and costs

Slide 7

To Accomplish This Goal You Must:
● Mobilize and inform only the right people at the right time
● Use systematic learning and improvement
● Work toward total automation

Slide 8

Based on the Incident Command System, originally developed for California wildfire response.

Slide 9

An incident is an unplanned disruption or event that requires immediate attention or action

Slide 10

A major incident requires a coordinated response between multiple teams

Slide 11

The 4 Commonalities of Major Incidents
● Timing is a surprise; little or no warning
● Time matters; need to respond quickly
● Situation rarely perfectly understood at the start
● Require mobilization and coordination, typically cross-functional

Slide 12

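[Diagram: severity scale running from SEV-5 (least severe) to SEV-1 (most severe), with the labels "Incident" and "Major Incident" marking the more severe levels]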
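
A minimal sketch of how a severity scale like this might be encoded. It assumes SEV-3 and above count as incidents and SEV-2 and above count as major incidents; the slide does not state the exact cut-offs, so both thresholds and the names `Severity`, `is_incident`, and `is_major_incident` are illustrative assumptions to adapt to your own definitions.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Severity levels; a lower number means a more severe event."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4
    SEV5 = 5


# Assumed thresholds (not stated on the slide); tune for your organization.
INCIDENT_THRESHOLD = Severity.SEV3
MAJOR_INCIDENT_THRESHOLD = Severity.SEV2


def is_incident(sev: Severity) -> bool:
    """True if the event is severe enough to be handled as an incident."""
    return sev <= INCIDENT_THRESHOLD


def is_major_incident(sev: Severity) -> bool:
    """True if the event needs a coordinated, multi-team response."""
    return sev <= MAJOR_INCIDENT_THRESHOLD
```

Keeping the thresholds in one place means responders do not have to debate the definition during a call; they only need to agree on the severity level.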

Slide 13

Anyone can trigger the Incident Response Process at any time

Slide 14

[Chat screenshot]
Rich Adams (11:12): !ic page
Officer URL [APP] (11:12): Paging Incident Commander(s)
Arup Chakrabarti has been paged.
Paul Rechsteiner has been paged.
Renee Lung has been paged.
Use !ic responders to see who the team responders are.
Incident triggered: https://example.pagerduty.com/incident/PD5I34R

Slide 15

PEACETIME WARTIME

Slide 16

NORMAL EMERGENCY

Slide 17

OK NOT OK

Slide 18

Decision Paralysis

Slide 19

People Roles & Incident Categorization

Slide 20

The Four Steps of an Incident: TRIAGE → MOBILIZE → RESOLVE → PREVENT

Slide 21

Roles of Incident Response
● COMMAND: Incident Commander, Scribe, Deputy
● LIAISONS: Internal Liaison, Customer Liaison
● OPERATIONS: Subject Matter Experts (SMEs)

Slide 22

Setting this up at scale
For a department-wide Incident Response process, you will need a few things set up to begin. This includes:
● An on-call schedule for a primary and backup Incident Commander (this role is team agnostic)
● On-call schedules for primary and backup subject matter experts (one primary and one backup for each team)
● Additional on-call rotations for other roles
● A method of paging team members (response mobilization)
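
As a rough illustration of the primary and backup schedules described above, here is a minimal sketch of a weekly rotation. The `Rotation` class, the weekly shift length, the rule that the backup is the next person in line, and the example names are all hypothetical; a real paging tool would normally manage this for you.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Rotation:
    """A weekly on-call rotation for one role (hypothetical structure)."""
    role: str
    members: tuple[str, ...]
    start: date  # first day of the first shift

    def on_call(self, day: date) -> tuple[str, str]:
        """Return (primary, backup) for the given day."""
        if len(self.members) < 2:
            raise ValueError("need at least two people for primary and backup")
        weeks = (day - self.start).days // 7
        primary = self.members[weeks % len(self.members)]
        # Assumption: the backup is whoever is primary the following week.
        backup = self.members[(weeks + 1) % len(self.members)]
        return primary, backup


# Example: a team-agnostic Incident Commander rotation with placeholder names.
ic_rotation = Rotation(
    role="Incident Commander",
    members=("alice", "bob", "carol"),
    start=date(2021, 9, 6),
)
primary, backup = ic_rotation.on_call(date(2021, 9, 15))
```

The same structure could be instantiated once per team for the subject matter expert schedules.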

Slide 23

Incident Response - typical sequence of events
[Diagram: the response starts with "!ic page". A Subject Matter Expert (SME) is usually the first individual to respond, and solve (or escalate).]
● COMMAND: Incident Commander, Scribe, Deputy
● LIAISONS: Internal Liaison, External Liaison
● OPERATIONS: SMEs from Team Bravo, Team Charlie, Team Delta, Team Echo

Slide 24

How Do The Roles Scale Down?
For a small team-based Incident Response process, you will need a few things set up to begin. This includes:
● An on-call schedule for primary and backup subject matter experts
● A method of paging out other team members

Slide 25

Small Team Incident Response
[Diagram: Incident Commander, Scribe, and Subject Matter Experts (SMEs), with the labels "Primary On-Call" and "Backup On-Call"]

Slide 26

Incident Commander: Role and Responsibilities

Slide 27

Replace chaos with calm

Slide 28

Single source of reference

Slide 29

Gain consensus: “Are there any strong objections?”

Slide 30

Make a decision

Slide 31

Assign tasks to a specific person

Slide 32

Becomes the highest authority (Yes, even higher than the CEO)

Slide 33

Deep technical knowledge is not required

Slide 34

Handoffs are encouraged

Slide 35

[Cycle diagram: ASK FOR STATUS, DECIDE ACTION, GAIN CONSENSUS, ASSIGN TASK, FOLLOW UP ON TASK COMPLETION]

Slide 36

[Cycle diagram: SIZE-UP, STABILIZE, UPDATE, VERIFY]

Slide 37

Quick Tips for New Incident Commanders
● Introduce yourself on the call with your name and that you are the Incident Commander
● Avoid acronyms
● Speak slowly and with purpose
● On the call, kick people off if they are being disruptive
● Time-box tasks and check in for status updates
● Explicitly declare when the response has ended

Slide 38

Summary: Importance of the Incident Commander
• Keeps everyone focused
• Keeps decision-making moving
• Helps to avoid the bystander effect
• Keeps things moving towards a resolution during a major incident

Slide 39

Roles of Incident Response
● COMMAND: Incident Commander, Scribe, Deputy
● LIAISONS: Internal Liaison, Customer Liaison
● OPERATIONS: Subject Matter Experts (SMEs)

Slide 40

Importance of the Deputy Role
• Keeps the Incident Commander focused
• Takes on any and all additional tasks as necessary
• Serves to follow up on reminders and ensure tasks aren’t missed
• Acts as a “hot standby” for the Incident Commander

Slide 41

Importance of the Scribe
• Documents the incident timeline and important events as they occur
• The incident log will be used during the post-mortem process
• Notes when important actions are taken, follow-up items, and status updates
• Anyone can be a Scribe

Slide 42

Importance of the Communications Liaison Roles
• Can be external, internal, or both
• Notifies customers of current conditions, and informs the Incident Commander of relevant feedback
• Crafts language-appropriate status updates and notification messages
• Typically a member of the Support team

Slide 43

Incident Response Pitfalls

Slide 44

Executive Swoop

Slide 45

“Let’s try and resolve this in 10 minutes please!”

Slide 46

“Can I get a spreadsheet of all affected customers?”

Slide 47

“Do what I say”

Slide 48

KEY TAKEAWAY Do you wish to take command? …

Slide 49

Failure to Notify Stakeholders

Slide 50

Too frequent status updates

Slide 51

Red Herrings

Slide 52

Anti-Patterns
● Debating the severity of an incident during the call
● Discussing process and policy decisions
● Not disseminating policy changes
● Hesitating to escalate to other responders
● Neglecting the postmortem and follow up activities
● Trying to take on multiple roles
● Getting everyone on the call
● Forcing everyone to stay on the call
● Assuming silence means no progress

Slide 53

How do I prepare to manage incident response teams?

Slide 54

Step 1 Ensure explicit processes and expectations exist

Slide 55

Step 2 Practice running major incidents as a team

Slide 56

Step 3 Find ways to tune your processes so they work for your teams

Slide 57

Step 4 Make Checklists

Slide 58

Example Checklists

Start of Incident: Mobilize Response
❏ Join the #incident-war-room and Zoom call
❏ Announce self as Incident Commander
❏ Acknowledge the incident
❏ Assign deputy
❏ Assign scribe
❏ Confirm liaison present
❏ Confirm SMEs present
❏ Run !ic responders to get list of oncalls on Slack

Incident Response Loop
❏ Size-up the situation
    ❏ What’s wrong?
    ❏ Which systems are affected?
    ❏ Is this affecting multiple systems?
    ❏ What’s the customer impact?
❏ Stabilize the incident
    ❏ What actions can we take?
    ❏ Was there a related change or deploy?

Reminders during an Ongoing Incident
❏ Suggest people leave call if they are not required
❏ SME, Scribe, Comms handoff to avoid fatigue
❏ Incident Commander Swap
    ❏ Ask deputy to take over
    ❏ Summarize status
    ❏ Announce change in command

Incident Resolved
❏ Notify customers of resolution
❏ Scale down the response
    ❏ Direct all follow up to #incident-followup
    ❏ Announce end of incident call
❏ Resolve the PD incident
❏ Create the postmortem
    ❏ Assign postmortem owner
❏ Send email to incident-reports@pd.com

Slide 59

Don’t neglect the postmortem

Slide 60

Postmortems for Beginners
• A Brief Overview: high level of the impact (1-2 sentences)
• What happened: Detailed description, usually 1-2 paragraphs or more depending on length of response efforts
• What went well?
• What didn’t go so well?
• Action items - if you don’t have any, what was the point of having a response?

Slide 61

Detailed Postmortems
• Brief Overview: high level of the impact (1-2 sentences)
• Contributing factors
• Resolution actions
• What Happened: Detailed description (usually 1-2 paragraphs, or more)
• Impact: who did this affect, by how much, for how long?
• What went well
• Internal Messaging
• What didn’t go so well
• External Messaging (direct either to affected customers or all customers)
• Action Items (if you don’t have any, what was the point of having a response?)
• Detailed Timeline of Events

Slide 62

Summary
• Use the Incident Command System for managing incidents
• An Incident Commander takes charge during wartime scenarios
• Set expectations upward
• Work with your team to set explicit processes and expectations
• Practice, practice, practice!
• Don’t forget to review and improve

Slide 63

response.pagerduty.com

Slide 64

Q&A @QuintessenceAnx https://noti.st/quintessence