Observability, Incident Response, and Common Ground that Binds

A presentation at CMG ObservabilityCon in September 2022 by Matt Davis

Slide 1

Observability, Incident Response, & Common Ground that Binds
Matt Davis, SRE Advocate, Blameless

Slide 2

How Complex Systems Fail, Dr. Richard Cook

12. Human practitioners are the adaptable element of complex systems. Practitioners and first line management actively adapt the system to maximize production and minimize accidents.

Slide 3

Common Ground & Joint Activity

Slide 4

Joint Activity

- Parties intend to work together, not in parallel.
- Their work is interdependent.
- All are committed to common goals.
- Collaborative choreography is achieved through effective coordination.

Interpredictability [ Reciprocity ]
Directability [ Reframing ]
Common Ground [ Group Intuition ]

Slide 5

Common Ground: qualities of grounding

A process of communicating, testing, updating, tailoring, & repairing mutual understandings & mental models, consisting of:

- Initial Conditions
- Status of what has transpired
- Changes of knowledge since we started

Slide 6

Common Ground: types of knowledge, beliefs, assumptions

- Each participant’s Role
- Routines the team can handle
- Expertise of each person
- Each person’s Stance: their perception of production pressure, level of fatigue, cognitive weight, competing priorities

Slide 7

Common Ground: key aspects for incidents

A. We commit to continually inspect and adjust Common Ground as we work towards remediation.
B. We share the types of knowledge, beliefs, and assumptions at work.
C. We recognize coordination signals.

Slide 8

The Response Trio

Slide 9

Incident Command: adaptive choreography

Slide 10

Incident Command: responsibilities

1. Gather Knowledge
2. Organize Resources
3. Make Informed Decisions
4. Support Common Ground

IC owns the Response… not the Incident.

Slide 11

Incident Command: adaptive choreography

Command remains flexible to an emerging situation and guides the response to remediation, giving Problem Solvers autonomy to self-coordinate and make local decisions as needed.

Slide 12

For the Interoperation of Distributed Software, Incidents need: Machine Observability

For the Reciprocity & Interpredictability of People, Incidents also need: Human Observability

Slide 13

The Human Observability of Incident Command

Slide 14

Dynamics of Reciprocity

The ability to “look in and listen in” has been widely documented as a benefit to smooth coordination.
- Dr. Laura Maguire, Managing the Hidden Costs of Coordination, ACM Queue

Slide 15

Human Observability: support common grounding

Listen. Get the big picture by asking others for their perspective. Be attentive to people’s ability to respond while keeping a regular tempo with check-ins.

Update. When new people join, give them a status. When a mitigation is made, announce it. Feel like you are overcommunicating. Delegate when you need help updating.

Guide. Get multiple perspectives for making decisions. Politely keep people on-topic. Suggest next steps or alternatives to engage minds.

Slide 16

Human Observability: support common grounding

Monitor. Recognize signals like: questions around confusion, side conversations, huddles without updates, lack of response from a Role, hypothesis testing that goes too long, a Problem Solver becoming fatigued.

Repair. Get knowledge flowing by asking questions. Redirect threads to the main chat. Delegate a Scribe. Pull participants together for a no-work status chat/meeting.

Practice. Use dedicated collaboration sessions outside of incidents to build personal relationships and group intuition. Provide a safe, unpressurized space for people to feel vulnerable with their peers.

Slide 17

Human Observability: takeaways for incidents

A. Share and update mental models to support common goals.
B. Maintain transparency to support reciprocity and signals.
C. Be flexible enough to change course and recalibrate when breakdowns occur.

Slide 18

References

1. Cook, Richard. How Complex Systems Fail. https://how.complexsystems.fail.
2. Klein, G., Feltovich, P. J., Bradshaw, J. M., Woods, D. D. 2004. Common ground and coordination in joint activity. http://jeffreymbradshaw.net/publications/Common_Ground_Single.pdf.
3. Maguire, Laura. Managing the Hidden Costs of Coordination. https://queue.acm.org/detail.cfm?id=3380779.

Slide 19

Q&A matt davis @dtauvdiodr