Build resilience using learnings from incident communication patterns (DOP107)

Varying Mental Models

  • People can look at the same window but see different things - one might notice a beautiful tree, while another sees a cool car.
  • Without each other's perspectives, the mental model is incomplete, which is often seen in incident reviews.
  • The "line of representation" separates the technical system (below the line) from the people doing the cognitive work above it.
  • Those working above the line continuously build and refresh their models of what lies below, which is critical for understanding system resilience and adaptive capacity.

A Tale of Two Incident Reviews

Incident Review 1 (Templated Approach)

  • This was a seemingly innocuous incident with little customer impact, handled with a templated approach.
  • The purpose was to file a report, not to engage in learning.
  • The incident review provided limited context, raising more questions than answers for a new team member.
  • The language used was often counterfactual, suggesting "should haves" rather than describing what actually happened.

Incident Review 2 (Cognitive Work Approach)

  • This review was conducted by an engineer outside the main team, providing a fresh perspective.
  • It focused on the cognitive work, examining how people communicated during the incident.
  • The review included evidence from previous deploys, involved teams not previously included, and aimed to engage the audience.
  • It drew on various data sources, including chat logs, written communication, and video recordings, along with an understanding of the people involved.

Lessons Learned

  • The more fields in a template, the less people think about what really happened.
  • Focusing on the cognitive work and how people communicated provides valuable insights.
  • Involving a diverse set of stakeholders, including frontline support, dependent teams, future project managers, executives, and the wider organization, can lead to more learning opportunities.
  • Disseminating findings and sharing knowledge are crucial, though there is no single right way to do either.
  • Cognitive questioning and establishing rapport can help build trust and elicit more information from experts.
  • Continuous learning from incidents can feed onboarding, professional development, and meta-analysis, and can help identify knowledge gaps.
  • Not every incident is worthy of a deep dive, but relating incidents over time can reveal valuable insights.

CircleCI's Journey

  • CircleCI's reliability journey, from writing monthly blog posts about its challenges to eventually seeing reliability as a competitive strength.
  • The formation of a "tiger team" to personally attend every incident and understand the underlying issues.
  • Experimentation with different incident management tools, ultimately leading to the integration of Jeli (now part of PagerDuty) to improve coordination and communication.
  • The importance of extracting knowledge from experts to build up other team members and avoid single points of failure.
  • Leveraging automation and tooling to reduce the coordination and communication burden during incidents.
