Varying Mental Models
- People can look through the same window but see different things: one might notice a beautiful tree, while another sees a cool car.
- Without each other's perspectives, each person's mental model is incomplete, a gap that often surfaces in incident reviews.
- The "line of representation" separates the people doing the cognitive work (above the line) from the technical system (below the line), which can only be observed indirectly through its representations.
- Those working above the line continuously build and refresh their models of what lies below, which is critical for understanding system resilience and adaptive capacity.
A Tale of Two Incident Reviews
Incident Review 1 (Templated Approach)
- This was a seemingly innocuous incident with little customer impact, handled with a templated approach.
- The purpose was to file a report, not to engage in learning.
- The incident review provided limited context, raising more questions than answers for a new team member.
- The language used was often counterfactual, suggesting "should haves" rather than describing what actually happened.
Incident Review 2 (Cognitive Work Approach)
- This review was conducted by an engineer outside the main team, providing a fresh perspective.
- It focused on the cognitive work, examining how people communicated during the incident.
- The review included evidence from previous deploys, involved teams not previously included, and aimed to engage the audience.
- It drew on various data sources, including chat logs, written communication, and video recordings, along with an understanding of the people involved.
Lessons Learned
- The more fields in a template, the less people think about what really happened.
- Focusing on the cognitive work and how people communicated provides valuable insights.
- Involving a diverse set of stakeholders, including frontline support, dependent teams, future project managers, executives, and the wider organization, can lead to more learning opportunities.
- Disseminating findings and sharing knowledge are crucial, though there is no single right way to do either.
- Cognitive questioning and establishing rapport can help build trust and elicit more information from experts.
- Continuous learning from incidents can feed into onboarding, professional development, and meta-analysis, and can help identify knowledge gaps.
- Not every incident is worthy of a deep dive, but relating incidents over time can reveal valuable insights.
CircleCI's Journey
- A reliability journey that began with writing monthly blog posts about the company's challenges and led to eventually seeing reliability as a competitive strength.
- The formation of a "tiger team" to personally attend every incident and understand the underlying issues.
- Experimentation with different incident management tools, ultimately leading to the integration of Jeli (now part of PagerDuty) to improve coordination and communication.
- The importance of extracting knowledge from experts to build up other team members and avoid single points of failure.
- Leveraging automation and tooling to reduce the coordination and communication burden during incidents.