AWS re:Invent 2025 - Why is Reliability So Hard? (DVT227)

Summary of AWS re:Invent 2025 Presentation: "Why is Reliability So Hard?"

Introduction

  • Presenter: Hannes Lenke, CEO and co-founder of Checkly
  • Checkly helps organizations detect, communicate, and resolve software reliability issues faster
  • The stated goal is to help engineers "own reliability from pull request to postmortem"

The Evolution of Software Reliability

  • A decade ago, software was built and shipped much less frequently (yearly, quarterly, monthly)
  • Today, software is built and shipped almost instantly, but reliability has not kept pace
  • Yesterday's applications were simple, with few dependencies - issues were easy to identify
  • Modern applications are highly complex, with many dependencies that introduce potential failure points

The Reliability Challenge

  • Increased complexity and dependencies make it harder to ensure reliability
  • Traditional approaches of more people, processes, and testing have not solved the problem
  • Both development and operations teams try to validate application functionality, but in siloed ways

Key Principles for High-Performing Teams

  1. Predictability: Ability to predict how an application will behave when released to production
  2. Accountability: Knowing what changed, who changed it, why, and when - to identify the root cause of issues
  3. Resiliency: Building applications that can be quickly rolled back in the event of problems
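
The accountability and resiliency principles can be sketched together: if every release records what changed, who changed it, why, and when, then a failure timestamp points at a likely suspect and at a version to roll back to. The Python below is a minimal sketch, not something shown in the talk; the names `Change`, `suspect_change`, and `rollback_target` and the sample history are all invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Change:
    """What changed, who changed it, why, and when (hypothetical record shape)."""
    version: str
    author: str
    reason: str
    deployed_at: datetime


def suspect_change(history: list[Change], failure_at: datetime) -> Optional[Change]:
    """Return the most recent change deployed at or before the failure time."""
    earlier = [c for c in history if c.deployed_at <= failure_at]
    return max(earlier, key=lambda c: c.deployed_at, default=None)


def rollback_target(history: list[Change], suspect: Change) -> Optional[Change]:
    """Return the change deployed immediately before the suspect, if any."""
    previous = [c for c in history if c.deployed_at < suspect.deployed_at]
    return max(previous, key=lambda c: c.deployed_at, default=None)


# Invented sample history, for illustration only.
HISTORY = [
    Change("v1.4.0", "alice", "add order export", datetime(2025, 12, 1, 9, 0)),
    Change("v1.4.1", "bob", "upgrade payment client", datetime(2025, 12, 1, 14, 30)),
    Change("v1.5.0", "carol", "new checkout flow", datetime(2025, 12, 2, 10, 15)),
]

failure_at = datetime(2025, 12, 1, 15, 5)
suspect = suspect_change(HISTORY, failure_at)
target = rollback_target(HISTORY, suspect) if suspect else None
```

The most recent change is only a starting point for investigation, not a proven root cause; the point of the sketch is that the lookup is trivial once the change record exists.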

Checkly's Approach

  • Unifies testing and monitoring into a single, version-controlled workflow
  • Allows teams to build tests (UI, API, uptime) as code and deploy them for continuous monitoring
  • Integrates the reliability pipeline with the CI/CD pipeline, enabling a common language and visibility
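
As a rough illustration of the "tests as code" idea above, the following Python sketch defines checks once and reuses the same definitions both as a pre-deploy gate in CI and as a recurring monitor. It is not the vendor's actual API: `Check`, `run_check`, `ci_gate`, `monitor_schedule`, the URLs, and the injected `fetch` function are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical fetch signature: takes a URL, returns (status_code, latency_ms).
Fetch = Callable[[str], Tuple[int, float]]


@dataclass(frozen=True)
class Check:
    """One check definition, kept in version control next to the application code."""
    name: str
    url: str
    expected_status: int = 200
    max_latency_ms: float = 500.0
    frequency_minutes: int = 5  # only used when the check runs as a monitor


def run_check(check: Check, fetch: Fetch) -> bool:
    """Run a check once; pass only if status and latency are within bounds."""
    status, latency_ms = fetch(check.url)
    return status == check.expected_status and latency_ms <= check.max_latency_ms


def ci_gate(checks: list[Check], fetch: Fetch) -> list[str]:
    """Pre-deploy use: return the names of failing checks (empty list means ship)."""
    return [c.name for c in checks if not run_check(c, fetch)]


def monitor_schedule(checks: list[Check]) -> dict[str, int]:
    """Post-deploy use: the same definitions, mapped to how often each one runs."""
    return {c.name: c.frequency_minutes for c in checks}


# Stand-in for a real HTTP client so the sketch runs offline (assumed responses).
def fake_fetch(url: str) -> Tuple[int, float]:
    responses = {
        "https://example.com/api/health": (200, 120.0),
        "https://example.com/api/orders": (200, 900.0),  # exceeds the latency budget
    }
    return responses.get(url, (404, 0.0))


CHECKS = [
    Check(name="health", url="https://example.com/api/health"),
    Check(name="orders", url="https://example.com/api/orders", frequency_minutes=1),
]

failing = ci_gate(CHECKS, fake_fetch)
schedule = monitor_schedule(CHECKS)
```

Because the gate and the schedule read the same `CHECKS` list, a change to a threshold is reviewed in one pull request and applies to both testing and monitoring, which is the shared visibility the bullets describe.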

The Evolving "You" in Software Reliability

  • Traditionally, "you" referred to the developer or engineer responsible for the code
  • Today, "you" encompasses anyone or anything that touches the user experience, including agents, Claude Code, and other tools
  • In the future, agents may be capable of building, testing, monitoring, and owning more of the software lifecycle

Key Takeaways

  • Reliability has not kept pace with the rapid evolution of software development
  • Increased complexity and dependencies make it harder to ensure reliability using traditional approaches
  • High-performing teams focus on predictability, accountability, and resiliency to improve reliability
  • Checkly's approach unifies testing and monitoring, integrating the reliability pipeline with CI/CD
  • The concept of "you" in software reliability is expanding to include a wider range of stakeholders and tools
