AWS re:Invent 2025 - Breaking AWS networks on purpose to build resilience (DEV343)

Breaking AWS Networks on Purpose to Build Resilience

Introduction

  • Presenter is Craig Johnson, a Principal Solution Architect at Forward Networks
  • The talk focuses on introducing controlled chaos into AWS networks to build resilience, applying principles from on-premises data center networks

Mindset of a Network Engineer

  • Network is the "glue" that connects everything, so it's often blamed when issues arise
  • Need to be able to proactively disprove that the network is the problem

Gathering a Baseline

  • Importance of establishing a baseline before making network changes
  • Need to verify network intent and functionality, not just rely on application-level monitoring

Verifying Network Intent

  • Concept of "intent checks" to ensure changes don't break the network
  • Using AWS Network Manager to visualize the entire network topology, including on-premises components
  • Limitations of tools like flow logs and synthetic transactions for verifying network intent
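
One way to make an "intent check" concrete is to write each expectation down as data and compare it with what a probe observes. The sketch below is a minimal illustration, not the speaker's tooling: the `IntentCheck` schema, the `expect_reachable` flag for isolation rules, and the `probe` callable are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class IntentCheck:
    """One statement of network intent (hypothetical schema)."""
    name: str
    source: str
    destination: str
    port: int
    protocol: str = "tcp"
    expect_reachable: bool = True  # False encodes "must stay isolated"


# A probe answers: can the source reach the destination right now?
Probe = Callable[[IntentCheck], bool]


def failed_checks(checks: Iterable[IntentCheck], probe: Probe) -> List[IntentCheck]:
    """Return every check whose observed reachability differs from intent."""
    return [c for c in checks if probe(c) != c.expect_reachable]
```

Modelling isolation as `expect_reachable=False` lets the same list catch both a path that broke and a path that opened up unexpectedly.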

Introducing Controlled Chaos

  • Purposefully breaking parts of the network to test resilience and validate intent
  • Examples include misconfiguring security groups, creating black hole routes, or modifying load balancer targets
  • Running pre-flight and post-change reachability checks using AWS Reachability Analyzer
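
A security group misconfiguration of the kind listed above can be injected in a way that always cleans up after itself. The following is a sketch under assumptions: `ec2` is expected to be a boto3 EC2 client (or anything with the same two methods), the helper name `ingress_rule_removed` is made up for this example, and the talk's own fault-injection method may differ.

```python
from contextlib import contextmanager
from typing import Any, Dict, Iterator, List


@contextmanager
def ingress_rule_removed(
    ec2: Any, group_id: str, permissions: List[Dict[str, Any]]
) -> Iterator[None]:
    """Temporarily revoke an ingress rule, restoring it on exit.

    `permissions` uses the EC2 IpPermissions shape, for example:
    [{"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
      "IpRanges": [{"CidrIp": "10.0.0.0/16"}]}]
    """
    ec2.revoke_security_group_ingress(GroupId=group_id, IpPermissions=permissions)
    try:
        yield  # run reachability checks while the rule is missing
    finally:
        # Restore the rule even if a check inside the block raises.
        ec2.authorize_security_group_ingress(
            GroupId=group_id, IpPermissions=permissions
        )
```

Used as `with ingress_rule_removed(ec2, group_id, rule): ...`, the fault exists only for the duration of the block, which keeps the experiment controlled.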

Automating the Process

  • Integrating intent checks into a CI/CD pipeline using tools like Terraform or CloudFormation
  • Ensuring network changes never exit the change window with a broken network
  • Leveraging observability data to identify critical application flows and automate intent checks
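
The rule that a change should never leave the change window with a broken network can be expressed as a small gate around the deployment step. This is an illustrative sketch only: `run_checks`, `apply`, and `rollback` are hypothetical callables that a pipeline would supply (for instance, wrappers around its Terraform or CloudFormation steps), and the summary does not describe how the speaker wired this up.

```python
from typing import Callable, Set


def guarded_change(
    run_checks: Callable[[], Set[str]],
    apply: Callable[[], None],
    rollback: Callable[[], None],
) -> Set[str]:
    """Apply a change and roll it back if it breaks any intent check.

    `run_checks` returns the names of the intent checks that currently fail.
    The return value is the set of checks that passed before the change and
    fail after it; an empty set means the change was kept.
    """
    baseline = run_checks()  # pre-flight: failures that already existed
    try:
        apply()
    except Exception:
        rollback()  # a half-applied change is also undone
        raise
    regressions = run_checks() - baseline  # post-change: only new failures
    if regressions:
        rollback()
    return regressions
```

Comparing against the baseline rather than demanding zero failures means a pre-existing problem does not block an unrelated change, while any new failure still triggers the rollback.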

Key Takeaways

  • Network changes are a leading cause of outages, so proactive validation is crucial
  • Reachability Analyzer is a powerful tool for verifying network intent and troubleshooting issues
  • Introducing controlled chaos helps build resilience and confidence in the network
  • Automating the process of intent checks ensures network changes don't break production

Technical Details

  • AWS Network Manager for visualizing the entire network topology
  • AWS Reachability Analyzer for verifying network intent and troubleshooting
  • Terraform and CloudFormation for automating the intent check process
  • Leveraging observability data (flow logs, synthetic transactions) to identify critical flows
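
For readers who want to script Reachability Analyzer, the sketch below shows one possible shape using the EC2 API operations behind it (`create_network_insights_path`, `start_network_insights_analysis`, `describe_network_insights_analyses`). It assumes `ec2` is a boto3 EC2 client created by the caller; the function name, polling interval, and TCP-only scope are choices made for this example, and cleanup of the created path is left out for brevity.

```python
import time
from typing import Any


def path_found(
    ec2: Any,
    source: str,
    destination: str,
    port: int,
    poll_seconds: float = 5.0,
    max_polls: int = 60,
) -> bool:
    """Run one Reachability Analyzer analysis and report whether a path exists."""
    path_id = ec2.create_network_insights_path(
        Source=source, Destination=destination, Protocol="tcp", DestinationPort=port
    )["NetworkInsightsPath"]["NetworkInsightsPathId"]

    analysis_id = ec2.start_network_insights_analysis(
        NetworkInsightsPathId=path_id
    )["NetworkInsightsAnalysis"]["NetworkInsightsAnalysisId"]

    # The analysis runs asynchronously, so poll until it finishes.
    for _ in range(max_polls):
        analysis = ec2.describe_network_insights_analyses(
            NetworkInsightsAnalysisIds=[analysis_id]
        )["NetworkInsightsAnalyses"][0]
        if analysis["Status"] == "succeeded":
            return bool(analysis.get("NetworkPathFound", False))
        if analysis["Status"] == "failed":
            raise RuntimeError(analysis.get("StatusMessage", "analysis failed"))
        time.sleep(poll_seconds)
    raise TimeoutError(f"analysis {analysis_id} did not finish in time")
```

A caller would typically pass resource IDs such as network interface or instance IDs as `source` and `destination`, and run the function once before and once after a change.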

Business Impact

  • Reduces the risk of network-related outages and service disruptions
  • Increases confidence in the network team's ability to make changes safely
  • Enables faster troubleshooting and root cause analysis when issues do occur
  • Aligns with DevSecOps principles of "shifting left" to catch issues earlier in the development lifecycle

Examples

  • Simulating a misconfigured VPC attachment and using Reachability Analyzer to identify the issue
  • Automating pre- and post-change intent checks as part of a CI/CD pipeline
  • Leveraging observability data to continuously update and refine the set of critical intent checks
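
As a rough illustration of turning observability data into candidate intent checks, the sketch below ranks accepted flows in VPC flow log records by bytes transferred. It assumes records in the default (version 2) space-separated format; the function name `top_flows` and the choice to rank by bytes are assumptions made for this example, not something the summary attributes to the talk.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# (srcaddr, dstaddr, dstport, IANA protocol number)
Flow = Tuple[str, str, int, int]


def top_flows(lines: Iterable[str], limit: int = 10) -> List[Tuple[Flow, int]]:
    """Rank accepted flows by total bytes.

    Default-format fields, in order: version account-id interface-id srcaddr
    dstaddr srcport dstport protocol packets bytes start end action log-status
    """
    totals: Counter = Counter()
    for line in lines:
        fields = line.split()
        # Skip header rows, rejected traffic, and NODATA/SKIPDATA records.
        if len(fields) != 14 or fields[12] != "ACCEPT":
            continue
        key = (fields[3], fields[4], int(fields[6]), int(fields[7]))
        totals[key] += int(fields[9])
    return totals.most_common(limit)
```

Return traffic also appears in flow logs with an ephemeral destination port, so a real pipeline would likely filter or normalise those entries before promoting a flow to an intent check.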
