AWS re:Invent 2025 - Breaking AWS networks on purpose to build resilience (DEV343)
Introduction
Presenter is Craig Johnson, a Principal Solution Architect at Forward Networks
Focuses on introducing controlled chaos into AWS networks to build resilience, applying principles from on-premises data center networks
Mindset of a Network Engineer
Network is the "glue" that connects everything, so it's often blamed when issues arise
Need to be able to proactively disprove that the network is the problem
Gathering a Baseline
Importance of establishing a baseline before making network changes
Need to verify network intent and functionality, not just rely on application-level monitoring
Verifying Network Intent
Concept of "intent checks" to ensure changes don't break the network
Using AWS Network Manager to visualize the entire network topology, including on-premises components
Limitations of tools like flow logs and synthetic transactions for verifying network intent
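One way to picture an intent check is as a small record of what should (or should not) be reachable, evaluated against whatever tool answers the reachability question. The sketch below is illustrative only: `IntentCheck`, `failed_checks`, and `toy_probe` are hypothetical names that do not come from the talk, and the probe is a toy lookup standing in for a real analyzer.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class IntentCheck:
    """One statement of network intent (hypothetical structure)."""
    name: str
    source: str
    destination: str
    port: int
    expect_reachable: bool = True


# A probe answers: can `source` reach `destination` on `port`?
Probe = Callable[[str, str, int], bool]


def failed_checks(checks: list[IntentCheck], probe: Probe) -> list[str]:
    """Return the names of checks whose observed result differs from intent."""
    return [
        c.name
        for c in checks
        if probe(c.source, c.destination, c.port) != c.expect_reachable
    ]


# Toy probe backed by a set of allowed flows, standing in for a real analyzer.
ALLOWED = {("web", "db", 5432)}


def toy_probe(source: str, destination: str, port: int) -> bool:
    return (source, destination, port) in ALLOWED


checks = [
    IntentCheck("web reaches db", "web", "db", 5432),
    IntentCheck("internet blocked from db", "internet", "db", 5432,
                expect_reachable=False),
]

print(failed_checks(checks, toy_probe))  # prints []
```

Note that the second check encodes negative intent: a path that must stay closed is as much a part of the baseline as one that must stay open.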
Introducing Controlled Chaos
Purposefully breaking parts of the network to test resilience and validate intent
Examples include misconfiguring security groups, creating black hole routes, or modifying load balancer targets
Running pre-flight and post-change reachability checks using AWS Reachability Analyzer
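The pre-flight, break, and post-change sequence described above might be sketched as follows. This is a minimal sketch under stated assumptions, not the presenter's implementation: `path_found`, `chaos_experiment`, and `FakeEC2` are hypothetical names, and the resource IDs are placeholders. The three client methods called are the Reachability Analyzer operations exposed by the boto3 EC2 client; `FakeEC2` mimics only the response fields the sketch reads so that it runs without AWS credentials.

```python
import time


def path_found(ec2, source_id, destination_id, port,
               poll_seconds=5.0, max_polls=60):
    """Run one Reachability Analyzer analysis; True if a path was found."""
    path = ec2.create_network_insights_path(
        Source=source_id, Destination=destination_id,
        Protocol="tcp", DestinationPort=port,
    )
    path_id = path["NetworkInsightsPath"]["NetworkInsightsPathId"]
    started = ec2.start_network_insights_analysis(NetworkInsightsPathId=path_id)
    analysis_id = started["NetworkInsightsAnalysis"]["NetworkInsightsAnalysisId"]
    for _ in range(max_polls):
        result = ec2.describe_network_insights_analyses(
            NetworkInsightsAnalysisIds=[analysis_id]
        )["NetworkInsightsAnalyses"][0]
        if result["Status"] != "running":
            # A failed analysis is treated as "not reachable" in this sketch.
            return bool(result["Status"] == "succeeded"
                        and result.get("NetworkPathFound", False))
        time.sleep(poll_seconds)
    raise TimeoutError(f"analysis {analysis_id} did not finish")


def chaos_experiment(check, inject_fault, rollback):
    """Pre-flight check, inject a fault, confirm detection, roll back, re-check."""
    if not check():
        raise RuntimeError("pre-flight check failed; not injecting a fault")
    inject_fault()
    try:
        detected = not check()   # the check should now fail
    finally:
        rollback()               # always undo the fault
    return {"fault_detected": detected, "recovered": check()}


class FakeEC2:
    """In-memory stand-in for the EC2 client (assumption, for local runs)."""
    def __init__(self):
        self.reachable = True

    def create_network_insights_path(self, **kwargs):
        return {"NetworkInsightsPath": {"NetworkInsightsPathId": "nip-fake"}}

    def start_network_insights_analysis(self, **kwargs):
        return {"NetworkInsightsAnalysis": {"NetworkInsightsAnalysisId": "nia-fake"}}

    def describe_network_insights_analyses(self, **kwargs):
        return {"NetworkInsightsAnalyses": [
            {"Status": "succeeded", "NetworkPathFound": self.reachable}]}


ec2 = FakeEC2()  # with real AWS this would be boto3.client("ec2")
result = chaos_experiment(
    check=lambda: path_found(ec2, "i-web", "i-db", 5432, poll_seconds=0),
    inject_fault=lambda: setattr(ec2, "reachable", False),  # e.g. drop a rule
    rollback=lambda: setattr(ec2, "reachable", True),
)
print(result)  # prints {'fault_detected': True, 'recovered': True}
```

The sketch does not delete the paths and analyses it creates; against a real account those would accumulate and would need cleanup. A result of `fault_detected: False` is the interesting failure: it means the check did not notice the break, so it cannot be trusted to guard a real change.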
Automating the Process
Integrating intent checks into a CI/CD pipeline using tools like Terraform or CloudFormation
Ensuring network changes never exit the change window with a broken network
Leveraging observability data to identify critical application flows and automate intent checks
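A pipeline gate along these lines could compare intent check results captured before and after a change and fail the stage on any regression. The sketch below assumes the results are already available as name-to-pass/fail mappings; `regressions`, `gate`, and the check names are hypothetical and not taken from the talk.

```python
import sys


def regressions(baseline: dict[str, bool], after: dict[str, bool]) -> list[str]:
    """Checks that passed before the change and fail (or are missing) after it."""
    return sorted(
        name for name, ok in baseline.items()
        if ok and not after.get(name, False)
    )


def gate(baseline: dict[str, bool], after: dict[str, bool]) -> int:
    """Return a process exit code: 0 if nothing regressed, 1 otherwise."""
    broken = regressions(baseline, after)
    for name in broken:
        print(f"REGRESSION: {name}", file=sys.stderr)
    return 1 if broken else 0


# Results gathered before and after applying an infrastructure change.
baseline = {
    "web -> db:5432": True,
    "onprem -> api:443": True,
    "internet -> db:5432 is blocked": True,
}
after = {
    "web -> db:5432": True,
    "onprem -> api:443": False,   # broken by the change
    "internet -> db:5432 is blocked": True,
}

exit_code = gate(baseline, after)
print(exit_code)  # prints 1
```

In a pipeline step, passing the return value to `sys.exit` would make the stage fail, which can in turn trigger a rollback before the change window closes.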
Key Takeaways
Network changes are a leading cause of outages, so proactive validation is crucial
Reachability Analyzer is a powerful tool for verifying network intent and troubleshooting issues
Introducing controlled chaos helps build resilience and confidence in the network
Automating the process of intent checks ensures network changes don't break production
Technical Details
AWS Network Manager for visualizing the entire network topology
AWS Reachability Analyzer for verifying network intent and troubleshooting
Terraform and CloudFormation for automating the intent check process
Leveraging observability data (flow logs, synthetic transactions) to identify critical flows
Business Impact
Reduces the risk of network-related outages and service disruptions
Increases confidence in the network team's ability to make changes safely
Enables faster troubleshooting and root cause analysis when issues do occur
Aligns with DevSecOps principles of "shifting left" to catch issues earlier in the development lifecycle
Examples
Simulating a misconfigured VPC attachment and using Reachability Analyzer to identify the issue
Automating pre- and post-change intent checks as part of a CI/CD pipeline
Leveraging observability data to continuously update and refine the set of critical intent checks
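As an illustration of the last example, candidate intent checks could be derived by ranking accepted flows in VPC Flow Logs. The sketch below assumes records in the default (version 2) flow log format and uses a hypothetical `top_flows` helper with made-up sample records. Filtering on a known set of service ports is a heuristic assumption, used here to drop return traffic to ephemeral ports.

```python
from collections import Counter

# Field order of the default (version 2) VPC Flow Logs record format.
FIELDS = ("version account_id interface_id srcaddr dstaddr srcport dstport "
          "protocol packets bytes start end action log_status").split()


def top_flows(lines, service_ports, n=3):
    """Rank accepted TCP flows to known service ports by packet count."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != len(FIELDS):
            continue  # skip headers or custom-format records
        rec = dict(zip(FIELDS, parts))
        # Protocol 6 is TCP; NODATA records have "-" as the action.
        if rec["action"] != "ACCEPT" or rec["protocol"] != "6":
            continue
        if rec["dstport"].isdigit() and int(rec["dstport"]) in service_ports:
            key = (rec["srcaddr"], rec["dstaddr"], int(rec["dstport"]))
            counts[key] += int(rec["packets"])
    return counts.most_common(n)


sample = [
    "2 123456789012 eni-0a1 10.0.1.10 10.0.2.20 49152 5432 6 40 4000 1700000000 1700000060 ACCEPT OK",
    "2 123456789012 eni-0a1 10.0.2.20 10.0.1.10 5432 49152 6 38 3800 1700000000 1700000060 ACCEPT OK",
    "2 123456789012 eni-0a2 10.0.1.11 10.0.2.20 49200 5432 6 10 1000 1700000000 1700000060 ACCEPT OK",
    "2 123456789012 eni-0a3 198.51.100.7 10.0.2.20 50000 22 6 1 40 1700000000 1700000060 REJECT OK",
    "2 123456789012 eni-0a1 - - - - - - - 1700000000 1700000060 - NODATA",
]

for flow, packets in top_flows(sample, service_ports={443, 5432}):
    print(flow, packets)
# prints:
# ('10.0.1.10', '10.0.2.20', 5432) 40
# ('10.0.1.11', '10.0.2.20', 5432) 10
```

Each ranked tuple is a candidate "must stay reachable" check; rerunning the ranking periodically is one way the set of checks could be kept current as traffic patterns change.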