AWS re:Invent 2025 - Architecture lessons: Three failures and how to prevent them (DEV341)

Architecture Lessons: Three Failures and How to Prevent Them

Failure 1: Lack of Resilience

  • The customer was a Brazilian e-commerce company that sold mobile devices such as iPhone and Motorola handsets.
  • The initial architecture was a basic web tier application with a load balancer, EC2 instances, and a database (RDS or DynamoDB).
  • The key issues were:
    • Manual scaling: The team had to manually scale instances during peak seasons like Black Friday, leading to downtime.
    • Single availability zone: Most instances were hosted in a single availability zone, with only a few in another, increasing the risk of failure.
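    • Primary database in the same availability zone: The primary database shared an availability zone with most of the instances, so a single zone failure could take out both tiers, and the few instances in the other zone paid extra latency to reach the database.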
    • Lack of caching: The application did not have any caching strategy, leading to high database load.
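
The caching gap can be illustrated with a minimal cache-aside sketch. It is not the customer's implementation: `CacheAside`, `load_product`, and the 60-second TTL are hypothetical names and values chosen for illustration, and a production setup would more likely use a shared cache service than an in-process dictionary.

```python
import time
from typing import Callable, Dict, Tuple


class CacheAside:
    """Tiny in-process cache-aside wrapper with a per-entry TTL."""

    def __init__(self, loader: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader          # hypothetical database read
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}
        self.loader_calls = 0          # counts trips to the database

    def get(self, key: str) -> str:
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]              # fresh entry: no database call
        value = self._loader(key)      # miss or expired: read through
        self.loader_calls += 1
        self._entries[key] = (now + self._ttl, value)
        return value


def load_product(product_id: str) -> str:
    # Stand-in for a query against the product table (assumption).
    return f"product-record-{product_id}"


cache = CacheAside(load_product, ttl_seconds=60.0)
for _ in range(1000):
    cache.get("phone-123")
print(cache.loader_calls)  # 1000 reads served by a single database call
```

With a hot key such as a popular product page, almost every read is served from memory, which is the load reduction the missing caching layer would have provided.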

Lessons Learned:

  • Distribute instances across multiple availability zones for high availability.
  • Implement auto-scaling to handle traffic spikes automatically.
  • Leverage caching strategies (e.g., for images) to reduce database load.
  • Use a multi-region architecture to improve resilience and reduce latency.
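
The first two lessons can be sketched together in a few lines. This is a simplified proportional sizing rule in the spirit of target-tracking scaling, not the algorithm any AWS service actually runs; the function names, zone labels, and numbers are assumptions made for illustration.

```python
import math
from typing import Dict, List


def desired_capacity(current: int, metric: float, target: float,
                     minimum: int, maximum: int) -> int:
    """Proportional sizing: scale so the per-instance metric nears target."""
    wanted = math.ceil(current * metric / target)
    return max(minimum, min(maximum, wanted))


def spread_across_zones(count: int, zones: List[str]) -> Dict[str, int]:
    """Split count as evenly as possible; earlier zones absorb the remainder."""
    base, extra = divmod(count, len(zones))
    return {z: base + (1 if i < extra else 0) for i, z in enumerate(zones)}


zones = ["zone-a", "zone-b", "zone-c"]  # placeholder zone names
# Four instances running at 90% CPU against a 50% target (made-up numbers).
n = desired_capacity(current=4, metric=90.0, target=50.0, minimum=2, maximum=20)
print(n, spread_across_zones(n, zones))
# 8 {'zone-a': 3, 'zone-b': 3, 'zone-c': 2}
```

The even split means losing any one zone removes at most three of the eight instances, in contrast to the original layout where most instances sat in a single zone.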

Failure 2: Security and Operational Excellence Challenges

  • The customer had an enterprise-grade application running on a Kubernetes cluster.
  • They had enabled AWS Security Hub and were receiving alerts about version updates for their software.
  • However, the team was ignoring these alerts and suppressing the security warnings.

Lessons Learned:

  • Establish a clear process for evaluating and implementing security updates.
  • Ensure effective communication between security, development, and operations teams to address vulnerabilities promptly.
  • Automate the update process and validation to avoid manual errors and delays.
  • Implement governance and approval workflows to manage changes to critical production environments.
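
One way to replace blanket suppression with a process is a small triage rule, sketched below. The severity labels and the `SUPPRESSED` status are loosely modeled on values that appear in Security Hub findings, but the flat dictionary shape, the remediation deadlines, and the queue names are all assumptions for illustration.

```python
from typing import Dict, List

# Assumed policy: days allowed before a finding of each severity must be fixed.
REMEDIATION_DAYS = {"CRITICAL": 2, "HIGH": 7, "MEDIUM": 30, "LOW": 90}


def triage(findings: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Sort findings into queues instead of silently suppressing them."""
    queues: Dict[str, List[str]] = {"escalate": [], "schedule": [], "accept": []}
    for f in findings:
        severity = f["severity"]
        suppressed = f["status"] == "SUPPRESSED"
        if severity in ("CRITICAL", "HIGH") and suppressed:
            # A serious finding was suppressed: a human must review it.
            queues["escalate"].append(f["id"])
        elif severity in REMEDIATION_DAYS and not suppressed:
            # Open finding: attach a deadline from the assumed policy.
            days = REMEDIATION_DAYS[severity]
            queues["schedule"].append(f"{f['id']}: fix within {days} days")
        else:
            # Suppressed lower-severity or informational finding.
            queues["accept"].append(f["id"])
    return queues


sample = [
    {"id": "f-1", "severity": "HIGH", "status": "SUPPRESSED"},
    {"id": "f-2", "severity": "MEDIUM", "status": "NEW"},
    {"id": "f-3", "severity": "LOW", "status": "SUPPRESSED"},
    {"id": "f-4", "severity": "CRITICAL", "status": "NEW"},
]
result = triage(sample)
print(result["escalate"])  # ['f-1']
print(result["schedule"])  # ['f-2: fix within 30 days', 'f-4: fix within 2 days']
```

The point of the escalate queue is that suppression stays possible but becomes visible: a suppressed high-severity finding goes to review rather than disappearing.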

Failure 3: Multi-Account and Cost Efficiency Challenges

  • The customer had multiple AWS accounts, but they were not properly managed or optimized.
  • There was no clear policy or governance around instance sizing, shutdown schedules, or cost optimization.
  • This resulted in significant waste and unnecessary costs, estimated at around $60,000 per month.

Lessons Learned:

  • Establish a centralized cost management and optimization strategy across all AWS accounts.
  • Implement instance sizing guidelines and automated shutdown policies to reduce waste.
  • Develop a comprehensive governance framework to manage resources and changes across multiple accounts.
  • Leverage AWS tools and services (e.g., AWS Organizations, AWS Cost Explorer) to gain visibility and control over cloud costs.
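
The effect of an automated shutdown policy is simple arithmetic, sketched here. The hourly rate, instance count, and schedule are hypothetical and are not figures from the talk; the function only compares an always-on fleet with one that runs on a fixed weekly schedule.

```python
def weekly_hours(hours_per_day: float, days_per_week: int) -> float:
    """Hours an instance runs in one week under a fixed schedule."""
    return hours_per_day * days_per_week


def monthly_savings(hourly_rate: float, instances: int,
                    hours_per_day: float, days_per_week: int,
                    weeks_per_month: float = 52 / 12) -> float:
    """Cost avoided by running only on schedule instead of 24x7."""
    always_on = weekly_hours(24, 7)                       # 168 hours
    scheduled = weekly_hours(hours_per_day, days_per_week)
    return hourly_rate * instances * (always_on - scheduled) * weeks_per_month


# Hypothetical fleet: 50 non-production instances at $0.10 per hour,
# kept running 12 hours a day on weekdays only.
saved = monthly_savings(hourly_rate=0.10, instances=50,
                        hours_per_day=12, days_per_week=5)
fraction = 1 - weekly_hours(12, 5) / weekly_hours(24, 7)
print(round(saved, 2))         # about 2340 dollars per month
print(round(fraction * 100))   # 64 (percent of running hours removed)
```

Under these assumptions a weekday-only schedule removes roughly two thirds of the running hours, which is why shutdown policies tend to be among the first cost controls applied to non-production accounts.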

Key Takeaways

  1. Resilience is an architectural decision, not just a checkbox. Distribute resources across multiple availability zones and regions, and implement auto-scaling and caching strategies to improve availability and performance.
  2. Security and operational excellence go hand-in-hand. Establish clear processes for evaluating and implementing updates, with effective communication and automation to avoid manual errors.
  3. Multi-account management requires a comprehensive governance framework. Centralize cost optimization, instance sizing, and shutdown policies to eliminate waste and improve efficiency.

Addressing these three failures helps organizations build cloud architectures that are more resilient, more secure, and more cost-effective under real-world conditions.
