Deep dive into Amazon ECS resilience and availability (SVS409)

This is a detailed summary of the session recording, with the key takeaways grouped into sections for readability.

Overview of Amazon ECS

  • Amazon ECS is a native container orchestration service on AWS that celebrated its 10th birthday this year.
  • ECS allows customers to provision workloads in containers on the AWS cloud using the same APIs, tools, and look and feel as other AWS services.
  • Customers can run containers on either Amazon EC2 instances or AWS Fargate, a serverless compute engine for containers.
  • ECS is a foundational service in AWS, deployed in 34 regions globally, and runs over 2.4 billion tasks per week.
  • More than 65% of AWS customers start their container journey with ECS, and over 70% of those choose to use Fargate.
  • ECS is not only used by customers but also powers many internal AWS services as a foundational building block.

Availability and Resilience in ECS

Foundational Building Blocks

  • AWS regions use a shared-nothing architecture: each region operates independently and has no dependency on the others.
  • Availability zones within a region are failure-isolated groups of one or more data centers, each with redundant power and networking.
  • These region and availability zone constructs are fundamental building blocks for building highly resilient services.

Availability Patterns

  • The ECS control plane is deployed in every region, with instances in at least three availability zones.
  • ECS uses a pre-scaling approach: services are provisioned at 150% of peak demand in total, spread across availability zones. With three zones, that works out to 50% of peak per zone.
  • If one availability zone fails, the remaining two still hold 100% of peak capacity, so they can absorb the traffic without scaling up first and without service interruption.
  • ECS also uses a cellular architecture, where the control plane is partitioned into multiple "cells" within a region, each managing a subset of clusters.
  • This contains the blast radius of any software issues or deployments, allowing for faster recovery and rollback.
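
The pre-scaling arithmetic and the idea of partitioning clusters into cells can be illustrated with a short Python sketch. It assumes capacity is spread evenly across zones. The summary does not say how ECS assigns clusters to cells, so `cell_for_cluster`, the stable-hash mapping, and the cell count are hypothetical and shown only to make the partitioning concrete.

```python
import hashlib


def per_zone_capacity(peak_demand: float, zones: int = 3, headroom: float = 1.5) -> float:
    """Capacity held by each zone when total capacity is headroom * peak, spread evenly."""
    return peak_demand * headroom / zones


def surviving_capacity(
    peak_demand: float, zones: int = 3, headroom: float = 1.5, failed: int = 1
) -> float:
    """Total capacity that remains after `failed` zones are lost."""
    return per_zone_capacity(peak_demand, zones, headroom) * (zones - failed)


def cell_for_cluster(cluster_id: str, cell_count: int) -> int:
    """Hypothetical mapping (an assumption, not the ECS implementation):
    a stable hash of the cluster identifier, modulo the number of cells."""
    digest = hashlib.sha256(cluster_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % cell_count


if __name__ == "__main__":
    peak = 1000.0  # illustrative peak demand, in arbitrary units
    print("per zone:", per_zone_capacity(peak))
    print("after losing one zone:", surviving_capacity(peak))
    print("cell:", cell_for_cluster("example-cluster", cell_count=8))
```

With three zones and 150% headroom, losing one zone leaves exactly the peak demand covered. Because each cluster always maps to the same cell, a bad deployment to one cell can only affect the clusters in that cell.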

Resilience through Automation

  • ECS uses rolling deployments, where changes are first deployed to a single cell and availability zone, building confidence before expanding.
  • ECS monitors deployments and automatically rolls back to the previous known-good version if issues are detected.
  • ECS also provides automated availability zone weigh-away: it detects a problematic availability zone and automatically routes traffic away from it.
  • The AZ rebalance feature automatically rebalances tasks across availability zones if an imbalance is detected.
  • ECS supports local container restart, allowing containers to quickly restart on the same host in case of failures, without going through the full control plane.
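
The staged rollout with automatic rollback and the imbalance check behind AZ rebalance can be sketched in Python as follows. This is a simplified model under stated assumptions, not ECS's implementation: the `deploy`, `healthy`, and `rollback` callbacks are hypothetical placeholders, and "imbalanced" is assumed here to mean that task counts across zones differ by more than one.

```python
from collections import Counter
from typing import Callable, Sequence


def staged_rollout(
    stages: Sequence[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> bool:
    """Deploy one stage at a time. On the first unhealthy stage, roll back
    every stage touched so far (newest first) and stop."""
    done = []
    for stage in stages:
        deploy(stage)
        done.append(stage)
        if not healthy(stage):
            for touched in reversed(done):
                rollback(touched)
            return False
    return True


def is_imbalanced(task_zones: Sequence[str], zones: Sequence[str]) -> bool:
    """Assumed definition: task counts across zones differ by more than one."""
    counts = Counter(task_zones)
    per_zone = [counts.get(zone, 0) for zone in zones]
    return max(per_zone) - min(per_zone) > 1


if __name__ == "__main__":
    # Stage names are illustrative: one cell and zone first, then wider.
    stages = ["cell-1/az-a", "cell-1/all-zones", "all-cells"]
    ok = staged_rollout(
        stages,
        deploy=lambda s: print("deploy", s),
        healthy=lambda s: s != "cell-1/all-zones",  # simulate a failure at stage 2
        rollback=lambda s: print("rollback", s),
    )
    print("rollout succeeded:", ok)
    print("imbalanced:", is_imbalanced(["a", "a", "a", "b"], ["a", "b", "c"]))
```

Starting with the smallest stage means a faulty change is caught while it affects only one cell and one zone, and the rollback only has to undo the stages already touched.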

Continuous Improvement through Chaos Engineering

  • ECS runs regular chaos experiments (called "game days") to test the resilience of the service.
  • The process involves:
    1. Preparation: Defining the experiment and expected outcomes.
    2. Detection: Verifying that the expected behaviors and signals were observed during the experiment.
    3. Response: Ensuring the right people and automations were engaged to resolve the issue.
    4. Learning: Conducting a postmortem to identify areas for improvement, both operationally and architecturally.
  • ECS uses a "Correction of Errors" (CoE) process to document and track learnings from these experiments and outages.
  • CoE focuses on identifying root causes and contributing factors, not on assigning blame.
  • The learnings from CoEs have directly influenced the roadmap and development of new ECS features, such as non-blocking I/O and automated availability zone weigh-away.

Additional Resources

  • Attendees are encouraged to check out the other ECS sessions at re:Invent 2024.
  • A QR code is provided that links to a landing page with the presentation deck, related resources, and a recording of the session.