Deep dive into Amazon ECS resilience and availability (SVS409)

Overview of Amazon ECS

Amazon ECS is a native container orchestration service on AWS that celebrated its 10th birthday this year.

ECS allows customers to provision workloads in containers on the AWS cloud using the same APIs, tools, and look and feel as other AWS services.

Customers can run containers on either Amazon EC2 instances or AWS Fargate, a serverless compute engine for containers.

ECS is a foundational service in AWS, deployed in 34 regions globally, and runs over 2.4 billion tasks per week.

More than 65% of AWS customers start their container journey with ECS, and over 70% of those choose to use Fargate.

ECS is not only used by customers but also powers many internal AWS services as a foundational building block.

Availability and Resilience in ECS

Foundational Building Blocks

AWS regions use a shared-nothing architecture: each region operates independently and has no dependency on any other region.

Availability zones within a region are failure-isolated groups of one or more data centers, each with redundant power and networking.

These region and availability zone constructs are fundamental building blocks for building highly resilient services.

Availability Patterns

The ECS control plane is deployed in every region, with instances in at least three availability zones.

ECS uses a pre-scaling approach: across three availability zones, the service is provisioned to 150% of peak demand in total, so each zone carries capacity for 50% of peak.

This ensures that if one availability zone fails, the other two can absorb the traffic without service interruption.
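
The arithmetic behind the 150% figure can be sketched as follows. This is a minimal illustration, assuming capacity is spread evenly across zones and that the goal is to survive the loss of exactly one zone; the function names are hypothetical and not part of any ECS API.

```python
def per_az_capacity(peak_demand: float, az_count: int) -> float:
    """Capacity each zone needs so the remaining zones cover peak after one is lost."""
    if az_count < 2:
        raise ValueError("need at least two zones to survive the loss of one")
    # After losing one zone, (az_count - 1) zones must carry the full peak.
    return peak_demand / (az_count - 1)


def total_provisioned_ratio(az_count: int) -> float:
    """Total provisioned capacity as a multiple of peak demand."""
    return az_count / (az_count - 1)


if __name__ == "__main__":
    print(per_az_capacity(100.0, 3))   # capacity units per zone for a peak of 100
    print(total_provisioned_ratio(3))  # total capacity relative to peak
```

With three zones, each zone holds half of peak and the total is 1.5 times peak, which matches the 150% described above. Under the same assumption, adding zones lowers the overhead, for example four zones would need roughly 133% of peak.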

ECS also uses a cellular architecture, where the control plane is partitioned into multiple "cells" within a region, each managing a subset of clusters.

This contains the blast radius of any software issues or deployments, allowing for faster recovery and rollback.
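
One way to picture cell partitioning is a stable mapping from a cluster to a cell. The sketch below is an assumption for illustration only: the talk summary does not say how ECS assigns clusters to cells, and the hash-based scheme, the function name, and the example ARN are all hypothetical.

```python
import hashlib


def cell_for_cluster(cluster_arn: str, cell_count: int) -> int:
    """Map a cluster to one of cell_count cells (hypothetical hash-based scheme)."""
    if cell_count < 1:
        raise ValueError("cell_count must be positive")
    # A cryptographic hash gives a stable, evenly spread assignment.
    digest = hashlib.sha256(cluster_arn.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % cell_count


if __name__ == "__main__":
    arn = "arn:aws:ecs:us-east-1:123456789012:cluster/example"  # made-up ARN
    print(cell_for_cluster(arn, cell_count=8))
```

Under this scheme a fault confined to one cell touches only the clusters mapped to it, roughly one in cell_count of the clusters in the region.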

Resilience through Automation

ECS uses rolling deployments, where changes are first deployed to a single cell and availability zone, building confidence before expanding.

ECS monitors deployments and automatically rolls back to the previous known-good version if issues are detected.
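
The staged rollout with automatic rollback can be sketched roughly as below. This is not the ECS deployment system: the wave sizing (doubling each wave), the callback names, and the (cell, zone) target shape are assumptions chosen to illustrate starting with one cell and one zone, then widening.

```python
from typing import Callable, List, Sequence, Tuple

Target = Tuple[str, str]  # (cell, availability zone)


def build_waves(cells: Sequence[str], azs: Sequence[str]) -> List[List[Target]]:
    """Group targets into waves: one target first, then doubling wave sizes."""
    targets = [(cell, az) for cell in cells for az in azs]
    waves, start, size = [], 0, 1
    while start < len(targets):
        waves.append(targets[start:start + size])
        start += size
        size *= 2
    return waves


def rollout(
    waves: List[List[Target]],
    deploy: Callable[[Target], None],
    healthy: Callable[[Target], bool],
    rollback: Callable[[Target], None],
) -> bool:
    """Deploy wave by wave; on any unhealthy target, roll back everything deployed."""
    done: List[Target] = []
    for wave in waves:
        for target in wave:
            deploy(target)
            done.append(target)
        if not all(healthy(target) for target in wave):
            # Undo in reverse order so the newest change is reverted first.
            for target in reversed(done):
                rollback(target)
            return False
    return True
```

The first wave contains a single (cell, zone) pair, so a bad change is caught while it affects the smallest possible slice of the fleet.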

ECS also provides automated availability zone weigh-away: it detects a problematic availability zone and automatically routes traffic away from it.
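
Shifting traffic away from an impaired zone can be pictured as recomputing per-zone routing weights. The sketch below is a simplified assumption: the error-rate signal, the 5% threshold, and the function name are hypothetical, and the summary does not describe how ECS actually detects an impaired zone.

```python
from typing import Dict


def zone_weights(error_rates: Dict[str, float], threshold: float = 0.05) -> Dict[str, float]:
    """Return routing weights that send no traffic to zones above the error threshold."""
    if not error_rates:
        return {}
    healthy = [az for az, rate in error_rates.items() if rate <= threshold]
    if not healthy:
        # Never shift away from every zone at once; fall back to an even split.
        healthy = list(error_rates)
    share = 1.0 / len(healthy)
    return {az: (share if az in healthy else 0.0) for az in error_rates}


if __name__ == "__main__":
    print(zone_weights({"az-a": 0.01, "az-b": 0.40, "az-c": 0.02}))
```

The guard for the case where every zone looks unhealthy matters in practice: a detector that can drain all zones simultaneously turns a monitoring fault into an outage.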

The AZ rebalance feature automatically rebalances tasks across availability zones if an imbalance is detected.
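
A rough sketch of what detecting and correcting an imbalance could look like is shown below. It assumes a simple rule (zones are balanced when task counts differ by at most one) and a greedy move plan; both are illustrative assumptions, not the algorithm ECS uses.

```python
from typing import Dict, List, Sequence, Tuple


def rebalance_plan(task_azs: Sequence[str], azs: Sequence[str]) -> List[Tuple[str, str]]:
    """Return (from_zone, to_zone) moves until task counts differ by at most one."""
    counts: Dict[str, int] = {az: 0 for az in azs}
    for az in task_azs:
        counts[az] = counts.get(az, 0) + 1
    moves: List[Tuple[str, str]] = []
    while counts and max(counts.values()) - min(counts.values()) > 1:
        src = max(counts, key=counts.get)  # most loaded zone
        dst = min(counts, key=counts.get)  # least loaded zone
        counts[src] -= 1
        counts[dst] += 1
        moves.append((src, dst))
    return moves


if __name__ == "__main__":
    tasks = ["az-a"] * 4 + ["az-b"] + ["az-c"]
    print(rebalance_plan(tasks, ["az-a", "az-b", "az-c"]))
```

In a real service each move would mean starting a replacement task in the target zone and stopping one in the source zone only after the replacement is healthy, so capacity never dips during the rebalance.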

ECS supports local container restart, allowing containers to quickly restart on the same host in case of failures, without going through the full control plane.
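
The idea of restarting in place, rather than asking the control plane to schedule a replacement, can be illustrated with a small supervisor loop. This is a conceptual sketch only: run_container is a hypothetical callable standing in for launching the container and returning its exit code, and the restart limit is an assumed parameter, not an ECS setting.

```python
from typing import Callable, List


def supervise(run_container: Callable[[], int], max_restarts: int = 3) -> List[int]:
    """Run a container on the local host, restarting it in place on non-zero exit."""
    exit_codes: List[int] = []
    # One initial run plus up to max_restarts local restarts.
    for _ in range(max_restarts + 1):
        code = run_container()
        exit_codes.append(code)
        if code == 0:
            break  # clean exit, nothing to restart
    return exit_codes
```

Because the decision is made on the host, recovery does not wait on a round trip to the control plane and keeps working even if the control plane is briefly unreachable.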

Continuous Improvement through Chaos Engineering

ECS runs regular chaos experiments (called "game days") to test the resilience of the service.

The process involves:

Preparation: Defining the experiment and expected outcomes.
Detection: Verifying that the expected behaviors and signals were observed during the experiment.
Response: Ensuring the right people and automations were engaged to resolve the issue.
Learning: Conducting a postmortem to identify areas for improvement, both operationally and architecturally.

ECS uses a "Correction of Errors" (CoE) process to document and track learnings from these experiments and outages.

CoE focuses on identifying root causes and contributing factors, not on assigning blame.

The learnings from CoEs have directly influenced the roadmap and development of new ECS features, such as non-blocking I/O and automated availability zone weigh-away.
