Best practices for creating multi-Region architectures on AWS (ARC323)

Establishing a Robust Multi-Region Architecture with Failover Strategies

The Fundamentals of Multi-Region Architecture

  1. Understanding the Requirements:

    • Define and bound the recovery time objectives (RTOs) to set appropriate expectations for stakeholders and design the architecture accordingly.
    • Measure and track recovery times accurately, considering the entire process from problem detection to full resolution.
    • Common availability metrics include Mean Time to Detection (MTTD), Mean Time to Recovery (MTTR), and Mean Time Between Failures (MTBF).
  2. Understanding the Data:

    • Identify where the data is stored, how it is replicated (synchronously or asynchronously), and the consistency level.
    • Determine the acceptable data loss (recovery point objective) and ensure the replication delay meets this requirement.
    • Evaluate the recovery mechanisms (e.g., backup and restore vs. active replication) to meet the recovery time objectives.
  3. Understanding Dependencies:

    • Ensure that each region can operate independently without dependencies on other regions.
    • Recovery actions should be taken from the recovery region, without relying on the potentially impaired workload.
    • Map out critical user stories and their dependencies to prioritize the architecture changes.
  4. Operational Readiness:

    • Test the failover process regularly to identify improvements, such as scaling characteristics, hidden dependencies, and configuration drift.
    • Implement differential observability to monitor the application from multiple perspectives, including user experience and internal metrics.
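
To make the recovery metrics above concrete, the following is a minimal sketch of how MTTD, MTTR, and MTBF could be derived from incident timestamps and compared against the agreed objectives. The `Incident` record, the function names, and the convention of measuring MTTR from the start of the impairment (so that detection time is included) are assumptions for illustration; teams that measure from detection would adjust the subtraction accordingly.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime    # when the impairment began
    detected: datetime   # when monitoring or a person noticed it
    resolved: datetime   # when the workload was fully restored


def _mean(deltas):
    """Average a non-empty list of timedeltas."""
    return sum(deltas, timedelta()) / len(deltas)


def availability_metrics(incidents):
    """Compute MTTD, MTTR, and MTBF from a list of incidents."""
    if not incidents:
        raise ValueError("at least one incident is required")
    ordered = sorted(incidents, key=lambda i: i.started)
    mttd = _mean([i.detected - i.started for i in ordered])
    # Assumption: recovery time runs from the start of the impairment.
    mttr = _mean([i.resolved - i.started for i in ordered])
    # MTBF here is the mean healthy interval between consecutive incidents.
    gaps = [b.started - a.resolved for a, b in zip(ordered, ordered[1:])]
    mtbf = _mean(gaps) if gaps else None
    return {"mttd": mttd, "mttr": mttr, "mtbf": mtbf}


def meets_objectives(mttr, rto, replication_lag, rpo):
    """Compare measured recovery time and replication lag with RTO and RPO."""
    return {"rto_met": mttr <= rto, "rpo_met": replication_lag <= rpo}
```

Feeding the results of each failover test into a check like this gives stakeholders a measured number to compare against the RTO and RPO rather than an estimate.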

Organizational Failover Strategies

  1. Component-level Failover:

    • Allows failing over individual application components to a different region.
    • Provides flexibility but introduces complexity and potential performance trade-offs.
  2. Application Failover:

    • Allows each application to fail over all of its components together to a secondary region.
    • Reduces complexity, but performance can still vary, for example when an application that has failed over calls dependencies that remain in the original region.
  3. Dependency-based Failover:

    • Keeps interconnected components that support a user story together during failover.
    • Reduces latency and complexity but requires building and managing dependency graphs.
  4. Portfolio Failover:

    • Fails over all applications in the portfolio simultaneously, whether or not each one is directly affected.
    • Simplifies the operational burden but requires a significant investment to enable multi-region capabilities.
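
As an illustration of dependency-based failover, the sketch below walks a dependency graph to find every component that supports a user story, so that the whole group can be moved together. The graph shape (a dict mapping each component to its direct dependencies) and the component names in the usage note are hypothetical.

```python
from collections import deque


def failover_group(dependencies, entry_points):
    """Return every component reachable from the user story's entry points.

    dependencies: dict mapping a component name to the names it calls directly.
    entry_points: components that the user story touches first.
    """
    group = set()
    queue = deque(entry_points)
    while queue:
        component = queue.popleft()
        if component in group:
            continue  # already visited; also protects against cycles
        group.add(component)
        queue.extend(dependencies.get(component, []))
    return group
```

For example, with `{"checkout-api": ["cart-service", "payment-service"], "payment-service": ["ledger-db"], "search-api": ["search-index"]}`, the group for `["checkout-api"]` contains the checkout API, the cart and payment services, and the ledger database, while the search components stay where they are.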

Determining When to Fail Over

  • Develop a pre-planned failover decision tree with clear, unambiguous metrics and thresholds.
  • Integrate the decision tree into the overall operational continuity strategies, involving cross-functional stakeholders.
  • Handle scenarios with missing information, such as when the decision-maker is unavailable or when key metrics are not available.
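
A pre-planned decision tree can be written down as code so that the thresholds are unambiguous and the missing-information branches are explicit. The sketch below is one possible shape, not a recommended policy: the signal names, the 5% error-rate and 15-minute thresholds, and the rule that failover proceeds under pre-delegated authority when the approver is unreachable are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    HOLD = "hold"
    ESCALATE = "escalate"
    REQUEST_APPROVAL = "request approval"
    FAIL_OVER = "fail over"


@dataclass
class Signals:
    error_rate: Optional[float]        # fraction of failed requests; None if unavailable
    minutes_impaired: Optional[float]  # time since impairment began; None if unknown
    approver_reachable: bool           # can the designated decision-maker be contacted?


# Hypothetical thresholds; real values come from the agreed RTO and SLOs.
ERROR_RATE_THRESHOLD = 0.05
MAX_MINUTES_IMPAIRED = 15.0


def decide(signals: Signals) -> Action:
    """Walk the decision tree from top to bottom and return one action."""
    if signals.error_rate is None or signals.minutes_impaired is None:
        # Key metrics are missing: hand off to a human with secondary signals.
        return Action.ESCALATE
    if signals.error_rate < ERROR_RATE_THRESHOLD:
        return Action.HOLD
    if signals.minutes_impaired < MAX_MINUTES_IMPAIRED:
        # Impact is real but still inside the window allowed for in-place recovery.
        return Action.HOLD
    if signals.approver_reachable:
        return Action.REQUEST_APPROVAL
    # Thresholds breached and no approver: act on pre-delegated authority.
    return Action.FAIL_OVER
```

Keeping the thresholds as named constants makes them easy to review with cross-functional stakeholders and to change without touching the branching logic.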

Recovery Best Practices

  1. Declare a Disaster:

    • Clearly outline the situation, action, and owner for each potential disaster scenario in the runbook.
    • Establish communication protocols for employees during a disaster.
  2. Execution of Failover:

    • Keep detailed, repeatable procedures in the runbook, including scripts, expected outcomes, and responsible parties.
    • Maintain necessary credentials and permissions in both primary and secondary regions.
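
One way to keep runbook steps repeatable is to store each step as data that pairs a script with its expected outcome and owner, then run the steps in order and stop at the first mismatch. The following is a minimal sketch under that assumption; the `Step` fields and the `run_runbook` helper are hypothetical, and the actions are plain Python callables standing in for real failover scripts.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    name: str                    # what the step does
    owner: str                   # responsible party
    action: Callable[[], str]    # script to run; returns the observed outcome
    expected: str                # outcome that signals success


def run_runbook(steps: List[Step]) -> List[Tuple[str, str, str, bool]]:
    """Run steps in order and record (name, owner, outcome, ok) for each.

    Execution stops after the first step whose outcome differs from the
    expected one, so later steps never run against an unknown state.
    """
    log = []
    for step in steps:
        outcome = step.action()
        ok = outcome == step.expected
        log.append((step.name, step.owner, outcome, ok))
        if not ok:
            break
    return log
```

The returned log doubles as a record of who owned each step and what was observed, which is useful when reviewing a failover test afterwards.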

Samsung Account's Multi-Region Journey

  • Established a global disaster recovery architecture with three AWS regions (EU, US, and AP).
  • Implemented a microservice architecture with over 70 microservices running on Amazon EKS.
  • Used Amazon Aurora, DynamoDB, and ElastiCache for data storage and caching.
  • Leveraged CloudFront for web and API performance improvement.
  • Adopted a DNS-based traffic routing control method, combined with CloudFront Edge Locations, to enable efficient failover.
  • Implemented comprehensive monitoring, including service-level and end-to-end user experience monitoring, to enhance observability and fault detection.
  • Established a three-tier operational structure with fast decision-making and remediation in the event of failures.
