Architecting Resilient Multicloud Operations: Lessons from Monzo Bank
Understanding Multicloud and Resilience
Multicloud refers to operating business applications across more than one cloud service provider (CSP).
Resilience is about preventing, mitigating, and recovering from failures as quickly as possible across the application stack.
Resilience involves practices like high availability (HA) and disaster recovery (DR), as well as broader considerations around observability, automation, and alignment of people, processes, and technology.
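The SEEMS Framework for Resilience
The SEEMS framework takes its name from the five failure modes it addresses (single points of failure, excessive load, excessive latency, misconfiguration and bugs, and shared fate). It outlines five key capabilities required for resilient workloads, each countering one of those failure modes: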
Redundancy: Avoiding single points of failure
Sufficient Capacity: Handling excessive load
Timely Output: Preventing excessive latency
Meaningful Behavior: Avoiding misconfiguration and bugs
Fault Isolation: Preventing shared fate across boundaries
Multicloud Resilience Best Practices
Leverage Fault Isolation Boundaries: Separate CSPs act as natural fault isolation boundaries, so a failure in one provider can be contained within that provider.
Implement a "Lifeboat" Strategy: Deploy a minimal set of critical functionality in a secondary CSP to act as a backup when the primary fails.
Understand Data Access Patterns: Evaluate how data is shared between the primary and secondary environments to manage load and latency.
Avoid Single Points of Failure: Carefully design communication, CI/CD, security, and network components so that no single component can take down both environments.
Test Extensively: Regularly test the system under load, measure latency, and validate behavior in both the primary and secondary environments.
Align People, Processes, and Technology: Ensure resilience is embedded across the software development lifecycle, with clear roles, responsibilities, and documented procedures.
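The "lifeboat" practice above depends on a clear rule for when traffic moves to the secondary environment. The following is a minimal sketch of such a rule, not a description of any real system: the class name FailoverRouter, the consecutive-failure threshold, and the manual override field are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FailoverRouter:
    """Chooses between a primary and a lifeboat (secondary) environment.

    Hypothetical sketch: names and thresholds are illustrative assumptions.
    """
    failure_threshold: int = 3             # consecutive failed checks before failing over
    manual_override: Optional[str] = None  # "primary", "secondary", or None
    _consecutive_failures: int = field(default=0, init=False)

    def record_health_check(self, primary_healthy: bool) -> None:
        # A success resets the counter so isolated blips do not trigger failover.
        if primary_healthy:
            self._consecutive_failures = 0
        else:
            self._consecutive_failures += 1

    def target(self) -> str:
        # An operator's explicit decision wins over the automatic signal.
        if self.manual_override is not None:
            return self.manual_override
        if self._consecutive_failures >= self.failure_threshold:
            return "secondary"
        return "primary"


router = FailoverRouter(failure_threshold=3)
for healthy in [True, False, False, False]:
    router.record_health_check(healthy)
print(router.target())  # secondary
```

Requiring several consecutive failures trades a slightly slower failover for fewer spurious switches. The right threshold depends on how costly each kind of mistake is for the workload.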
Monzo Bank's Multicloud Resilience Strategy
Monzo, a digital-only bank, built a "Monzo Standin" platform in Google Cloud as a secondary environment to its primary AWS platform.
Key features of Monzo's approach:
Monzo Standin is a simplified version of the primary application, focused on critical functionality like payments and account management.
Data is continuously synchronized from the primary platform to Monzo Standin using an event-driven architecture.
Monzo can automatically or manually route traffic to Monzo Standin in the event of an outage in the primary platform.
Monzo Standin is tested daily by enrolling a subset of real customers to use the platform, and by running shadow testing to compare decisions between the two environments.
Monzo Standin can directly connect to payment networks if the primary platform is unavailable, avoiding a single point of failure.
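Two of the features above, event-driven synchronization and shadow testing, can be illustrated with a small sketch. This is not Monzo's implementation: BalanceEvent, StandinLedger, and shadow_compare are hypothetical names, and the balance check stands in for whatever decision logic a real system would run. The sketch assumes events may be delivered more than once, so the secondary side applies them idempotently.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class BalanceEvent:
    event_id: str
    account_id: str
    delta_pence: int  # positive for credits, negative for debits


@dataclass
class StandinLedger:
    """Secondary-side replica of balances, built only from events (hypothetical)."""
    balances: dict = field(default_factory=dict)
    seen: set = field(default_factory=set)

    def apply(self, event: BalanceEvent) -> None:
        # Events may be redelivered, so applying one twice must be a no-op.
        if event.event_id in self.seen:
            return
        self.seen.add(event.event_id)
        self.balances[event.account_id] = (
            self.balances.get(event.account_id, 0) + event.delta_pence
        )

    def authorize(self, account_id: str, amount_pence: int) -> bool:
        # Simplified decision: approve only if the replicated balance covers it.
        return self.balances.get(account_id, 0) >= amount_pence


def shadow_compare(primary_decision: bool, standin: StandinLedger,
                   account_id: str, amount_pence: int) -> bool:
    """Return True when the stand-in agrees with the primary's decision."""
    return standin.authorize(account_id, amount_pence) == primary_decision


ledger = StandinLedger()
events = [
    BalanceEvent("e1", "acc-1", 5000),
    BalanceEvent("e2", "acc-1", -1200),
    BalanceEvent("e2", "acc-1", -1200),  # duplicate delivery, ignored
]
for e in events:
    ledger.apply(e)

print(ledger.balances["acc-1"])                     # 3800
print(shadow_compare(True, ledger, "acc-1", 3000))  # True
print(shadow_compare(True, ledger, "acc-1", 4000))  # False: a mismatch to investigate
```

In a shadow test the stand-in's answer is only recorded and compared, never acted on, so mismatches reveal drift between the two environments without affecting customers.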
Lessons and Considerations
Monzo's approach reduced the cost of maintaining Monzo Standin to only 1% of the cost of the primary platform, even though it runs continuously.
The reduced complexity of Monzo Standin makes it easier to maintain and test, with fewer than 1% of changes made explicitly for the secondary platform.
This strategy is most effective for organizations that:
Operate both client-side and server-side components
Cannot tolerate any downtime for critical business functions
Can accept certain trade-offs, such as reduced functionality in the secondary environment