Building production-grade resilient architectures with Amazon EKS (KUB404)

Platform Engineering on EKS

  • Platforms are built by platform engineers to provide cloud infrastructure as a service for application teams.
  • Platform teams are organized by teams, applications, and infrastructure.
  • EKS adoption is growing: the number of clusters being managed is up 33% year over year.

Cluster Lifecycle Management

  • Unmanaged growth of EKS clusters can lead to challenges:
    • Difficulty in enforcing standards across the fleet of clusters
    • Automation challenges
    • Lack of a single source of truth
    • Addon management
    • Workload matching and cost optimization

Cluster Management Patterns

  • Platform teams are shifting from providing templates as a service to offering more managed services:
    • Cluster-as-a-Service
    • Namespace-as-a-Service
    • Application Deployment-as-a-Service

GitOps-Driven Cluster Management

  • Using GitOps for cluster management provides benefits like reduced complexity, enhanced visibility, and increased security.
  • The cluster's bill of materials includes the control plane, worker nodes, and addons, all of which can be managed through GitOps.
  • Argo CD can be used as the GitOps agent to reconcile the desired state with the actual state of the cluster.
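
The reconciliation idea behind the bullets above can be sketched in a few lines of Python. This is only an illustration of comparing desired state (what is declared in Git) with actual state (what the cluster reports); it is not Argo CD's API. The `ClusterState` class, its fields, and the action strings are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterState:
    """Hypothetical, simplified bill of materials for one cluster."""
    control_plane_version: str
    node_ami_version: str
    addons: dict = field(default_factory=dict)  # addon name -> version


def plan_changes(desired, actual):
    """Return the actions needed to move `actual` toward `desired`."""
    actions = []
    if desired.control_plane_version != actual.control_plane_version:
        actions.append(
            f"upgrade control plane {actual.control_plane_version}"
            f" -> {desired.control_plane_version}"
        )
    if desired.node_ami_version != actual.node_ami_version:
        actions.append(
            f"roll worker nodes {actual.node_ami_version}"
            f" -> {desired.node_ami_version}"
        )
    # Addons declared in Git: install the missing ones, upgrade drifted ones.
    for name, version in sorted(desired.addons.items()):
        current = actual.addons.get(name)
        if current is None:
            actions.append(f"install addon {name} {version}")
        elif current != version:
            actions.append(f"upgrade addon {name} {current} -> {version}")
    # Addons present in the cluster but absent from Git are pruned.
    for name in sorted(set(actual.addons) - set(desired.addons)):
        actions.append(f"remove addon {name}")
    return actions


if __name__ == "__main__":
    desired = ClusterState("1.28", "ami-b", {"coredns": "v1.10.1"})
    actual = ClusterState("1.27", "ami-a", {"coredns": "v1.9.3"})
    for action in plan_changes(desired, actual):
        print(action)
```

An empty plan means the cluster already matches what is declared in Git, which is the steady state a GitOps agent works toward.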

Cluster Resiliency and Upgrades

  • Upgrading clusters in batches requires safeguards to ensure resiliency and availability.
  • The EKS team uses a "cell" approach to upgrade clusters, where a "cell" represents a unit of work (e.g., a single cluster) that is upgraded in waves.
  • The time between waves (the "bake" or "soak" time) shortens as later waves cover more cells, and different levels of testing are performed between waves.
  • This pattern can be applied to your own EKS clusters, with the GitOps-driven process used to orchestrate the rollout.
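
A minimal sketch of wave planning, under stated assumptions: each wave grows by a fixed factor and the bake time shrinks by the same factor, down to a floor. The summary does not give the EKS team's actual wave sizes or bake times, so the function name `plan_waves`, its parameters, and the default values below are hypothetical.

```python
def plan_waves(cells, first_wave_size=1, growth=2,
               first_bake_hours=24.0, min_bake_hours=1.0):
    """Split `cells` into waves that grow in size while bake time shrinks.

    `bake_hours` is the soak time observed after a wave completes,
    before the next wave is allowed to start.
    """
    if first_wave_size < 1 or growth < 1:
        raise ValueError("first_wave_size and growth must be at least 1")
    waves = []
    index, size, bake = 0, first_wave_size, first_bake_hours
    while index < len(cells):
        batch = cells[index:index + size]
        waves.append({"cells": batch, "bake_hours": bake})
        index += len(batch)
        size *= growth                              # later waves cover more cells
        bake = max(min_bake_hours, bake / growth)   # and soak for less time
    return waves


if __name__ == "__main__":
    clusters = [f"cluster-{i}" for i in range(7)]
    for number, wave in enumerate(plan_waves(clusters), start=1):
        print(number, wave["cells"], wave["bake_hours"])
```

In a GitOps-driven rollout, each wave could correspond to one commit that bumps the version for the clusters in that wave, with the next commit gated on the bake time and the tests run in between.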

Observability

Roles and Responsibilities

  • Platform teams are responsible for keeping clusters up and running, providing a reliable service to application teams.
  • Observability strategies should include proactive alerting, runbooks, and feedback loops to enable the continuous delivery process.

Observability Challenges

  • Deciding what to monitor and where to set alert thresholds is challenging, because a cluster can contain a large number of components and workloads.
  • Maintaining an aggregate view of all clusters, across accounts and regions, is important for managing the fleet at scale.

Cluster Inventory Management

  • Developer portals like Backstage can be used to provide a centralized view of all EKS clusters, including metadata, relationships, and deep links to other systems.

Governance

Ensuring Consistency at Scale

  • Policy-as-code engines such as OPA Gatekeeper and Kyverno can be used to enforce consistency and guardrails across the cluster fleet.

Policy Management Challenges

  • Keeping cluster upgrades on track by preventing deployments that use deprecated APIs or resources, which can block the rollout.
  • Ensuring application availability by enforcing safeguards such as pod disruption budgets.
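
The two checks above can be illustrated with a standalone Python sketch. In practice a policy engine evaluates rules like these at admission time; this example only shows the logic, operating on manifests already parsed into dictionaries. The `REMOVED_APIS` table lists a few API versions removed upstream in Kubernetes 1.22 and 1.25 and is not exhaustive; the function names and message formats are hypothetical.

```python
# A few API versions removed upstream, keyed by (apiVersion, kind).
REMOVED_APIS = {
    ("extensions/v1beta1", "Ingress"): "1.22",
    ("batch/v1beta1", "CronJob"): "1.25",
    ("policy/v1beta1", "PodDisruptionBudget"): "1.25",
}


def _parse_version(version):
    """Turn '1.25' into (1, 25) so versions compare numerically."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def find_violations(manifests, target_version):
    """Flag removed APIs and Deployments without a matching PDB."""
    target = _parse_version(target_version)
    violations = []
    pdb_selectors = [
        m.get("spec", {}).get("selector", {}).get("matchLabels", {})
        for m in manifests
        if m.get("kind") == "PodDisruptionBudget"
    ]
    for m in manifests:
        kind = m.get("kind")
        name = m.get("metadata", {}).get("name", "<unnamed>")
        removed_in = REMOVED_APIS.get((m.get("apiVersion"), kind))
        if removed_in and target >= _parse_version(removed_in):
            violations.append(
                f"{kind}/{name}: {m['apiVersion']} is removed in {removed_in}"
            )
        if kind == "Deployment":
            labels = (m.get("spec", {}).get("template", {})
                       .get("metadata", {}).get("labels", {}))
            # A PDB covers the Deployment if its selector labels are a
            # non-empty subset of the pod template labels.
            covered = any(
                sel and sel.items() <= labels.items() for sel in pdb_selectors
            )
            if not covered:
                violations.append(
                    f"Deployment/{name}: no matching PodDisruptionBudget"
                )
    return violations
```

Running a check like this against the target Kubernetes version before a wave starts gives an early signal that a workload would block the upgrade.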

Policy Management Patterns

  • Using a single Helm chart to deploy all policies, with the ability to enable/disable specific policies for different clusters or environments.
  • Handling exceptions by leveraging policy engine features like OPA's exceptions.
  • Aggregating policy violations using tools like Kyverno's Policy Reporter, integrating with security services like AWS Security Hub.
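
The single-chart pattern relies on layering per-cluster or per-environment overrides on top of shared defaults. The sketch below imitates that layering with a recursive dictionary merge, similar in spirit to how Helm combines values files. The policy names, the `enabled` and `mode` keys, and the helper functions are all hypothetical and not taken from any particular chart.

```python
import copy


def merge_values(base, override):
    """Recursively merge `override` into a copy of `base`."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def enabled_policies(values):
    """Return the sorted names of policies switched on in `values`."""
    policies = values.get("policies", {})
    return sorted(name for name, cfg in policies.items() if cfg.get("enabled"))


# Shared defaults shipped with the (hypothetical) policy chart.
BASE_VALUES = {
    "policies": {
        "require-pdb": {"enabled": True, "mode": "enforce"},
        "block-deprecated-apis": {"enabled": True, "mode": "enforce"},
        "restrict-host-path": {"enabled": False, "mode": "audit"},
    }
}

# Override for a development cluster: relax one policy, trial another.
DEV_OVERRIDE = {
    "policies": {
        "require-pdb": {"enabled": False},
        "restrict-host-path": {"enabled": True},
    }
}

if __name__ == "__main__":
    dev_values = merge_values(BASE_VALUES, DEV_OVERRIDE)
    print(enabled_policies(dev_values))
```

Keeping the defaults in one place and expressing each cluster only as a small override makes it easier to see which clusters deviate from the baseline and why.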

Additional Resources

  • EKS Workshop Hands-On Labs
  • EKS Best Practices Guide
  • EKS-related sessions at re:Invent 2023
  • GitHub repository with links to related resources
