Building production-grade resilient architectures with Amazon EKS (KUB404)
Platform Engineering on EKS
Platforms are built by platform engineers to provide cloud infrastructure as a service for application teams.
Platforms are typically organized around teams, applications, and infrastructure.
EKS adoption is growing: the number of clusters being managed is up 33% year over year.
Cluster Lifecycle Management
Unmanaged growth of EKS clusters can lead to challenges:
Difficulty in enforcing standards across the fleet of clusters
Automation challenges
Lack of a single source of truth
Addon management
Workload matching and cost optimization
Cluster Management Patterns
Platform teams are shifting from providing templates as a service to offering more managed services:
Cluster-as-a-Service
Namespace-as-a-Service
Application Deployment-as-a-Service
GitOps-Driven Cluster Management
Using GitOps for cluster management provides benefits like reduced complexity, enhanced visibility, and increased security.
The cluster's bill of materials includes the control plane, worker nodes, and addons, all of which can be managed through GitOps.
Argo CD can be used as the GitOps agent to reconcile the desired state with the actual state of the cluster.
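The reconcile idea above can be sketched in a few lines of Python. This is a minimal illustration of comparing desired state with actual state, not Argo CD's implementation; the `State` type, the function names, and the component versions are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical, simplified state: component name -> version string.
State = Dict[str, str]


def plan_reconcile(desired: State, actual: State) -> List[Tuple[str, str]]:
    """Return the (action, component) steps needed to make actual match desired."""
    steps: List[Tuple[str, str]] = []
    for name in sorted(desired):
        if name not in actual:
            steps.append(("create", name))
        elif actual[name] != desired[name]:
            steps.append(("update", name))
    for name in sorted(actual):
        if name not in desired:
            # Present in the cluster but absent from Git: remove it.
            steps.append(("prune", name))
    return steps


def apply_plan(desired: State, actual: State) -> State:
    """Apply the plan to a copy of the actual state and return the result."""
    new_state = dict(actual)
    for action, name in plan_reconcile(desired, actual):
        if action == "prune":
            del new_state[name]
        else:
            new_state[name] = desired[name]
    return new_state


# Illustrative values only; a real bill of materials would live in Git.
desired = {"control-plane": "1.30", "node-group": "1.30", "coredns": "v1.11.1"}
actual = {"control-plane": "1.30", "node-group": "1.29", "kube-proxy": "v1.29.0"}
plan = plan_reconcile(desired, actual)
print(plan)
```

Once the plan has been applied, a second call to `plan_reconcile` returns an empty list, which is the steady state a GitOps agent works toward.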
Cluster Resiliency and Upgrades
Upgrading clusters in batches requires safeguards to ensure resiliency and availability.
The EKS team uses a "cell" approach to upgrade clusters, where a "cell" represents a unit of work (e.g., a single cluster) that is upgraded in waves.
The time between waves (the "bake" or "soak" time) shrinks as later waves cover more cells, and a different level of testing is performed between waves.
This pattern can be applied to your own EKS clusters, with the GitOps-driven process used to orchestrate the rollout.
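One way to picture the wave pattern is a small planner that groups clusters into waves of growing size with shrinking bake time. This is a sketch under assumptions: the doubling factor, the 24-hour starting bake time, the halving decay, and every name in it are hypothetical defaults, not values from the talk.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Wave:
    clusters: List[str]  # cells upgraded together in this wave
    bake_hours: float    # soak time to wait after this wave


def plan_waves(
    clusters: List[str],
    first_wave_size: int = 1,
    growth: int = 2,
    first_bake_hours: float = 24.0,
    bake_decay: float = 0.5,
    min_bake_hours: float = 1.0,
) -> List[Wave]:
    """Split clusters into waves that grow in size while bake time shrinks."""
    if first_wave_size < 1 or growth < 1:
        raise ValueError("first_wave_size and growth must be at least 1")
    waves: List[Wave] = []
    size, bake, start = first_wave_size, first_bake_hours, 0
    while start < len(clusters):
        waves.append(Wave(clusters[start:start + size], bake))
        start += size
        size *= growth
        # Never drop below the floor, so every wave still gets some soak time.
        bake = max(min_bake_hours, bake * bake_decay)
    return waves


fleet = [f"cluster-{i:02d}" for i in range(10)]
for number, wave in enumerate(plan_waves(fleet), start=1):
    print(f"wave {number}: {len(wave.clusters)} cluster(s), bake {wave.bake_hours}h")
```

With ten clusters and these defaults the plan has four waves of 1, 2, 4, and 3 clusters, with bake times of 24, 12, 6, and 3 hours. In a GitOps setup, each wave could correspond to a commit that bumps the version for that group of clusters.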
Observability
Roles and Responsibilities
Platform teams are responsible for keeping clusters up and running, providing a reliable service to application teams.
Observability strategies should include proactive alerting, runbooks, and feedback loops to enable the continuous delivery process.
Observability Challenges
Deciding what to monitor and where to set alert thresholds is difficult, because a cluster can contain a large number of components and workloads.
Maintaining an aggregate view of all clusters, across accounts and regions, is important for managing the fleet at scale.
Cluster Inventory Management
Developer portals like Backstage can be used to provide a centralized view of all EKS clusters, including metadata, relationships, and deep links to other systems.
Governance
Ensuring Consistency at Scale
Policy-as-code engines such as OPA Gatekeeper and Kyverno can be used to enforce consistency and guardrails across the cluster fleet.
Policy Management Challenges
Keeping cluster upgrades on track by preventing deployments of deprecated APIs or resources that can block the rollout.
Ensuring application availability by enforcing requirements such as pod disruption budgets.
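The first challenge above can be sketched as a check that flags manifests using an API version removed at or before the target Kubernetes version. In practice a policy engine would enforce this at admission time; the standalone function below is only an illustration, and its names are hypothetical. The removal table is a small subset taken from the Kubernetes deprecated API migration guide.

```python
from typing import Dict, List, Tuple

# (apiVersion, kind) -> Kubernetes version in which that API is no longer served.
REMOVED_APIS: Dict[Tuple[str, str], Tuple[int, int]] = {
    ("extensions/v1beta1", "Ingress"): (1, 22),
    ("policy/v1beta1", "PodDisruptionBudget"): (1, 25),
    ("batch/v1beta1", "CronJob"): (1, 25),
    ("autoscaling/v2beta2", "HorizontalPodAutoscaler"): (1, 26),
}


def parse_version(version: str) -> Tuple[int, int]:
    """Turn '1.25' into (1, 25) so versions compare numerically."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def find_removed_apis(manifests: List[dict], target_version: str) -> List[str]:
    """List manifests whose API version is gone in the target version."""
    target = parse_version(target_version)
    violations: List[str] = []
    for manifest in manifests:
        key = (manifest.get("apiVersion"), manifest.get("kind"))
        removed_in = REMOVED_APIS.get(key)
        if removed_in is not None and target >= removed_in:
            name = manifest.get("metadata", {}).get("name", "<unnamed>")
            violations.append(
                f"{key[1]}/{name} uses {key[0]}, removed in {removed_in[0]}.{removed_in[1]}"
            )
    return violations


# Hypothetical manifests, already parsed into dictionaries.
manifests = [
    {"apiVersion": "policy/v1beta1", "kind": "PodDisruptionBudget", "metadata": {"name": "web"}},
    {"apiVersion": "policy/v1", "kind": "PodDisruptionBudget", "metadata": {"name": "api"}},
]
for violation in find_removed_apis(manifests, "1.25"):
    print(violation)
```

Running the same check against the upgrade target before the rollout starts surfaces blockers while application teams still have time to migrate.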
Policy Management Patterns
Using a single Helm chart to deploy all policies, with the ability to enable/disable specific policies for different clusters or environments.
Handling exceptions by leveraging policy engine features like OPA's exceptions.
Aggregating policy violations using tools like Kyverno's Policy Reporter, integrating with security services like AWS Security Hub.
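The single-chart pattern amounts to layering per-cluster overrides on top of shared defaults, much as Helm merges values files. The sketch below models that layering in plain Python; the policy names, cluster names, and function are hypothetical and do not come from any real chart.

```python
from typing import Dict, List

# Hypothetical chart defaults: policy name -> enabled by default.
DEFAULT_POLICIES: Dict[str, bool] = {
    "require-pod-disruption-budget": True,
    "block-removed-apis": True,
    "require-resource-limits": False,
}

# Hypothetical per-cluster overrides, as they might appear in a values file.
CLUSTER_OVERRIDES: Dict[str, Dict[str, bool]] = {
    "dev-sandbox": {"require-pod-disruption-budget": False},
    "prod-payments": {"require-resource-limits": True},
}


def enabled_policies(cluster: str) -> List[str]:
    """Merge defaults with the cluster's overrides and list enabled policies."""
    overrides = CLUSTER_OVERRIDES.get(cluster, {})
    unknown = set(overrides) - set(DEFAULT_POLICIES)
    if unknown:
        # Catch typos in override keys instead of silently ignoring them.
        raise ValueError(f"unknown policies for {cluster}: {sorted(unknown)}")
    merged = {**DEFAULT_POLICIES, **overrides}
    return sorted(name for name, enabled in merged.items() if enabled)


for cluster in ("dev-sandbox", "prod-payments", "staging"):
    print(cluster, enabled_policies(cluster))
```

A cluster with no overrides, such as `staging` here, simply receives the defaults, which keeps the fleet consistent unless someone records an explicit exception.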