AWS re:Invent 2025 - Under the hood: Architecting Amazon EKS for scale and performance (CNS429)

Architecting Amazon EKS for Scale and Performance

Kubernetes Adoption and the Need for Managed Services

  • Kubernetes has revolutionized application building and deployment, with 93% of companies either running it in production, evaluating, or piloting it.
  • The declarative nature of Kubernetes makes infrastructure management easier, making it the de facto standard for cloud-native environments.
  • However, managing Kubernetes at scale, with hundreds of clusters in production environments, poses a significant challenge.
  • Amazon EKS, a CNCF-certified and fully managed Kubernetes service, offloads this operational burden from customers, allowing them to focus on their business applications.

Amazon EKS: A Journey of Managed Kubernetes on AWS

  • Amazon EKS was launched at re:Invent 2018, and since then it has introduced multiple features, including managed control plane capability, add-ons, IPv6 clusters, Auto Mode, and hybrid nodes.
  • One of the most notable achievements is the launch of the Karpenter project, an open-source node autoscaler that originated at AWS and has since been donated to the CNCF.

Powering AI/ML Workloads with Amazon EKS

  • Amazon EKS enables a diverse range of workloads, including web applications, data pipelines, and AI/ML workloads.
  • AI/ML workloads have unique characteristics, such as dependency on vast amounts of structured and unstructured data, high compute intensity, high-bandwidth low-latency networking, and parallel read/write storage solutions.
  • Gartner predicts that by 2028, 90-95% of new AI deployments will use Kubernetes, up from less than 30% today, highlighting the growing importance of Kubernetes and Amazon EKS for AI/ML workloads.

Why Customers Choose Amazon EKS for AI/ML

  1. Upstream Conformance and Tooling Ecosystem: EKS provides a vibrant open-source ecosystem with all the necessary tooling integrations.
  2. Unparalleled Customization Capabilities: Organizations have granular control over the underlying infrastructure, with access to a wide range of Amazon EC2 instance types.
  3. Extensibility and Integrations: EKS offers integrations with a broad variety of AWS services, from compute to storage and networking, allowing customers to run workloads anywhere.
  4. High Scalability: EKS enables rapid scale-in and scale-out with governance control, and the Karpenter project supports cost optimization through efficient compute management.

Building the Right Foundation with Amazon EKS

Amazon EKS Control Plane Architecture

  • The control plane architecture includes API server instances across two availability zones and an etcd data store spread across three availability zones, with security built into the foundation.
  • EKS implements a deterministic resiliency approach, which ensures predictable behavior during failures, a seamless API experience, and consistent API latencies.

Scaling the Amazon EKS Control Plane

  • EKS acts on multiple signals, including CPU, memory, node count, etcd database size, and API server performance metrics, to scale the control plane.
  • Recent improvements include parallel scaling of the API server and etcd, blue-green API server deployments, and intelligent and conservative global scaling using progressively larger instances and smart cooldown periods.

Amazon EKS Data Plane Options

  • EKS supports a range of data plane options, from self-managed node groups to fully managed Auto Mode node pools, as well as Karpenter node pools and hybrid nodes for bringing your own infrastructure.
  • The Karpenter project, an open-source node autoscaler that originated at AWS, provides flexibility in choosing instance types, including On-Demand and Spot Instances, and supports GPUs and capacity reservations.
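The flexibility described above is expressed declaratively. As an illustration only, a minimal Karpenter v1 NodePool that allows both Spot and On-Demand capacity might look like the sketch below (the names `default` and `my-node-class` are placeholders; verify field names against the current Karpenter documentation):

```yaml
# Hedged sketch of a Karpenter v1 NodePool allowing Spot and On-Demand capacity.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default                      # placeholder name
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
      nodeClassRef:                  # references an EC2NodeClass defined separately
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: my-node-class          # placeholder name
  limits:
    cpu: "1000"                      # cap total provisioned CPU for cost governance
```

Karpenter then launches right-sized instances that satisfy these requirements as pods become unschedulable, which is how it supports the cost optimization mentioned above.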

Accelerating Innovation with Amazon EKS

Recent Feature Enhancements

  • Parallel OCI pull: Accelerates image loading and unpacking using HTTP range requests to fetch image layers in parallel.
  • Capacity Block reservations: Supports reserving accelerated compute capacity for defined time windows.
  • Accelerated AMIs: Combine optimized drivers and runtime components for GPUs and AI accelerators, reducing setup time.
  • Kubernetes Device Plugin: Enables fine-grained sharing and allocation of GPU resources across multiple AI workloads and pods.
  • Mountpoint for Amazon S3 CSI driver: Provides direct access to training data and model artifacts stored in S3, reducing data loading time.
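The parallel-pull idea in the first bullet can be sketched in a few lines of Python. This is not the actual EKS/containerd implementation; it only illustrates how a layer blob can be split into HTTP byte ranges and fetched concurrently (the `fetch` callable stands in for a ranged HTTP GET):

```python
from concurrent.futures import ThreadPoolExecutor

def byte_ranges(total_size: int, chunk_size: int) -> list[tuple[int, int]]:
    """Split a blob of total_size bytes into inclusive (start, end) pairs,
    as used in HTTP 'Range: bytes=start-end' headers."""
    return [
        (start, min(start + chunk_size, total_size) - 1)
        for start in range(0, total_size, chunk_size)
    ]

def parallel_pull(fetch, total_size: int, chunk_size: int, workers: int = 8) -> bytes:
    """Fetch all ranges of a blob concurrently and reassemble it in order.

    'fetch(start, end)' stands in for an HTTP GET with a Range header, e.g.
    requests.get(url, headers={"Range": f"bytes={start}-{end}"}).content
    """
    ranges = byte_ranges(total_size, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda r: fetch(r[0], r[1]), ranges)
    return b"".join(parts)  # map() preserves range order
```

For example, simulating the remote blob with an in-memory byte string, `parallel_pull(lambda s, e: blob[s:e + 1], len(blob), 100)` reassembles `blob` exactly.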

Day 2 Operations: Node Health and Auto Repair

  • EKS continuously monitors node health and automatically repairs unhealthy nodes, gracefully evicting pods before a node is replaced.
  • Container Insights provides comprehensive observability into EKS AI/ML workloads.
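As a hedged sketch, node auto repair can be enabled on a managed node group via an eksctl ClusterConfig along the lines below (field names as I recall them from the eksctl schema; the cluster, region, and node group names are placeholders, so verify against current eksctl documentation before use):

```yaml
# Hedged sketch: eksctl ClusterConfig enabling EKS node auto repair
# on a managed node group.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: demo-cluster        # placeholder name
  region: us-west-2         # placeholder region
managedNodeGroups:
  - name: ml-nodes          # placeholder name
    instanceType: m5.large
    desiredCapacity: 3
    nodeRepairConfig:
      enabled: true         # let EKS replace nodes that fail health checks
```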

Architecting AI/ML Workloads on Amazon EKS

  • Customers leverage EKS to build end-to-end ML pipelines, including analysis, model development, training, evaluation, and deployment, integrating with various AWS services.
  • EKS supports multiple model frameworks and MLOps workflows, enabling model training, evaluation, and serving on both Nvidia GPUs and AWS Neuron architecture.

Introducing Amazon EKS Ultra Scale Clusters

Addressing Key Challenges of AI/ML Workloads

  • Massive coordinated compute required for training, needing low-latency and high-bandwidth coordination across thousands of instances.
  • Difficulty in managing frameworks and mapping across different clusters, leading to increased operational overhead.
  • Customers seek reduced operational overhead, simplified cluster management, and shared governance to improve cost efficiency and resource utilization.

Amazon EKS Ultra Scale Clusters

  • Amazon EKS Ultra Scale Clusters support up to 100,000 nodes in a single cluster, enabling the management of 800,000 Nvidia GPUs or 1.6 million AWS Trainium chips.
  • Key architectural innovations include:
    • Offloading consensus to a purpose-built multi-AZ transaction journal
    • Leveraging in-memory databases for higher read/write throughput
    • Intelligent partitioning of the etcd key-value store
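The headline accelerator counts follow directly from per-node arithmetic. Assuming 8 GPUs per GPU node and 16 Trainium chips per Trainium node (assumptions consistent with the figures quoted in the talk, not stated explicitly in it):

```python
# Sanity-check the quoted ultra-scale totals from assumed per-node counts.
MAX_NODES = 100_000
GPUS_PER_NODE = 8          # assumption: 8 GPUs per GPU node
TRAINIUM_PER_NODE = 16     # assumption: 16 Trainium chips per Trainium node

gpus = MAX_NODES * GPUS_PER_NODE
trainium = MAX_NODES * TRAINIUM_PER_NODE
print(f"{gpus:,} GPUs, {trainium:,} Trainium chips")
```

This reproduces the 800,000 GPU and 1.6 million Trainium chip figures cited for a single 100,000-node cluster.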

Performance and Scalability of Amazon EKS Ultra Scale

  • Sustained high throughput for API requests (7,500 reads/sec, 8,000-9,000 writes/sec) and low latency (100ms to 1s at P99).
  • Ability to manage tens of millions of objects, including 8 million pods, 100,000 nodes, 6 million leases, and tens of millions of events.
  • Support for an etcd database size of up to 20 GB, 2.5x that of standard EKS clusters.

Introducing Amazon EKS Provisioned Control Plane

Motivations and Key Features

  • Allows customers to proactively select a performance tier that matches their business needs, instead of relying on reactive scaling.
  • Introduces three new performance tiers (XL, 2XL, 4XL) with significantly higher API request concurrency, pod scheduling rate, and database size compared to standard clusters.
  • Enables temporary scaling up to higher tiers for specific events or deployments, and scaling back down to optimize for both performance and cost.

Monitoring and Optimizing Provisioned Control Plane

  • Provides real-time metrics on API request concurrency, pod scheduling rate, and database size utilization to help customers monitor and optimize their control plane performance.
  • Allows customers to proactively scale up to higher tiers before critical events and scale back down when the demand subsides.

Customer Perspective: Anthropic's Experience with Amazon EKS

Anthropic's AI/ML Workloads and Kubernetes Architecture

  • Anthropic runs over 99% of its compute workloads on Amazon EKS, leveraging the platform for a wide range of AI/ML applications, including model training, inference, and robotics development.
  • Anthropic has implemented custom scheduling and resource management solutions, such as Cgrapher, to optimize the scheduling of large-scale batch workloads.
  • Anthropic has also optimized various components, including DNS, container image pulling, and storage, to address the unique challenges of running AI/ML workloads at scale on Kubernetes.

Future Improvements and Wishlist

  • Anthropic is excited about upcoming Kubernetes features, such as namespace controllers, multi-VPC architectures for ultra-scale clusters, Karpenter support for capacity reservations, and the transition to IPv6 for a scalable, flat network.

Key Takeaways

  • Amazon EKS has become the trusted platform for running Kubernetes at scale, powering a diverse range of workloads, including complex AI/ML applications.
  • EKS provides a robust control plane architecture with deterministic resiliency and intelligent scaling capabilities to support high-performance and highly scalable Kubernetes environments.
  • The introduction of Amazon EKS Ultra Scale Clusters and Provisioned Control Plane features demonstrates AWS's commitment to addressing the unique challenges of running large-scale AI/ML workloads on Kubernetes.
  • Customers like Anthropic have successfully leveraged EKS to build scalable, cost-efficient, and operationally efficient AI/ML platforms, with plans to further optimize their Kubernetes architectures using upcoming Kubernetes features.
