Scaling GPU infrastructure and using LLMs for Roblox's metaverse (GAM312)
GPU Scheduling in an EKS Cluster Environment
Introduction
The presenter, Denis Goupil, will discuss GPU scheduling in an Amazon EKS (Elastic Kubernetes Service) cluster environment.
The talk will cover the context of AI and Roblox, the problem they faced, and how they solved it.
Roblox and AI Platform Context
Roblox is a creation platform for immersive 3D experiences, where a community of creators builds experiences for players.
Roblox's AI use cases include safety (text and voice filtering, bad-actor and bot detection, abuse reports), creator tools (documentation and code assist, generative AI features), and player recommendations (homepage, friends, search, and discovery).
Roblox's goal is to democratize AI by providing a "golden path" from idea to production, where ML engineers and data scientists can access GPUs interactively (from notebooks and Visual Studio) and for long-running jobs (training, batch inference).
GPU Scheduling Issues
Kubernetes was not designed with AI workloads in mind: its default scheduler spreads pods across nodes, whereas GPU workloads should be packed tightly together.
Not all GPU types are equal (A100s and H100s differ widely in capability and cost), but Kubernetes treats them as the same generic GPU resource.
Teams work under budget and cost constraints, and cloud providers no longer have effectively unlimited GPU capacity available.
GPU Scheduling v1 (GPU as Pets)
Capacity reservation on AWS to ensure the availability of A100 and H100 GPUs.
"Instance owner" nodes, where teams pay for specific nodes and get guaranteed access to those GPU resources.
Kubernetes scheduler changes to score nodes as "Most Allocated" (pack onto the fullest node that fits) rather than spreading workloads across nodes; see the scoring sketch after this list.
Custom PostFilter plugin for preemption, allowing other teams to use idle GPU resources on owned nodes and be preempted when the owner needs them back; see the preemption sketch after this list.
Tiering of GPU requests (Tier-0, Tier-1, Tier-2) to prioritize paying customers and manage preemption.
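A minimal sketch of the bin-packing idea behind "Most Allocated" scoring, using a simplified view of a node as allocatable vs. already-requested GPUs; the Node class and the numbers are illustrative assumptions, not Roblox's actual scheduler configuration.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    gpus_allocatable: int
    gpus_requested: int  # GPUs already claimed by running pods

def most_allocated_score(node: Node, pod_gpus: int) -> float:
    """Score a node higher the fuller it would be after placing the pod.

    This favors packing GPU pods onto already-busy nodes (bin-packing),
    the opposite of the default spreading behavior.
    """
    free = node.gpus_allocatable - node.gpus_requested
    if pod_gpus > free:
        return -1.0  # node cannot fit the pod at all
    return (node.gpus_requested + pod_gpus) / node.gpus_allocatable

nodes = [
    Node("gpu-node-a", gpus_allocatable=8, gpus_requested=6),
    Node("gpu-node-b", gpus_allocatable=8, gpus_requested=1),
]

# A 2-GPU pod lands on the fuller node, leaving gpu-node-b easier to free up.
best = max(nodes, key=lambda n: most_allocated_score(n, pod_gpus=2))
print(best.name)  # gpu-node-a
```

This is roughly the behavior the upstream NodeResourcesFit plugin provides with a MostAllocated scoring strategy, applied here only to the GPU dimension for brevity.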
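And a sketch of how tier-based preemption might pick victims when a higher-priority (lower tier number) pod needs GPUs on an owned node. The pod names, tiers, and the "evict highest tier first, all-or-nothing" rule are illustrative assumptions, not the actual PostFilter plugin.

```python
from dataclasses import dataclass

@dataclass
class RunningPod:
    name: str
    tier: int   # 0 = paying owner, 1/2 = opportunistic borrowers
    gpus: int

def pick_preemption_victims(pods, incoming_tier: int, gpus_needed: int):
    """Evict lower-priority (higher tier number) pods first, freeing just enough GPUs."""
    victims, freed = [], 0
    # Consider the most preemptible pods first: highest tier, then largest GPU footprint.
    for pod in sorted(pods, key=lambda p: (-p.tier, -p.gpus)):
        if freed >= gpus_needed:
            break
        if pod.tier > incoming_tier:  # never preempt same- or higher-priority tiers
            victims.append(pod)
            freed += pod.gpus
    return victims if freed >= gpus_needed else []  # all-or-nothing

running = [
    RunningPod("tier2-batch", tier=2, gpus=4),
    RunningPod("tier1-training", tier=1, gpus=2),
    RunningPod("tier0-inference", tier=0, gpus=2),
]
print([p.name for p in pick_preemption_victims(running, incoming_tier=0, gpus_needed=4)])
# ['tier2-batch']
```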
Limitations of GPU Scheduling v1
Low efficiency, because capacity reservations and the buffer kept free for Tier-0 workloads often sat idle.
Preemption could happen even when free GPU resources were available elsewhere in the cluster.
Inability to bin-pack Tier-0 workloads efficiently.
High operational cost from investigating pending pods and fielding help requests from teams.
GPU Scheduling v2 (GPU as Cattle)
Better monitoring to provide visibility into pending times, preemption, and tiering.
Moved away from instance-owner nodes to a shared pool of GPU resources.
Quota as code, defined per GPU type, to manage resources dynamically; see the quota sketch after this list.
Enforced good behavior, such as CPU and memory requests proportional to the number of GPUs requested (see the webhook sketch after this list).
Moved to Apache YuniKorn for queueing and scheduling, with a custom webhook to manage annotations and policies.
Introduced "culling" of idle workloads (notebooks, batch, and training) to save costs.
Results of GPU Scheduling v2
Improved bin-packing efficiency, reducing the number of nodes needed to serve Tier-0 workloads.
Significant reduction in pending time, with Tier-1 and Tier-2 workloads taking up most of the GPU resources.
No more preemption due to the availability of free nodes for Tier-0 workloads.
Ability to fulfill new resource requests immediately, without having to wait for AWS to provision new machines.
Next Steps
Optimize capacity reservations by sizing them to peak concurrent usage instead of the sum of all teams' requests (see the sizing sketch after this list).
Improve GPU efficiency by closing the gap between GPUs that are available, GPUs that are allocated, and GPUs that are actually utilized.
Implement dynamic rescheduling to bin-pack workloads in a more efficient, on-demand manner.
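A toy illustration of the capacity-reservation idea: reserving the sum of every team's requested GPUs over-provisions, while reserving for the observed peak of concurrent usage is enough. The usage samples and team names here are made up.

```python
# Hourly GPUs-in-use samples per team over part of a day (illustrative numbers).
usage = {
    "ml-recs":   [10, 12, 30, 8, 6],
    "ml-safety": [5, 20, 4, 18, 2],
    "ml-assist": [2, 2, 25, 3, 1],
}

requested = {"ml-recs": 32, "ml-safety": 24, "ml-assist": 28}

sum_of_requests = sum(requested.values())
# Peak concurrent usage: the largest cluster-wide total at any single sample.
peak_concurrent = max(sum(samples) for samples in zip(*usage.values()))

print(sum_of_requests)   # 84 GPUs reserved if sized on requests
print(peak_concurrent)   # 59 GPUs at the busiest hour -> a smaller reservation suffices
```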