Optimizing GPU and CPU utilization for cost savings and performance (COP360)

Highlights and Key Takeaways

Understanding User Behavior and Optimization Strategies

  • Casinos and other organizations use various techniques like colorful, zigzag carpets, lights, and sounds to keep users engaged and incentivized to spend more time and money within their ecosystem.
  • AWS and Capital One employ similar strategies to keep users engaged and motivated to optimize their cloud usage, but with the goal of driving efficiency and cost savings rather than maximizing spending.

Measuring and Optimizing Traditional Compute Workloads

  • Capital One uses tools like CoreMark to benchmark the performance of different EC2 instance types and make more informed instance selection decisions.
  • They found that larger instance sizes may not always provide the expected performance improvements due to factors like NUMA boundaries and physical hardware limitations.
  • Providing context around performance improvements with each instance generation helps developers make more informed decisions when selecting instance types.
  • Analyzing application-level details, such as the programming languages and libraries used, can uncover optimization opportunities beyond just the infrastructure layer.
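The benchmarking bullets above can be illustrated with a small price-performance comparison. This is a minimal sketch, not Capital One's tooling: the instance names, scores, and prices are placeholder values invented for illustration, and `InstanceBenchmark`, `score_per_dollar`, `score_per_vcpu`, and `rank_by_value` are hypothetical helpers. It assumes a CoreMark-style score has already been collected for each candidate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceBenchmark:
    name: str              # hypothetical label, not a real EC2 instance type
    vcpus: int
    coremark_score: float  # placeholder score, not a measured result
    hourly_cost: float     # placeholder price in USD per hour


def score_per_dollar(b: InstanceBenchmark) -> float:
    """Benchmark score obtained per dollar of hourly spend."""
    return b.coremark_score / b.hourly_cost


def score_per_vcpu(b: InstanceBenchmark) -> float:
    """Per-vCPU score; a drop at larger sizes hints at scaling limits."""
    return b.coremark_score / b.vcpus


def rank_by_value(benchmarks):
    """Sort candidates from best to worst score per dollar."""
    return sorted(benchmarks, key=score_per_dollar, reverse=True)


# Illustrative numbers only: the largest size scales sub-linearly.
CANDIDATES = [
    InstanceBenchmark("example.large", 2, 40_000.0, 0.10),
    InstanceBenchmark("example.4xlarge", 16, 300_000.0, 0.80),
    InstanceBenchmark("example.16xlarge", 64, 1_000_000.0, 3.20),
]

if __name__ == "__main__":
    for b in rank_by_value(CANDIDATES):
        print(f"{b.name}: {score_per_dollar(b):,.0f} score/$, "
              f"{score_per_vcpu(b):,.0f} score/vCPU")
```

Tracking score per vCPU next to score per dollar is one way to surface the effect described above, where a larger size costs proportionally more but does not deliver proportionally more performance.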

Optimizing GPU-Powered Workloads

  • The paradigm for measuring efficiency shifts when dealing with GPU-powered workloads, as the GPU becomes the primary "workhorse" rather than the CPU.
  • Metrics like GPU temperature, power consumption, and utilization of the streaming multiprocessors (SMs, the units that contain the CUDA cores) provide more meaningful insights into GPU performance and utilization.
  • Identifying mismatches between GPU utilization and power consumption can uncover opportunities to optimize the workload or the way it's architected.
  • Developing a deep understanding of the workloads and providing the right context and tooling to developers is key to driving efficient GPU usage.
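The utilization-versus-power mismatch mentioned above can be sketched as a simple check over telemetry samples. This is an illustrative sketch under stated assumptions, not the method presented in the session: `GpuSample` and `classify` are hypothetical names, the 80% and 50% thresholds are arbitrary starting points, and the samples are assumed to come from whatever GPU telemetry source is already in place.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class GpuSample:
    utilization_pct: float    # reported GPU utilization, 0-100
    power_watts: float        # measured power draw
    power_limit_watts: float  # configured power limit for the device


def classify(samples, busy_util_pct=80.0, low_power_fraction=0.5):
    """Label a workload from averaged samples.

    Thresholds are illustrative assumptions, not recommended values.
    """
    if not samples:
        raise ValueError("need at least one sample")
    avg_util = mean(s.utilization_pct for s in samples)
    avg_power = mean(s.power_watts / s.power_limit_watts for s in samples)
    if avg_util >= busy_util_pct and avg_power < low_power_fraction:
        # The GPU looks busy but draws little power: worth a closer look.
        return "mismatch"
    if avg_util < busy_util_pct:
        return "underutilized"
    return "consistent"


if __name__ == "__main__":
    # Made-up samples for demonstration only.
    window = [GpuSample(95.0, 120.0, 400.0), GpuSample(90.0, 130.0, 400.0)]
    print(classify(window))
```

A "mismatch" label is only a prompt for investigation. Reported utilization typically reflects how much of the time a kernel was running, not how much of the device it occupied, so a workload can look busy while leaving most of the hardware idle.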

Building Trust and Driving Adoption

  • Capital One's approach focuses on building trust with their internal development teams by providing the right data, context, and recommendations to enable them to make informed decisions.
  • Incentivizing and "gamifying" optimization efforts, such as using cupcakes as rewards, helps drive broader adoption and engagement.
  • Continuous improvement of their tooling and the ability to surface the right insights are crucial to maintaining a positive feedback loop with their users.

Conclusion

Capital One's FinOps team combines benchmarking, data analysis, and user engagement strategies to drive efficiency across both traditional compute and GPU-powered workloads. By giving developers the right context and tools, the team aims to build trust and enable informed decision-making, leading to more efficient and sustainable cloud usage.
