Driving cost optimization at scale (MAM236)

Driving Cost Optimization at Scale with AWS Graviton

Key Takeaways

  1. Importance of Cost Optimization: Cloud infrastructure is expensive to run, and SaaS providers like Datadog need to optimize costs in order to pass savings on to their customers.
  2. Datadog's Commitment to Cost Savings: Datadog tracks performance wins, which together saved the company $17 million over the past year; individual wins ranged from $80,000 to $4 million.
  3. Leveraging AWS Graviton: Datadog found that ARM-based Graviton instances can use up to 60% less energy than comparable EC2 instances, often while providing better performance at a lower cost.
  4. Datadog's ARM Migration Program: Datadog's CTO challenged the company to migrate 100% of their AWS workloads to ARM by the end of 2023, with exceptions documented and approved.
  5. Key Lessons Learned:
    • Iterate early and often to refine the program and gather feedback.
    • Reduce friction for engineers by providing clear documentation and support.
    • Make the program's progress and impact as visible as possible across the organization.

Program Timeline

  1. February 2023: Datadog's CTO issued the challenge to migrate 100% of their AWS workloads to ARM by the end of 2023.
  2. March-April 2023: Datadog's program team gathered resources and built the necessary infrastructure (Confluence pages, dashboards, custom metrics, etc.).
  3. May 2023: The program officially launched, with the FinOps team responsible for tracking the migration and building solutions.
  4. June 2023: Datadog encouraged engineers to prioritize the ARM migration as part of their Q3 OKRs.
  5. July 2023: Datadog's leadership team was engaged to help increase velocity and drive adoption towards the 100% goal.
  6. September 2023: Datadog reviewed the program's progress, finding a significant uptick in adoption that could be directly attributed to the program's efforts.
  7. December 2023: Datadog reached about 61% ARM adoption, with exceptions bringing the total to around 82%. The program was then transitioned to the FinOps team for continued tracking and optimization.
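
The two percentages in the December milestone can be read as two related metrics: the share of the fleet already on ARM, and that share plus workloads covered by an approved exception. The sketch below shows one way such numbers might be computed from a fleet inventory. It is an illustration under stated assumptions, not Datadog's actual methodology: the Workload record, the choice to weight by vCPUs, and the sample fleet are all hypothetical. The only external fact it relies on is the AWS instance naming convention, in which Graviton instance types carry a "g" among the letters after the generation number (m6g, c7gn, r6gd), with a1 as the first-generation exception.

```python
import re
from dataclasses import dataclass

# AWS naming convention: family letters, generation digit(s), then attribute
# letters. A "g" among the attributes marks Graviton (m6g, c7gn, r6gd, g5g).
# The first-generation Graviton family, a1, has no attribute letters.
_GRAVITON = re.compile(r"^(?:a1|[a-z]+\d+[a-z]*g[a-z]*)\.")


def is_graviton(instance_type: str) -> bool:
    """Return True if the instance type name follows the Graviton naming pattern."""
    return bool(_GRAVITON.match(instance_type.lower()))


@dataclass
class Workload:
    """Hypothetical inventory record; field names are assumptions."""
    name: str
    instance_type: str
    vcpus: int                        # total vCPUs used by this workload
    approved_exception: bool = False  # documented and approved exception


def adoption_summary(workloads):
    """Compute vCPU-weighted ARM adoption, with and without exceptions."""
    total = sum(w.vcpus for w in workloads)
    if total == 0:
        return {"arm_pct": 0.0, "arm_plus_exceptions_pct": 0.0}
    arm = sum(w.vcpus for w in workloads if is_graviton(w.instance_type))
    excepted = sum(
        w.vcpus
        for w in workloads
        if w.approved_exception and not is_graviton(w.instance_type)
    )
    return {
        "arm_pct": 100 * arm / total,
        "arm_plus_exceptions_pct": 100 * (arm + excepted) / total,
    }


# Made-up sample fleet, for illustration only.
FLEET = [
    Workload("intake", "m6g.4xlarge", vcpus=40),
    Workload("query", "c7g.8xlarge", vcpus=20),
    Workload("legacy-batch", "m5.4xlarge", vcpus=20, approved_exception=True),
    Workload("gpu-free-x86", "c5.4xlarge", vcpus=20),
]

if __name__ == "__main__":
    print(adoption_summary(FLEET))
```

Weighting by vCPUs rather than by instance count keeps a handful of very large instances from being hidden behind many small ones; a cost-weighted variant would follow the same shape.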

Lessons Learned

  1. Iterate Early and Often: Datadog's program team refined their savings methodology and data tracking based on feedback from engineers and other platform teams.
  2. Reduce Friction: Datadog provided clear documentation, templates, and support to make the migration process as easy as possible for engineers.
  3. Make Progress Visible: Datadog built dashboards and metrics to track the program's progress, engaging leadership and engineers at all levels.

Next Steps

  1. Clear the Exception Backlog: Datadog will continue working to unblock the remaining exceptions, many of which are due to service deprecation.
  2. Continuously Optimize Instance Types: Datadog will help teams upgrade to newer, more cost-efficient Graviton instances over time.
  3. Maintain Visibility and Momentum: Datadog will keep the ARM migration process visible and continue supporting engineers in their efforts.

Resources

  • Datadog's State of Cloud Cost Report
  • Article: "Migrating Datadog's Kubernetes Fleet on AWS to ARM"
  • Past Talks:
    • re:Invent 2021 - "Lessons Learned Migrating Datadog to ARM"
    • KubeCon 2022 - "Transitioning to ARM: Lessons Learned from Datadog's Journey"
    • KubeCon 2023 - "Driving Cost Optimization at Scale with ARM"
