Control the cost of your generative AI services (COP203)

AI Cost Optimization Journey

Exploring the AI Cost Optimization Journey

  • Last year, there was a lot of AI buzz, and teams were asked to estimate the cost of building AI applications.
  • Now, the focus has shifted to optimizing the costs of the AI applications that have been built.
  • The speakers, Alex, Brent, and Adam, will discuss different approaches to building and optimizing AI applications.

Cosmo - The Dream Interpreting AI Companion

  • Cosmo is used as an example throughout the discussion to illustrate the different cost optimization strategies.
  • Cosmo has three key steps in its operation:
    1. Breaks down the request into tokens (bite-sized pieces of text)
    2. Accesses knowledge sources (e.g., dream guides, color science, castle architecture)
    3. Generates different outputs (e.g., quick response, blueprint, visual guide)
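Since billing for generative AI is typically per token, the first step above (breaking a request into tokens) is also the first lever for cost estimation. A minimal sketch, using the common ~4-characters-per-token rule of thumb and hypothetical per-token prices (not real list prices):

```python
# Rough token-count and cost estimate for a single request to a model
# like Cosmo. The chars-per-token heuristic and the per-1K-token
# prices below are illustrative assumptions, not published rates.

CHARS_PER_TOKEN = 4          # common rule-of-thumb approximation
PRICE_PER_1K_INPUT = 0.003   # hypothetical $/1K input tokens
PRICE_PER_1K_OUTPUT = 0.015  # hypothetical $/1K output tokens

def estimate_tokens(text: str) -> int:
    """Approximate token count from character length."""
    return max(1, len(text) // CHARS_PER_TOKEN)

def estimate_request_cost(prompt: str, expected_output_chars: int) -> float:
    """Estimate dollar cost of one request: input plus generated output."""
    input_tokens = estimate_tokens(prompt)
    output_tokens = max(1, expected_output_chars // CHARS_PER_TOKEN)
    return ((input_tokens / 1000) * PRICE_PER_1K_INPUT
            + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT)

prompt = "I dreamed of a purple castle floating above the sea."
print(estimate_tokens(prompt))
print(round(estimate_request_cost(prompt, 2000), 6))
```

Multiplying this per-request figure by expected request volume gives a first-order monthly budget before any optimization.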

Key Metrics for AI Application Success

  1. Response Speed: How quickly Cosmo responds to a request.
  2. Resource Efficiency: The cost of running Cosmo.
  3. Reliability and Flexibility: Cosmo's availability and ability to be used in different regions.
  4. Quality and Depth: The quality and depth of Cosmo's responses.

Self-Managed AI Approach

Building Cosmo in a Self-Managed Environment

  • Components required: accelerated compute (e.g., EC2 instances) and your own model; you are responsible for the software, configurations, training, tuning, and inference.
  • Key considerations:
    1. Instance Selection: Choosing the right instance type (e.g., general-purpose, purpose-built) using tools like FMBench.
    2. Capacity Management: Determining the number of instances needed, utilizing on-demand capacity reservation (ODCR), and exploring spot instances.
    3. Commitments: Evaluating long-term (1-3 years) instance family commitments and using instance or compute savings plans.
    4. GPU Utilization: Monitoring and maximizing GPU utilization to get more performance and value from the same instances.
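The commitment decision above comes down to a break-even calculation: a savings plan charges its discounted rate for every hour of the term, while on-demand charges only for hours actually used. A minimal sketch with hypothetical rates (not real EC2 prices):

```python
# Commitment break-even check: a 1- or 3-year savings plan only pays
# off if the instance is busy enough. All rates are hypothetical
# placeholders, not real EC2 prices.

def breakeven_utilization(on_demand_hourly: float, plan_hourly: float) -> float:
    """Fraction of hours the instance must run for the plan to win.

    The plan bills plan_hourly for every hour of the term; on-demand
    bills on_demand_hourly only for hours actually used. The plan is
    cheaper once utilization exceeds plan_hourly / on_demand_hourly.
    """
    return plan_hourly / on_demand_hourly

# Hypothetical example: a GPU instance at $32.77/hr on-demand vs a
# $19.00/hr effective rate under a 1-year commitment.
u = breakeven_utilization(32.77, 19.00)
print(f"Plan wins above {u:.0%} utilization")
```

This is also why GPU utilization monitoring matters: a committed instance sitting idle below the break-even fraction would have been cheaper on-demand.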

Lessons from Capital One's Journey

  • Brent shares Capital One's experience in building and optimizing AI applications in a self-managed environment:
    1. Instance cost and availability for accelerated instances differ from those of traditional instances.
    2. Interpreting cost and performance for accelerated instances requires looking at GPU-specific metrics (utilization, wattage, thermals).
    3. External factors, such as model type and architecture, can significantly impact instance performance.

Partly Managed AI Approach (Amazon SageMaker)

Optimizing AI Applications with Amazon SageMaker

  • Amazon SageMaker handles the infrastructure (tools, workflows, software configurations) so that you can focus on building and delivering business value.
  • Key optimization considerations:
    1. Instance Selection: Choosing the right instance type based on the workload (e.g., notebooks, training).
    2. Model Selection: Selecting the appropriate model for the problem, considering factors like data, resources required, and cost.
    3. Commitments: Leveraging Amazon SageMaker commitments to get up to 64% cost reduction.
    4. Spot Instances: Utilizing spot instances for training to save up to 90%.
    5. Inference: Choosing the appropriate inference type (real-time, serverless, asynchronous, batch) based on the use case and cost requirements.
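The spot-training savings above come with a caveat: interrupted jobs resume from checkpoints, so some compute time is repeated. A back-of-the-envelope sketch of the trade-off, where the 70% spot discount and the 15% interruption overhead are illustrative assumptions:

```python
# Expected cost of a training job on spot vs on-demand capacity.
# The spot discount and the interruption-overhead factor below are
# illustrative assumptions, not guaranteed figures.

def training_cost(hours: float, hourly_rate: float,
                  spot_discount: float = 0.0,
                  overhead: float = 1.0) -> float:
    """Cost = hours * rate, discounted for spot, inflated for retries.

    overhead > 1.0 models extra time spent resuming from checkpoints
    after spot interruptions.
    """
    return hours * overhead * hourly_rate * (1.0 - spot_discount)

on_demand = training_cost(10, 4.0)  # 10 hours at a hypothetical $4/hr
spot = training_cost(10, 4.0, spot_discount=0.7, overhead=1.15)
print(on_demand, round(spot, 2))
```

Even with the retry overhead, spot stays well under on-demand here, which is why it is attractive for interruptible training workloads.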

Fully Managed AI Approach (Amazon Bedrock)

Optimizing AI Applications with Amazon Bedrock

  • Amazon Bedrock offers high-performing foundation models through a single API, allowing you to focus more on the application and less on infrastructure management.
  • Key optimization considerations:
    1. Pricing Model: Choosing between on-demand, provisioned, and batch pricing options based on the predictability of the workload.
    2. Model Selection: Considering the cost, speed, and accuracy of different models, and testing multiple models to find the best fit.
    3. Knowledge Bases: Carefully managing the quantity and frequency of data used to augment the model's knowledge base.
    4. Fine-Tuning and Model Distillation: Leveraging Bedrock's features to fine-tune models and create smaller, more cost-effective models.
    5. Application Layer: Implementing strategies like prompt caching and pre/post-processing to optimize costs.
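Prompt caching, mentioned in the last point, saves money because a reused prompt prefix (for example, a long system prompt) is billed at a reduced rate on subsequent calls. A minimal sketch with hypothetical per-token prices and a hypothetical cache discount (not published Bedrock rates):

```python
# On-demand token pricing with a prompt cache: a repeated prompt
# prefix is billed at a discounted rate on later calls. All prices
# and the cache discount are illustrative assumptions.

PRICE_IN = 0.003 / 1000    # hypothetical $/input token
PRICE_OUT = 0.015 / 1000   # hypothetical $/output token
CACHED_DISCOUNT = 0.9      # assume cached input tokens cost 90% less

def request_cost(input_tokens: int, output_tokens: int,
                 cached_tokens: int = 0) -> float:
    """Dollar cost of one call; cached_tokens is the reused prefix."""
    fresh = input_tokens - cached_tokens
    return (fresh * PRICE_IN
            + cached_tokens * PRICE_IN * (1 - CACHED_DISCOUNT)
            + output_tokens * PRICE_OUT)

# A 2,000-token system prompt reused across calls, plus 500 new input
# tokens and 800 output tokens per request.
no_cache = request_cost(2500, 800)
with_cache = request_cost(2500, 800, cached_tokens=2000)
print(round(no_cache, 4), round(with_cache, 4))
```

The longer and more frequently reused the prefix, the larger the share of input-token spend the cache eliminates.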

Supporting Services

  • Considerations for supporting services, such as storage (S3), vector databases (OpenSearch), data transfer, and analytics, can also impact the overall cost optimization of the AI application.

Summary

  • The speakers have provided a comprehensive overview of different approaches to building and optimizing AI applications, using the Cosmo example to illustrate the key concepts.
  • Attendees are encouraged to fill out the session survey and apply the learnings to their own AI cost optimization journey.
