AWS re:Invent 2025 - Balance cost, performance & reliability for AI at enterprise scale (AIM3304)

Balancing Cost, Performance, and Reliability for AI at Enterprise Scale

Overview of Inference Tiers

  • The presentation draws an analogy between different airline travel options and the inference tiers available in Amazon Bedrock:
    • Private Plane (Reserved Capacity): Dedicated capacity reserved for your exclusive use, providing the highest level of control and reliability.
    • First Class (Priority Tier): Prioritized access to inference requests, with a premium price for low-latency, high-reliability processing.
    • Economy Plus (Standard Tier): The default on-demand option that Bedrock has offered to date, providing a balance of cost and performance.
    • Basic Economy (Flex Tier): A new discounted tier that trades off latency for cost savings, suitable for less time-sensitive workloads.
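The airline analogy above can be condensed into a small tier-selection heuristic. The enum values follow the talk's tier names, but the function and its criteria are an illustrative sketch, not a Bedrock API:

```python
from enum import Enum

class Tier(Enum):
    RESERVED = "reserved_capacity"  # "private plane": dedicated, pre-purchased
    PRIORITY = "priority"           # "first class": premium, low latency
    STANDARD = "standard"           # "economy plus": default on-demand
    FLEX = "flex"                   # "basic economy": discounted, latency-tolerant

def choose_tier(latency_sensitive: bool,
                predictable_high_throughput: bool,
                cost_sensitive: bool = False) -> Tier:
    """Heuristic sketch of the talk's tier guidance; criteria are illustrative."""
    if predictable_high_throughput:
        return Tier.RESERVED   # guaranteed capacity for steady, heavy load
    if latency_sensitive:
        return Tier.PRIORITY   # pay a premium to avoid throttling and retries
    if cost_sensitive:
        return Tier.FLEX       # trade latency for roughly 50% lower cost
    return Tier.STANDARD       # balanced default
```

In practice the decision also depends on accuracy requirements and traffic shape, which the next section covers.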

Balancing Cost, Latency, and Accuracy

  • Every production AI workload requires balancing three key factors:
    • Accuracy: The quality and correctness of the model's outputs.
    • Speed/Latency: The time it takes to complete an inference request.
    • Cost: The financial cost of running the inference workload.
  • The criticality of these factors is use-case dependent. Some workloads require the highest accuracy, while others prioritize speed or cost.
  • The new inference tiers provide flexibility to optimize for these tradeoffs on a per-request basis.
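The cost side of this tradeoff can be made concrete with back-of-envelope arithmetic. The per-token prices below are hypothetical placeholders (not real Bedrock rates); only the roughly 50% Flex discount is taken from the talk:

```python
# Hypothetical Standard-tier prices, USD per 1K tokens (placeholders, not AWS rates).
STANDARD_INPUT_PER_1K = 0.003
STANDARD_OUTPUT_PER_1K = 0.015
FLEX_DISCOUNT = 0.50  # ~50% discount vs. Standard, per the talk

def monthly_cost(requests: int, in_tokens: int, out_tokens: int,
                 discount: float = 0.0) -> float:
    """Estimated monthly spend for a workload at a given tier discount."""
    per_request = ((in_tokens / 1000) * STANDARD_INPUT_PER_1K
                   + (out_tokens / 1000) * STANDARD_OUTPUT_PER_1K)
    return requests * per_request * (1 - discount)

# 1M latency-tolerant requests/month, 2K input and 500 output tokens each:
standard = monthly_cost(1_000_000, 2000, 500)
flex = monthly_cost(1_000_000, 2000, 500, discount=FLEX_DISCOUNT)
print(f"Standard: ${standard:,.0f}  Flex: ${flex:,.0f}  Savings: ${standard - flex:,.0f}")
```

At these placeholder rates, moving a latency-tolerant workload from Standard to Flex halves the bill; the same math applied in reverse shows what a Priority premium buys.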

Use Case Examples

  • Latency-Sensitive Use Cases:
    • Flight booking during checkout
    • Gate change announcements
    • Mobile app check-in
  • Latency-Tolerant Use Cases:
    • Crew scheduling and assignments
    • Loyalty program mileage posting

Intuit's Experience

  • Intuit's generative AI platform, Intuit Genos, uses a model router to serve different AI-powered experiences with varying requirements.
  • For seasonal, high-throughput workloads (e.g., TurboTax during tax season), Intuit leverages the Reserved Capacity tier to guarantee performance and reliability.
  • For spiky, latency-sensitive daily traffic, Intuit uses the Priority tier to ensure low-latency processing without overpaying for reserved capacity.
  • For experimental, non-critical workloads, Intuit utilizes the Flex tier to optimize for cost while still meeting latency requirements.
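A model-router policy like the one the talk attributes to Intuit can be sketched as a simple workload-to-tier mapping. The workload names and rules here are illustrative assumptions, not Intuit's actual configuration:

```python
# Illustrative per-workload routing policy (tier names from the talk;
# workload names are hypothetical).
ROUTING_POLICY = {
    "tax_season_turbotax": "reserved_capacity",  # seasonal, predictable high throughput
    "checkout_assistant": "priority",            # spiky, latency-sensitive daily traffic
    "experimental_summaries": "flex",            # non-critical, cost-optimized
}

def route(workload: str) -> str:
    """Return the inference tier for a workload, defaulting to on-demand Standard."""
    return ROUTING_POLICY.get(workload, "standard")
```

A production router would also consider model choice, accuracy targets, and fallback behavior, but the core idea is the same: the tier is a per-request routing decision, not a global setting.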

Technical Deep Dive

  1. Standard Tier:

    • The default on-demand inference option, designed for day-to-day workloads.
    • Supports explicit prompt caching, providing both performance and cost benefits.
    • May experience occasional throttling and retries; pair it with monitoring and alerting to maintain reliability.
  2. Reserved Capacity Tier:

    • Provides guaranteed, pre-purchased inference capacity for predictable, high-throughput workloads.
    • Offers flexible provisioning of input and output tokens to match the specific needs of the use case.
    • Supports explicit prompt caching with a different burndown rate to optimize costs.
    • Allows bursting to the Standard Tier when the reserved capacity is exceeded.
  3. Priority Tier:

    • Designed for latency-sensitive, spiky workloads that cannot tolerate retries or throttling.
    • Provides a premium, pay-as-you-go option with higher priority and faster processing.
    • Supports prompt caching with a discounted rate on the premium pricing.
    • Can offer better end-to-end latency by optimizing the fleet configuration.
  4. Flex Tier:

    • A new discounted tier for latency-tolerant, batch-oriented workloads.
    • Offers around a 50% discount compared to the Standard Tier.
    • Suitable for automated workflows, reporting, and other non-critical, batch-oriented use cases.
    • Also supports prompt caching with the same discounted rate as the tier's base pricing.
  5. Batch Inference:

    • Provides a dedicated option for bulk processing of large numbers of prompts.
    • Offers a 24-hour completion window with a 50% discount on token pricing.
    • Suitable for use cases like evaluations, reporting, and other repetitive, non-real-time workloads.
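Prompt caching, which several of the tiers above support, changes the cost equation for requests that reuse a large shared prefix. The prices below are hypothetical placeholders; actual cache-read rates and burndown behavior vary by model and tier, as the talk notes:

```python
# Hypothetical prices, USD per 1K tokens (placeholders, not AWS rates).
INPUT_PER_1K = 0.003        # standard input price
CACHE_READ_PER_1K = 0.0003  # assumed cheaper rate for cached prefix tokens

def request_cost(cached_tokens: int, fresh_tokens: int) -> float:
    """Input cost of one request: cached prefix tokens plus fresh tokens."""
    return ((cached_tokens / 1000) * CACHE_READ_PER_1K
            + (fresh_tokens / 1000) * INPUT_PER_1K)

# A 4K-token system prompt reused across requests, plus 500 fresh tokens:
uncached = request_cost(0, 4500)     # whole prompt billed at the input rate
cached = request_cost(4000, 500)     # reused prefix billed at the cache-read rate
print(f"uncached={uncached:.4f}  cached={cached:.4f}")
```

At these placeholder rates the cached request costs a fifth of the uncached one, which is why the talk highlights explicit prompt caching as both a performance and a cost lever.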

Key Takeaways

  • The new inference tiers in Amazon Bedrock provide flexibility to optimize for cost, latency, and accuracy on a per-request basis.
  • Intuit's experience demonstrates how different tiers can be leveraged to serve a variety of AI-powered use cases with varying requirements.
  • The technical details highlight the specific capabilities and characteristics of each tier, enabling enterprises to make informed decisions about their inference workloads.
  • Enterprises can now more effectively balance the tradeoffs between cost, performance, and reliability to power their mission-critical and experimental AI applications at scale.
