AWS re:Invent 2025 - AI-Driven Operations: Scaling Observability and Cost Optimization (GBL201)

AI-Driven Operations: Scaling Observability and Cost Optimization

Overview

This presentation showcases two case studies on implementing resilience at scale and optimizing costs using AI. It covers the latest AWS services and features announced at re:Invent 2025, including advancements in AI infrastructure, observability tools, and cost management solutions.

AWS AI Infrastructure Updates

  • Trainium3, the latest generation of the Trainium chip, was introduced in a new UltraServer with 144 chips.
  • NVIDIA GB200 and GB300 (Grace Blackwell) GPUs are now offered in the P6e EC2 instances.
  • SageMaker, AWS's end-to-end machine learning platform, now includes serverless model customization and checkpointless training.

Observability and Agent-Based AI

  • Amazon Bedrock supports developing and managing AI agents; updates were announced for Amazon Bedrock AgentCore and the Strands Agents SDK.
  • New agent-based AI capabilities were introduced, including:
    • Kiro Autonomous Agent for fully automated task completion
    • Security agent for real-time security checks on code updates
    • DevOps agent for monitoring, analysis, and automated remediation

Samsung Electronics' Case Study

Cost Optimization with AI

  • Samsung's MX division has used AWS services since 2009, with costs and usage growing by more than 10% annually.
  • To address cost and stability challenges, they launched two projects:
    1. FinOps: Defined a "unit cost" metric to track cost per transaction and applied AI-driven cost optimization.
    2. AIOps: Focused on operational stability, using AI for anomaly detection, cloud architecture review, and failure analysis.
  • Key results:
    • 10.4% annual cost reduction for the Bixby service
    • Reduced time to analyze data from over a month to near real-time using a multi-agent AI architecture
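
A cost-per-transaction metric like the one described above can be sketched in a few lines. This is a minimal illustration, not Samsung's implementation: the `MonthlyUsage` type, the function names, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MonthlyUsage:
    month: str          # e.g. "2025-01"
    cost_usd: float     # total cloud spend attributed to the service
    transactions: int   # transactions served in the same period


def unit_cost(usage: MonthlyUsage) -> float:
    """Cost per transaction; raises if no transactions were recorded."""
    if usage.transactions <= 0:
        raise ValueError(f"no transactions recorded for {usage.month}")
    return usage.cost_usd / usage.transactions


def unit_cost_change(previous: MonthlyUsage, current: MonthlyUsage) -> float:
    """Relative change in unit cost between two periods (negative = cheaper)."""
    before = unit_cost(previous)
    return (unit_cost(current) - before) / before


# Illustrative numbers only (assumption, not data from the talk).
jan = MonthlyUsage("2025-01", cost_usd=120_000.0, transactions=400_000_000)
feb = MonthlyUsage("2025-02", cost_usd=126_000.0, transactions=450_000_000)
change = unit_cost_change(jan, feb)
print(f"unit cost change: {change:+.1%}")
```

In this made-up example total spend rises by 5% while cost per transaction falls by about 6.7%, which is the kind of distinction a raw monthly bill hides.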

AI-Driven Observability

  • Implemented a multi-agent architecture using Amazon Bedrock AgentCore and the Strands Agents SDK to:
    • Collect data, arrange responses, and provide insights to users
    • Integrate with Amazon OpenSearch and Redshift for cost data
  • Demonstrated use cases for:
    • Identifying current monthly costs, resource usage patterns, and optimization opportunities
    • Comparing services to optimize RI/SP coverage
    • Providing recommendations for cost savings
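
The collect, arrange, and answer flow above can be sketched with plain Python functions standing in for the specialist agents. This is only an assumed shape of such a system: it does not use any agent SDK API, keyword routing replaces the model-driven planning a real deployment would likely use, and `HOURLY_USAGE`, the agent names, and the 70% coverage target are all hypothetical.

```python
import re
from typing import Callable, Dict, List

# Hypothetical in-memory stand-in for cost data that would come from a
# search index or data warehouse in a real system.
HOURLY_USAGE = [
    {"service": "api", "hours": 1000.0, "covered_hours": 820.0},
    {"service": "batch", "hours": 400.0, "covered_hours": 120.0},
]


def coverage_agent(question: str) -> str:
    """Specialist: RI/SP coverage = covered hours / eligible hours."""
    parts = []
    for row in HOURLY_USAGE:
        pct = row["covered_hours"] / row["hours"]
        parts.append(f"{row['service']}: {pct:.0%} covered")
    return "; ".join(parts)


def recommendation_agent(question: str) -> str:
    """Specialist: flag services below an assumed 70% coverage target."""
    low = [r["service"] for r in HOURLY_USAGE
           if r["covered_hours"] / r["hours"] < 0.70]
    if not low:
        return "no action"
    return "review commitments for: " + ", ".join(low)


AGENTS: Dict[str, Callable[[str], str]] = {
    "coverage": coverage_agent,
    "recommend": recommendation_agent,
}

# Keywords that send a question to each specialist (illustrative only).
ROUTES = {
    "coverage": {"coverage", "ri", "sp"},
    "recommend": {"recommend", "recommendation", "savings", "save"},
}


def route(question: str) -> List[str]:
    """Pick specialists whose keywords appear as whole words in the question."""
    words = set(re.findall(r"[a-z]+", question.lower()))
    return [name for name, keywords in ROUTES.items() if words & keywords]


def orchestrate(question: str) -> str:
    """Call each matched specialist and arrange the replies into one answer."""
    names = route(question)
    if not names:
        return "no specialist agent matched the question"
    return "\n".join(f"[{name}] {AGENTS[name](question)}" for name in names)


answer = orchestrate("What is our RI/SP coverage, and where can we find savings?")
print(answer)
```

Keeping each specialist behind the same `str -> str` signature means a new data source or analysis can be added by registering one more function, without touching the orchestrator.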

Kakao's Case Study

Scaling Observability with AI

  • KakaoTalk has 49 million monthly active users, generating over 20 TB of logs per day.
  • Challenges with existing log management solutions:
    • Elasticsearch scalability issues and high costs
    • Loki performance degradation with high-cardinality queries
    • Fragmented observability across multiple tools
  • Adopted a new strategy:
    1. Unified log repository using ClickHouse, a column-oriented database
    2. Leveraged ClickHouse's backup and restore capabilities for high availability
    3. Utilized Graviton instances for cost-effective performance
  • Integrated AI-powered log analysis:
    • Analyzed service status and usage patterns from logs
    • Developed prompts to generate detailed business insights

Key Takeaways

  • Importance of clear leadership, data governance, and appropriate AI application
  • Benefits of integrating AI across the entire business process, not just for specific tasks
  • Significance of data quality and pipeline management for effective AI-driven operations
  • Need to balance the latest technologies against simpler, more cost-effective solutions, based on data characteristics

Technical Details

  • AWS services and features: Trainium3, NVIDIA Grace Blackwell GPUs, SageMaker, Bedrock, Strands Agents SDK, Amazon OpenSearch Service, Amazon Redshift
  • Samsung's architecture: Multi-agent design using Amazon Bedrock AgentCore, Amazon OpenSearch Service, and Amazon Redshift
  • Kakao's architecture: ClickHouse as unified log repository, Graviton instances, OpenTelemetry standard

Business Impact

  • Samsung achieved 10.4% annual cost reduction for the Bixby service and near real-time data analysis.
  • Kakao improved operational efficiency, resource utilization, and query performance by transitioning to the new observability platform.
  • Both companies demonstrated the ability to leverage AI for cost optimization, operational stability, and business insights, leading to improved decision-making and competitiveness.

Examples

  • Samsung's "unit cost" metric to track cost per transaction and optimize costs
  • Kakao's use of ClickHouse's backup and restore capabilities for high availability
  • Demonstrations of AI-powered log analysis for service status, usage patterns, and business insights
