AI-Driven Operations: Scaling Observability and Cost Optimization
Overview
This presentation showcases two case studies on implementing resilience at scale and optimizing costs using AI. It covers the latest AWS services and features announced at re:Invent 2025, including advancements in AI infrastructure, observability tools, and cost management solutions.
AWS AI Infrastructure Updates
Trainium3, the latest generation of the Trainium chip, was introduced in a new UltraServer with 144 chips.
NVIDIA's GB200 and GB300 (Grace Blackwell) GPUs are now offered on P6e EC2 instances.
SageMaker, AWS's end-to-end machine learning platform, now includes serverless model customization and checkpointless training.
Observability and Agent-Based AI
Amazon Bedrock is a service for developing and managing AI agents, with updates to AgentCore and the Strands Agents SDK.
New agent-based AI capabilities were introduced, including:
Kiro Autonomous Agent for fully automated task completion
Security agent for real-time security checks on code updates
DevOps agent for monitoring, analysis, and automated remediation
Samsung Electronics' Case Study
Cost Optimization with AI
Samsung's MX division has used AWS services since 2009, with costs and usage growing more than 10% annually.
To address cost and stability challenges, they launched two projects:
FinOps: Defined a unit cost metric to track cost per transaction and applied AI-driven cost optimization.
AIOps: Focused on operational stability, using AI for anomaly detection, cloud architecture review, and failure analysis.
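A cost-per-transaction metric of the kind described above can be sketched in a few lines. This is a minimal illustration, not Samsung's implementation: the `MonthlyUsage` record, its field names, and the sample figures are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MonthlyUsage:
    """Hypothetical record: one service's bill and traffic for one month."""
    service: str
    month: str
    cost_usd: float
    transactions: int


def unit_cost(usage: MonthlyUsage) -> float:
    """Cost per transaction in USD."""
    if usage.transactions <= 0:
        raise ValueError("transactions must be positive")
    return usage.cost_usd / usage.transactions


def unit_cost_change(previous: MonthlyUsage, current: MonthlyUsage) -> float:
    """Relative change in unit cost between two periods (negative means cheaper)."""
    prev = unit_cost(previous)
    return (unit_cost(current) - prev) / prev


# Made-up numbers: the total bill rises, but traffic rises faster.
january = MonthlyUsage("example-service", "2025-01", 100_000.0, 50_000_000)
february = MonthlyUsage("example-service", "2025-02", 108_000.0, 60_000_000)
change = unit_cost_change(january, february)
print(f"unit cost change: {change:+.1%}")
```

In this made-up case the bill grows by 8% while the unit cost falls, which is the distinction a per-transaction metric is meant to expose.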
Key results:
10.4% annual cost reduction for the Bixby service
Reduced time to analyze data from over a month to near real-time using a multi-agent AI architecture
AI-Driven Observability
Implemented a multi-agent architecture using Amazon Bedrock AgentCore and the Strands Agents SDK to:
Collect data, arrange responses, and provide insights to users
Integrate with Amazon OpenSearch and Redshift for cost data
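The collect, assemble, and respond flow above can be sketched as a small orchestrator that routes a question to specialist agents and merges their answers. This is plain Python written for illustration only: it does not use the AgentCore or Strands APIs, and the `Orchestrator` class, the agent functions, and the keyword routing are all assumptions.

```python
from typing import Callable, Dict, List

Agent = Callable[[str], str]


def cost_agent(question: str) -> str:
    # Stand-in for a query against a cost store such as Redshift or OpenSearch.
    return f"[cost] placeholder cost rows for: {question}"


def usage_agent(question: str) -> str:
    # Stand-in for a resource-usage lookup.
    return f"[usage] placeholder usage rows for: {question}"


class Orchestrator:
    """Hypothetical router: picks agents by keyword and merges their output."""

    def __init__(self) -> None:
        self._agents: Dict[str, Agent] = {}

    def register(self, keyword: str, agent: Agent) -> None:
        self._agents[keyword.lower()] = agent

    def route(self, question: str) -> List[str]:
        """Return the keywords of the agents that should handle the question."""
        text = question.lower()
        return [keyword for keyword in self._agents if keyword in text]

    def answer(self, question: str) -> str:
        """Collect each selected agent's output and join it into one reply."""
        parts = [self._agents[keyword](question) for keyword in self.route(question)]
        return "\n".join(parts) if parts else "No agent matched the question."


orchestrator = Orchestrator()
orchestrator.register("cost", cost_agent)
orchestrator.register("usage", usage_agent)
print(orchestrator.answer("What is this month's cost and usage for the service?"))
```

A production system would likely let a model choose the agents rather than match keywords; the keyword lookup here only keeps the sketch self-contained.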
Demonstrated use cases for:
Identifying current monthly costs, resource usage patterns, and optimization opportunities
Comparing services to optimize RI/SP coverage
Providing recommendations for cost savings
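Comparing services by Reserved Instance and Savings Plan (RI/SP) coverage comes down to a simple ratio: covered hours divided by total eligible hours. The sketch below ranks services from lowest to highest coverage so the largest gaps surface first. The service names and hour counts are hypothetical, and the talk summary does not describe how the agents actually compute this.

```python
from typing import Dict, List, Tuple


def coverage(covered_hours: float, on_demand_hours: float) -> float:
    """Share of eligible usage hours covered by an RI or Savings Plan."""
    total = covered_hours + on_demand_hours
    return covered_hours / total if total > 0 else 0.0


def rank_by_coverage(usage: Dict[str, Tuple[float, float]]) -> List[Tuple[str, float]]:
    """Sort services by coverage, lowest first."""
    scored = [(name, coverage(covered, on_demand))
              for name, (covered, on_demand) in usage.items()]
    return sorted(scored, key=lambda item: item[1])


# Made-up (covered_hours, on_demand_hours) per service.
usage = {
    "service-a": (750.0, 250.0),
    "service-b": (200.0, 800.0),
    "service-c": (500.0, 500.0),
}
ranking = rank_by_coverage(usage)
for name, ratio in ranking:
    print(f"{name}: {ratio:.0%} covered")
```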
Kakao's Case Study
Scaling Observability with AI
KakaoTalk has 49 million monthly active users, generating over 20TB of logs per day.
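To put the daily volume in perspective, a quick back-of-the-envelope calculation converts it into an average ingest rate and a per-user figure. It assumes decimal units (1 TB = 10^12 bytes) and an even spread over the day, so real peaks would be higher; since the source says "over 20TB", both results are lower bounds.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_ingest_mb_per_s(tb_per_day: float) -> float:
    """Average ingest rate in MB/s, using decimal units (1 TB = 10**6 MB)."""
    return tb_per_day * 1_000_000 / SECONDS_PER_DAY


def kb_per_user_per_day(tb_per_day: float, users: int) -> float:
    """Average daily log volume per user in KB (1 TB = 10**9 KB)."""
    return tb_per_day * 1_000_000_000 / users


rate = average_ingest_mb_per_s(20)
per_user = kb_per_user_per_day(20, 49_000_000)
print(f"average ingest: {rate:.0f} MB/s")
print(f"per monthly active user: {per_user:.0f} KB/day")
```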
Challenges with existing log management solutions:
Elasticsearch scalability issues and high costs
Loki performance degradation with high-cardinality queries
Fragmented observability across multiple tools
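The summary does not say what caused the Loki slowdown. One common source of high-cardinality trouble, offered here as an assumption, is that Loki treats every unique combination of label values as a separate stream, so the number of possible streams grows multiplicatively with each label. The label names and counts below are hypothetical.

```python
import math
from typing import Dict


def max_streams(distinct_values: Dict[str, int]) -> int:
    """Upper bound on streams: the product of distinct values per label."""
    return math.prod(distinct_values.values())


# Hypothetical label sets: a bounded one, then the same set with a pod label.
bounded = {"service": 50, "environment": 3, "level": 5}
with_pod = {**bounded, "pod": 2000}

print(f"bounded labels: up to {max_streams(bounded):,} streams")
print(f"with pod label: up to {max_streams(with_pod):,} streams")
```

The product is only an upper bound, since not every combination occurs in practice, but it shows why a single high-cardinality label can dominate the stream count.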
Adopted a new strategy:
Unified log repository using ClickHouse, a column-oriented database
Leveraged ClickHouse's backup and restore capabilities for high availability
Utilized Graviton instances for cost-effective performance
Integrated AI-powered log analysis:
Analyzed service status and usage patterns from logs
Developed prompts to generate detailed business insights
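Prompt-based log analysis of this kind can be sketched as two steps: aggregate the raw logs into a compact summary, then embed that summary in a prompt for a language model. The log fields, the prompt wording, and the function names are assumptions made for illustration; Kakao's actual prompts are not given in the summary, and the model call itself is left out.

```python
from collections import Counter
from typing import Dict, Iterable


def summarize(logs: Iterable[Dict[str, str]]) -> Counter:
    """Count log entries per (service, level) pair."""
    return Counter((entry["service"], entry["level"]) for entry in logs)


def build_prompt(summary: Counter, window: str) -> str:
    """Turn aggregated counts into a prompt asking for status and usage insights."""
    lines = [f"- {service} {level}: {count}"
             for (service, level), count in sorted(summary.items())]
    return (
        "You are an operations analyst.\n"
        f"Time window: {window}\n"
        "Log counts by service and level:\n"
        + "\n".join(lines)
        + "\nDescribe the service status, notable usage patterns, and any business impact."
    )


# Made-up log entries.
sample_logs = [
    {"service": "chat", "level": "ERROR"},
    {"service": "chat", "level": "ERROR"},
    {"service": "chat", "level": "INFO"},
    {"service": "search", "level": "INFO"},
]
summary = summarize(sample_logs)
prompt = build_prompt(summary, "last 15 minutes")
print(prompt)
```

Sending aggregates rather than raw lines keeps the prompt small, which matters at tens of terabytes of logs per day.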
Key Takeaways
Importance of clear leadership, data governance, and appropriate AI application
Benefits of integrating AI across the entire business process, not just for specific tasks
Significance of data quality and pipeline management for effective AI-driven operations
Need to balance latest technologies with simpler, cost-effective solutions based on data characteristics
Kakao's architecture: ClickHouse as the unified log repository, Graviton instances, and the OpenTelemetry standard
Business Impact
Samsung achieved 10.4% annual cost reduction for the Bixby service and near real-time data analysis.
Kakao improved operational efficiency, resource utilization, and query performance by transitioning to the new observability platform.
Both companies demonstrated the ability to leverage AI for cost optimization, operational stability, and business insights, leading to improved decision-making and competitiveness.
Examples
Samsung's unit cost metric to track cost per transaction and optimize costs
Kakao's use of ClickHouse's backup and restore capabilities for high availability
Demonstrations of AI-powered log analysis for service status, usage patterns, and business insights