AWS re:Invent 2025 - Operating Apache Kafka and Apache Flink at scale (ANT307)
Streaming Data and Real-Time Analytics
The Value of Streaming Data
Unlocking real-time value from data
Enabling continuous intelligence and faster decision-making
Providing more contextualized data for AI workloads
AWS's Streaming Services
AWS has offered managed Apache Kafka (Amazon MSK) and Apache Flink services for over 5 years
Customers use both self-managed and AWS-managed streaming solutions
Key Outcomes for Customers
Price performance
Reliability
Security
High performance
Operating Apache Kafka at Scale
Nexthink Customer Example
Scaled from 200MB/s to 5GB/s of throughput on Amazon MSK
Learned valuable lessons about operating Kafka at scale
Spectrum of Kafka Management
Standard brokers: Fine-grained Kafka control for migrations
MSK Serverless: Zero Kafka management for new users
MSK Express Brokers: Balanced performance and elasticity
Storage Management Challenges
Storage scaling takes time, and customers often don't monitor utilization
MSK Express eliminates storage management and provides unlimited, pay-as-you-go storage
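To make the storage pressure concrete, here is a rough back-of-the-envelope sketch of how much disk a cluster needs to retain data at the two throughput figures from the customer example. The 24-hour retention and replication factor of 3 are assumptions for illustration, not figures from the talk, and the function name is hypothetical.

```python
def retained_storage_tb(throughput_mb_s, retention_hours, replication_factor=3):
    """Approximate cluster-wide storage in decimal TB.

    Assumes steady ingest, no compression, and no tiered storage.
    replication_factor=3 is an illustrative assumption.
    """
    seconds = retention_hours * 3600
    total_mb = throughput_mb_s * seconds * replication_factor
    return total_mb / 1_000_000  # MB -> TB (decimal units)


if __name__ == "__main__":
    for label, mb_s in (("200 MB/s", 200), ("5 GB/s", 5000)):
        tb = retained_storage_tb(mb_s, retention_hours=24)
        print(f"{label}: about {tb:.2f} TB for 24 hours of retention")
```

Under these assumptions the 25x jump in throughput becomes a 25x jump in provisioned disk, which is why unmonitored utilization becomes a problem on brokers with fixed storage.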
Failure Handling and Recovery
Producers, consumers, replication, and partition compute all impact failure handling
Standard Kafka recovery is time-consuming and non-deterministic
MSK Express separates resources for up to 90% faster recovery
Horizontal Scaling and Rebalancing
Rebalancing partitions can take hours and impact throughput
MSK Express provides up to 20x more elasticity with intelligent rebalancing
180x faster rebalancing compared to standard brokers
Resilience and Operational Awareness
Standard Kafka is an "allocate on demand" system, which can lead to overload and failures
MSK Express adds dynamic throttles and failure isolation to protect brokers
Automatic patching, partition fairness, and fault isolation improve resilience
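The talk does not describe how MSK Express implements its throttles. As a generic illustration of the idea of protecting a broker from overload, the following is a minimal token-bucket sketch, a common rate-limiting technique swapped in here for illustration. The class name, rates, and sizes are all hypothetical.

```python
class TokenBucket:
    """Generic token-bucket throttle (illustrative, not the MSK mechanism)."""

    def __init__(self, rate_bytes_per_s, burst_bytes):
        self.rate = rate_bytes_per_s      # refill rate
        self.capacity = burst_bytes       # maximum burst size
        self.tokens = burst_bytes         # start with a full bucket
        self.last = 0.0                   # timestamp of the last refill

    def allow(self, nbytes, now):
        """Return True if a request of nbytes may proceed at time `now` (seconds)."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False  # caller should back off instead of overloading the broker


if __name__ == "__main__":
    bucket = TokenBucket(rate_bytes_per_s=100, burst_bytes=200)
    print(bucket.allow(150, now=0.0))  # fits in the initial burst
    print(bucket.allow(100, now=0.0))  # rejected, only 50 tokens remain
    print(bucket.allow(100, now=1.0))  # accepted after one second of refill
```

A "dynamic" throttle would adjust the rate at runtime based on broker health; that part is left out of this sketch.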
Monitoring Recommendations
Key metrics to monitor: partitions, connections, CPU, disk, memory, and throughput
Set alerts to proactively manage Kafka clusters
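One way to picture proactive alerting is a simple threshold check over the metrics listed above. This is a sketch only: the metric names and every threshold value below are placeholders chosen for illustration, not AWS recommendations or real CloudWatch metric names.

```python
# Hypothetical metric names and placeholder limits; tune these for your cluster.
THRESHOLDS = {
    "cpu_percent": 60.0,
    "disk_used_percent": 80.0,
    "memory_used_percent": 85.0,
    "partitions_per_broker": 1000,
    "connections": 2500,
    "bytes_in_mb_s": 100.0,
}


def breached(sample, thresholds=THRESHOLDS):
    """Return the sorted names of metrics in `sample` that exceed their limit.

    Metrics missing from the sample are skipped rather than treated as zero,
    so an absent reading is never mistaken for a healthy one.
    """
    return sorted(
        name
        for name, limit in thresholds.items()
        if name in sample and sample[name] > limit
    )


if __name__ == "__main__":
    sample = {"cpu_percent": 75.0, "disk_used_percent": 50.0, "connections": 3000}
    print(breached(sample))
```

In practice the same logic would live in the alarm configuration of a monitoring system instead of application code; the point is that each listed metric gets an explicit limit before it becomes an incident.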
Operating Apache Flink at Scale
Flink's Key Strengths
Real-time processing
Handling dynamic data sets
Stateful processing
Programmability
Flink Operating Model Changes
Shift from batch-based to continuous, event-driven processing
New concepts: Event time, windows, partitioning, exactly-once processing
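The concepts of event time, windows, and partitioning can be sketched in a few lines of plain Python. This is not the Flink API; it only shows that a tumbling window is assigned from the timestamp carried by the event and from its key, so arrival order does not change the result. The function name and event shape are assumptions.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_ms):
    """Count events per (key, window_start).

    events: iterable of (key, event_time_ms) tuples.
    The window is derived from the event's own timestamp (event time),
    not from the moment the event is processed.
    """
    counts = defaultdict(int)
    for key, event_time_ms in events:
        window_start = event_time_ms - (event_time_ms % window_ms)
        counts[(key, window_start)] += 1
    return dict(counts)


if __name__ == "__main__":
    in_order = [("a", 1000), ("b", 1500), ("a", 1999), ("a", 2000)]
    out_of_order = [("a", 2000), ("a", 1999), ("b", 1500), ("a", 1000)]
    print(tumbling_window_counts(in_order, window_ms=1000))
    # Same result regardless of arrival order:
    print(tumbling_window_counts(out_of_order, 1000) == tumbling_window_counts(in_order, 1000))
```

A real streaming engine also has to decide when a window is complete and can be emitted, which is what watermarks are for.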
Flink Architecture and Resilience
Job Manager coordinates, Task Managers execute
Checkpointing state to durable storage for fault tolerance
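The idea behind checkpointing can be illustrated with a toy job that snapshots its state together with its input position, then recovers from that snapshot after a simulated crash. This is a simplified sketch, not how Flink implements distributed checkpoints; the class, file layout, and local temp directory standing in for durable storage are all assumptions.

```python
import json
import os
import tempfile


class CountingJob:
    """Toy stateful job: counts keys and tracks how far it has read."""

    def __init__(self):
        self.offset = 0   # position in the input
        self.counts = {}  # operator state

    def process(self, records, upto):
        while self.offset < min(upto, len(records)):
            key = records[self.offset]
            self.counts[key] = self.counts.get(key, 0) + 1
            self.offset += 1

    def checkpoint(self, path):
        # Write state and offset together, then rename atomically so a
        # crash mid-write never leaves a half-written checkpoint behind.
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"offset": self.offset, "counts": self.counts}, f)
        os.replace(tmp, path)

    @classmethod
    def restore(cls, path):
        job = cls()
        with open(path) as f:
            snapshot = json.load(f)
        job.offset = snapshot["offset"]
        job.counts = snapshot["counts"]
        return job


def demo():
    records = ["a", "b", "a", "c", "a"]
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "checkpoint.json")
        job = CountingJob()
        job.process(records, upto=3)
        job.checkpoint(path)
        job.process(records, upto=4)  # progress made after the checkpoint
        del job                       # simulated crash: that progress is lost
        recovered = CountingJob.restore(path)
        recovered.process(records, upto=len(records))  # replay from offset 3
        return recovered.counts


if __name__ == "__main__":
    print(demo())
```

Because the offset and the counts are saved as one unit, replaying from the checkpointed offset counts each record exactly once in the recovered state, even though one record was processed twice in wall-clock terms.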
Managed Flink Service
Eliminates infrastructure setup and configuration
Provides built-in multi-AZ resilience
Minimizing Processing Interruptions
Blue-green deployments and warm pools reduce job downtime
Automated state restoration and smart guardrails for code changes
Shared Mental Models
Fixed units (KPUs) provide predictable performance
Monitoring job availability, not just infrastructure
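As a rough sizing sketch for the fixed-unit model: in Amazon Managed Service for Apache Flink, one KPU corresponds to 1 vCPU and 4 GB of memory, and the number of KPUs follows from the application parallelism divided by the parallelism per KPU. The helper below is hypothetical and ignores billing details such as any additional KPU charged for orchestration.

```python
import math

VCPU_PER_KPU = 1
MEMORY_GB_PER_KPU = 4


def kpu_estimate(parallelism, parallelism_per_kpu=1):
    """Estimate KPUs and the compute they represent for a given parallelism."""
    if parallelism < 1 or parallelism_per_kpu < 1:
        raise ValueError("parallelism values must be at least 1")
    kpus = math.ceil(parallelism / parallelism_per_kpu)
    return {
        "kpus": kpus,
        "vcpus": kpus * VCPU_PER_KPU,
        "memory_gb": kpus * MEMORY_GB_PER_KPU,
    }


if __name__ == "__main__":
    print(kpu_estimate(parallelism=10, parallelism_per_kpu=4))
```

Because every unit has the same shape, a team can reason about capacity as "how many KPUs does this job need" instead of tuning CPU and memory separately.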
Flink Code Best Practices
Proper use of event time and watermarks
Efficient state management and avoiding data skew
Optimizing serialization and schemas
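For the first practice, the following is a minimal plain-Python sketch of a bounded-out-of-orderness watermark: the watermark trails the highest event time seen so far by a fixed allowance, and events at or behind the watermark are treated as late. It is modeled loosely on the strategy of the same name in Flink but is not the Flink API; the class name and the 2-second allowance are assumptions.

```python
class BoundedOutOfOrdernessWatermark:
    """Tracks a watermark that lags the maximum seen event time."""

    def __init__(self, max_out_of_orderness_ms):
        self.max_delay = max_out_of_orderness_ms
        self.max_ts = None  # highest event time observed so far

    def current(self):
        """Current watermark in ms, or None before the first event."""
        if self.max_ts is None:
            return None
        return self.max_ts - self.max_delay - 1

    def on_event(self, event_time_ms):
        """Record an event; return True if it arrived behind the watermark."""
        watermark = self.current()
        late = watermark is not None and event_time_ms <= watermark
        if self.max_ts is None or event_time_ms > self.max_ts:
            self.max_ts = event_time_ms
        return late


if __name__ == "__main__":
    wm = BoundedOutOfOrdernessWatermark(max_out_of_orderness_ms=2000)
    for ts in (10_000, 9_000, 13_000, 10_500):
        print(ts, "late" if wm.on_event(ts) else "on time", "watermark:", wm.current())
```

The allowance is a trade-off: a larger value tolerates more disorder but delays window results, while a smaller value emits sooner and drops or side-routes more late events.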
Key Takeaways
Shared responsibility between AWS and customers for operating Kafka and Flink at scale
Leverage AWS's managed services to benefit from built-in best practices and resilience
Proactive monitoring and understanding failure modes are critical for running these systems
Optimizing application code is as important as managing the underlying infrastructure