AWS re:Invent 2025 - Operating Apache Kafka and Apache Flink at scale (ANT307)

Operating Apache Kafka and Apache Flink at Scale

Streaming Data and Real-Time Analytics

The Value of Streaming Data

  • Unlocking real-time value from data
  • Enabling continuous intelligence and faster decision-making
  • Providing more contextualized data for AI workloads

AWS's Streaming Services

  • AWS has offered managed Apache Kafka (Amazon MSK) and Apache Flink services for over 5 years
  • Customers use both self-managed and AWS-managed streaming solutions

Key Outcomes for Customers

  • Price performance
  • Reliability
  • Security
  • High performance

Operating Apache Kafka at Scale

Nexthink Customer Example

  • Scaled from 200 MB/s to 5 GB/s of throughput on Amazon MSK
  • Learned valuable lessons about operating Kafka at scale

Spectrum of Kafka Management

  • Standard brokers: Fine-grained Kafka control for migrations
  • MSK Serverless: Zero Kafka management for new users
  • MSK Express Brokers: Balanced performance and elasticity

Storage Management Challenges

  • Storage scaling takes time, and customers often don't monitor utilization closely enough to start it early
  • MSK Express eliminates storage management and provides unlimited, pay-as-you-go storage
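
To make the storage concern concrete, the sketch below estimates how much disk a cluster with provisioned storage needs for a given ingress rate and retention period. It is a back-of-the-envelope calculation, not a sizing tool: the function name, the replication factor of 3, and the 20% headroom are assumptions chosen for illustration, and the 200 MB/s input simply reuses the starting throughput from the customer example.

```python
def required_storage_gb(ingress_mb_per_s, retention_hours,
                        replication_factor=3, headroom=0.2):
    """Estimate cluster-wide disk (decimal GB) needed to hold retained data.

    replication_factor and headroom are illustrative assumptions.
    """
    retained_mb = ingress_mb_per_s * retention_hours * 3600  # data kept at any time
    replicated_mb = retained_mb * replication_factor         # every replica stores a copy
    return replicated_mb * (1 + headroom) / 1000             # add headroom, MB -> GB


# 200 MB/s with 24 hours of retention comes to roughly 62,000 GB.
estimate_gb = required_storage_gb(200, 24)
print(round(estimate_gb))
```

Because the result grows linearly with throughput, a jump from 200 MB/s to 5 GB/s multiplies the requirement by 25, which is why unmonitored utilization becomes a problem quickly.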

Failure Handling and Recovery

  • Producers, consumers, replication, and partition compute all impact failure handling
  • Standard Kafka recovery is time-consuming and non-deterministic
  • MSK Express separates resources for up to 90% faster recovery
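
On the producer side, failure handling is largely a matter of client configuration. The sketch below checks a producer configuration for settings that commonly affect behavior during a broker failure. The property names are standard Kafka producer properties; the specific values, the `durability_warnings` helper, and the 30-second lower bound are assumptions for illustration, not recommendations from the talk.

```python
# Example producer settings (values are illustrative assumptions).
producer_config = {
    "acks": "all",                  # wait for all in-sync replicas to acknowledge
    "enable.idempotence": True,     # retries do not create duplicate records
    "delivery.timeout.ms": 120000,  # upper bound on send time, including retries
    "retry.backoff.ms": 200,        # pause between retry attempts
}


def durability_warnings(config):
    """Return human-readable warnings for settings that weaken failure handling."""
    warnings = []
    if str(config.get("acks")) not in ("all", "-1"):
        warnings.append("acks is not 'all': acknowledged writes can be lost on broker failure")
    if not config.get("enable.idempotence", False):
        warnings.append("idempotence is off: retries may produce duplicates")
    # Hypothetical lower bound: too short a timeout gives up before a leader election completes.
    if config.get("delivery.timeout.ms", 0) < 30000:
        warnings.append("delivery.timeout.ms is short: sends may fail during leader election")
    return warnings


print(durability_warnings(producer_config))                   # no warnings for the config above
print(durability_warnings({"acks": "1", "delivery.timeout.ms": 5000}))
```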

Horizontal Scaling and Rebalancing

  • Rebalancing partitions can take hours and impact throughput
  • MSK Express provides up to 20x more elasticity with intelligent rebalancing
  • 180x faster rebalancing compared to standard brokers

Resilience and Operational Awareness

  • Standard Kafka is an "allocate on demand" system: it accepts work without reserving capacity first, which can lead to overload and failures
  • MSK Express adds dynamic throttles and failure isolation to protect brokers
  • Automatic patching, partition fairness, and fault isolation improve resilience

Monitoring Recommendations

  • Key metrics to monitor: partition count, connection count, CPU, disk, memory, and throughput
  • Set alerts to proactively manage Kafka clusters
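
One way to picture proactive alerting on these metrics is as a table of per-broker thresholds evaluated against each sample. The sketch below is a minimal illustration: the metric names are generic labels rather than the names any particular monitoring system exposes, and every threshold value is a placeholder assumption to be replaced with limits appropriate to the broker type in use.

```python
# Hypothetical per-broker alert thresholds (all values are placeholders).
ALERT_THRESHOLDS = {
    "partitions_per_broker": 1000,
    "connections_per_broker": 3000,
    "cpu_percent": 60.0,
    "disk_used_percent": 80.0,
    "memory_used_percent": 85.0,
    "bytes_in_mb_per_s": 100.0,
}


def breached(sample):
    """Return the sorted names of metrics in `sample` that exceed their threshold."""
    return sorted(
        name
        for name, limit in ALERT_THRESHOLDS.items()
        if sample.get(name, 0) > limit
    )


sample = {"cpu_percent": 72.0, "disk_used_percent": 55.0, "partitions_per_broker": 1200}
print(breached(sample))  # ['cpu_percent', 'partitions_per_broker']
```

Setting thresholds below the point of failure is what makes the alert proactive: it leaves time to scale or rebalance before brokers are overloaded.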

Operating Apache Flink at Scale

Flink's Key Strengths

  • Real-time processing
  • Handling dynamic data sets
  • Stateful processing
  • Programmability

Flink Operating Model Changes

  • Shift from batch-based to continuous, event-driven processing
  • New concepts: Event time, windows, partitioning, exactly-once processing
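
Event time and windows are easier to reason about with a small simulation. The sketch below is plain Python, not Flink code: it counts events per key in tumbling event-time windows and uses a watermark that trails the highest timestamp seen by a fixed bound. The window size, the out-of-orderness bound, and the function name are assumptions for illustration.

```python
from collections import defaultdict

WINDOW_MS = 10_000            # tumbling window size (assumed)
MAX_OUT_OF_ORDER_MS = 2_000   # how far the watermark trails the max timestamp (assumed)


def tumbling_counts(events):
    """events: (event_time_ms, key) pairs in arrival order.

    Returns (emitted, late), where emitted holds (window_start, key, count)
    in firing order and late holds events whose window had already fired.
    """
    windows = defaultdict(lambda: defaultdict(int))  # window_start -> key -> count
    emitted, late = [], []
    watermark = float("-inf")

    def fire(up_to):
        # Emit and discard every window that ends at or before the watermark.
        for start in sorted(w for w in windows if w + WINDOW_MS <= up_to):
            for key, count in sorted(windows.pop(start).items()):
                emitted.append((start, key, count))

    for ts, key in events:
        start = ts - ts % WINDOW_MS
        if start + WINDOW_MS <= watermark:
            late.append((ts, key))      # window already fired: treat as late
            continue
        windows[start][key] += 1
        watermark = max(watermark, ts - MAX_OUT_OF_ORDER_MS)
        fire(watermark)

    fire(float("inf"))                  # end of input: flush remaining windows
    return emitted, late


events = [(1000, "a"), (4000, "b"), (11000, "a"), (9000, "a"),
          (13000, "a"), (3000, "a"), (25000, "b")]
emitted, late = tumbling_counts(events)
print(emitted)  # [(0, 'a', 2), (0, 'b', 1), (10000, 'a', 2), (20000, 'b', 1)]
print(late)     # [(3000, 'a')]
```

The event at 9000 arrives out of order but is still counted because the watermark has not yet passed the end of its window, while the event at 3000 arrives after the window fired and is reported as late.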

Flink Architecture and Resilience

  • Job Manager coordinates, Task Managers execute
  • Checkpointing state to durable storage for fault tolerance
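
The idea behind checkpointing can be sketched without Flink: periodically persist both the operator state and the source position, and on restart resume from the last snapshot. The example below is a simplified, single-process illustration that writes to a local JSON file in place of durable storage; the function name, file layout, and checkpoint interval are all assumptions.

```python
import json
import os
import tempfile


def run(events, checkpoint_path, checkpoint_every=3, crash_at=None):
    """Count events per key, snapshotting state and source offset together."""
    if os.path.exists(checkpoint_path):
        # Restore state and position from the last completed checkpoint.
        with open(checkpoint_path) as f:
            snapshot = json.load(f)
        offset, counts = snapshot["offset"], snapshot["counts"]
    else:
        offset, counts = 0, {}

    for i in range(offset, len(events)):
        if crash_at is not None and i == crash_at:
            raise RuntimeError("simulated task failure")
        counts[events[i]] = counts.get(events[i], 0) + 1
        if (i + 1) % checkpoint_every == 0:
            # Write to a temp file, then rename, so a crash never leaves a partial snapshot.
            tmp_path = checkpoint_path + ".tmp"
            with open(tmp_path, "w") as f:
                json.dump({"offset": i + 1, "counts": counts}, f)
            os.replace(tmp_path, checkpoint_path)
    return counts


events = ["a", "b", "a", "c", "a", "b", "c"]
path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")

try:
    run(events, path, crash_at=5)   # fails after one checkpoint was taken
except RuntimeError:
    pass

recovered = run(events, path)       # resumes from offset 3 and replays the rest
print(recovered)                    # {'a': 3, 'b': 2, 'c': 2}
```

Work done after the last checkpoint is lost in the crash and replayed on restart, yet each event affects the final state exactly once because the state and the offset are saved together.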

Managed Flink Service

  • Eliminates infrastructure setup and configuration
  • Provides built-in multi-AZ resilience

Minimizing Processing Interruptions

  • Blue-green deployments and warm pools reduce job downtime
  • Automated state restoration and smart guardrails for code changes

Shared Mental Models

  • Fixed units (KPUs) provide predictable performance
  • Monitoring job availability, not just infrastructure
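
Fixed units make capacity arithmetic straightforward. The sketch below assumes that one KPU corresponds to 1 vCPU and 4 GB of memory, and that the number of KPUs allocated to an application equals its parallelism divided by its parallelism per KPU, rounded up. Neither figure is stated in this summary, so both should be checked against the current service documentation; the function name is hypothetical.

```python
import math

VCPU_PER_KPU = 1        # assumption, verify against service documentation
MEMORY_GB_PER_KPU = 4   # assumption, verify against service documentation


def kpu_footprint(parallelism, parallelism_per_kpu=1):
    """Estimate the KPUs, vCPUs, and memory behind a given job parallelism."""
    kpus = math.ceil(parallelism / parallelism_per_kpu)
    return {
        "kpus": kpus,
        "vcpus": kpus * VCPU_PER_KPU,
        "memory_gb": kpus * MEMORY_GB_PER_KPU,
    }


print(kpu_footprint(12, parallelism_per_kpu=4))  # {'kpus': 3, 'vcpus': 3, 'memory_gb': 12}
```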

Flink Code Best Practices

  • Proper use of event time and watermarks
  • Efficient state management and avoiding data skew
  • Optimizing serialization and schemas
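
Data skew is worth a small illustration. When one key carries most of the traffic, a single parallel task processes most of the records. A common mitigation, shown here in plain Python instead of Flink code, is to append a salt to the key, aggregate per salted key, and then merge the partial results under the original key. The bucket count, the "#" separator, and the function names are assumptions for this sketch.

```python
from collections import Counter

SALT_BUCKETS = 4  # number of sub-keys each key is spread over (assumed)


def salt(keys):
    """Spread each key over SALT_BUCKETS sub-keys, round-robin by position."""
    return [f"{key}#{i % SALT_BUCKETS}" for i, key in enumerate(keys)]


def partial_counts(salted_keys):
    """First stage: aggregate per salted key (parallelizable across tasks)."""
    return Counter(salted_keys)


def merge(partials):
    """Second stage: strip the salt and combine the partial aggregates."""
    totals = Counter()
    for salted_key, count in partials.items():
        key = salted_key.rsplit("#", 1)[0]
        totals[key] += count
    return totals


keys = ["hot"] * 90 + ["cold-a"] * 6 + ["cold-b"] * 4
partials = partial_counts(salt(keys))
totals = merge(partials)

print(max(Counter(keys).values()))  # 90: largest group without salting
print(max(partials.values()))       # 23: largest group after salting
print(totals == Counter(keys))      # True: merged totals match the unsalted result
```

The trade-off is a second aggregation stage and extra state for the salted keys, so salting is usually reserved for keys that are known to be hot.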

Key Takeaways

  • Shared responsibility between AWS and customers for operating Kafka and Flink at scale
  • Leverage AWS's managed services to benefit from built-in best practices and resilience
  • Proactive monitoring and understanding failure modes are critical for running these systems
  • Optimizing application code is as important as managing the underlying infrastructure
