AWS re:Invent 2025 - Operating Apache Kafka and Apache Flink at scale (ANT307)

Operating Apache Kafka and Apache Flink at Scale

Streaming Data and Real-Time Analytics

The Value of Streaming Data

Unlocking real-time value from data
Enabling continuous intelligence and faster decision-making
Providing more contextualized data for AI workloads

AWS's Streaming Services

AWS has offered managed Apache Kafka (Amazon MSK) and managed Apache Flink services for more than five years
Customers use both self-managed and AWS-managed streaming solutions

Key Outcomes for Customers

Price performance
Reliability
Security
High performance

Operating Apache Kafka at Scale

Nexink Customer Example

Scaled from 200MB/s to 5GB/s of throughput on Amazon MSK
Learned valuable lessons about operating Kafka at scale

Spectrum of Kafka Management

Standard brokers: Fine-grained Kafka control for migrations
MSK Serverless: Zero Kafka management for new users
MSK Express Brokers: Balanced performance and elasticity

Storage Management Challenges

Storage scaling takes time, and customers often don't monitor utilization
MSK Express eliminates storage management and provides unlimited, pay-as-you-go storage

Failure Handling and Recovery

Producers, consumers, replication, and partition compute all impact failure handling
Standard Kafka recovery is time-consuming and non-deterministic
MSK Express separates resources for up to 90% faster recovery

Horizontal Scaling and Rebalancing

Rebalancing partitions can take hours and impact throughput
MSK Express provides up to 20x more elasticity with intelligent rebalancing
180x faster rebalancing compared to standard brokers

Resilience and Operational Awareness

Standard Kafka is an "allocate on demand" system, which can lead to overload and failures
MSK Express adds dynamic throttles and failure isolation to protect brokers
Automatic patching, partition fairness, and fault isolation improve resilience
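
One common way to build a throughput throttle is a token bucket. The sketch below is a generic illustration of that idea only; it is not how MSK Express implements its throttles, and the class name, rates, and sizes are hypothetical. A dynamic throttle would additionally adjust the rate at runtime, which this sketch does not do.

```python
class TokenBucket:
    """Generic token-bucket throttle (illustrative; not MSK's implementation)."""

    def __init__(self, rate_bytes_per_s: float, burst_bytes: float):
        self.rate = rate_bytes_per_s      # steady-state refill rate
        self.capacity = burst_bytes       # maximum burst the bucket can hold
        self.tokens = burst_bytes         # start with a full bucket
        self.last = 0.0                   # time of the previous call, in seconds

    def allow(self, size_bytes: float, now: float) -> bool:
        """Return True if a request of size_bytes may proceed at time `now`."""
        # Refill according to elapsed time, capped at the bucket capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if size_bytes <= self.tokens:
            self.tokens -= size_bytes
            return True
        # Not enough tokens: the caller should back off and retry later.
        return False
```

Rejecting a request here stands in for the broker slowing a client down rather than letting one workload exhaust shared resources.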

Monitoring Recommendations

Key metrics to monitor: Partitions, connections, CPU, disk, memory, throughput
Set alerts to proactively manage Kafka clusters
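
A minimal sketch of threshold-based alerting over the metrics listed above. The metric names and limits are hypothetical placeholders, not AWS recommendations or real CloudWatch metric names; in practice the samples would come from your monitoring system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str
    limit: float


# Hypothetical limits for illustration; tune these for your own cluster.
THRESHOLDS = [
    Threshold("cpu_percent", 60.0),
    Threshold("disk_used_percent", 80.0),
    Threshold("memory_used_percent", 85.0),
    Threshold("partitions_per_broker", 1000.0),
    Threshold("connections", 3000.0),
    Threshold("bytes_in_mb_per_s", 100.0),
]


def breached(sample: dict[str, float], thresholds=THRESHOLDS) -> list[str]:
    """Return the names of metrics in `sample` that exceed their limit.

    Metrics missing from the sample are treated as 0 and never alert;
    a production check would likely flag missing data separately.
    """
    return [t.metric for t in thresholds if sample.get(t.metric, 0.0) > t.limit]
```

For example, `breached({"cpu_percent": 72.0, "connections": 3500})` reports both metrics, since each is above its placeholder limit.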

Operating Apache Flink at Scale

Flink's Key Strengths

Real-time processing
Handling dynamic data sets
Stateful processing
Programmability

Flink Operating Model Changes

Shift from batch-based to continuous, event-driven processing
New concepts: Event time, windows, partitioning, exactly-once processing
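
To make event time and windows concrete, here is a small pure-Python sketch (not Flink API code) that assigns events to tumbling windows by the timestamp carried in the event rather than by arrival order. The function names and the `(key, event_time_ms)` tuple shape are assumptions for illustration.

```python
from collections import defaultdict


def window_start(event_time_ms: int, size_ms: int) -> int:
    """Start of the tumbling window that contains event_time_ms."""
    return event_time_ms - (event_time_ms % size_ms)


def count_per_window(events, size_ms):
    """Count events per (key, window start), using event time only.

    `events` is an iterable of (key, event_time_ms) pairs in arrival order.
    """
    counts = defaultdict(int)
    for key, event_time_ms in events:
        counts[(key, window_start(event_time_ms, size_ms))] += 1
    return dict(counts)
```

An event that arrives last but carries an early timestamp still lands in the early window, which is the main difference from processing-time logic.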

Flink Architecture and Resilience

Job Manager coordinates, Task Managers execute
Checkpointing state to durable storage for fault tolerance
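
The idea behind checkpointing can be sketched with a toy, single-process model: state and input position are snapshotted together, and after a failure the job restores the last snapshot and replays from that position. This is a simplification for intuition only, not Flink's API or its distributed snapshot protocol; all names are hypothetical.

```python
import copy


class CountingOperator:
    """Toy stateful operator: counts records per key."""

    def __init__(self):
        self.counts = {}
        self.offset = 0  # number of input records already processed

    def process(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1
        self.offset += 1

    def snapshot(self):
        # State and input position are captured together.
        return copy.deepcopy({"counts": self.counts, "offset": self.offset})

    def restore(self, snap):
        self.counts = dict(snap["counts"])
        self.offset = snap["offset"]


def run(records, checkpoint_every, fail_at=None):
    """Process records, optionally simulating one failure at offset fail_at."""
    op = CountingOperator()
    durable = op.snapshot()  # stands in for durable storage
    failed = False
    while op.offset < len(records):
        if fail_at is not None and not failed and op.offset == fail_at:
            failed = True
            op = CountingOperator()  # in-memory state is lost
            op.restore(durable)      # resume from the last checkpoint
            continue
        op.process(records[op.offset])
        if op.offset % checkpoint_every == 0:
            durable = op.snapshot()
    return op.counts
```

Because the replay starts from the offset stored in the snapshot, the final counts match a run with no failure.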

Managed Flink Service

Eliminates infrastructure setup and configuration
Provides built-in multi-AZ resilience

Minimizing Processing Interruptions

Blue-green deployments and warm pools reduce job downtime
Automated state restoration and smart guardrails for code changes

Shared Mental Models

Fixed units (KPUs) provide predictable performance
Monitoring job availability, not just infrastructure

Flink Code Best Practices

Proper use of event time and watermarks
Efficient state management and avoiding data skew
Optimizing serialization and schemas
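
As an illustration of the watermark point above, the following pure-Python sketch tracks a watermark that trails the largest event time seen by a fixed allowed delay, and flags events that arrive behind it as late. It follows the spirit of a bounded out-of-orderness strategy but is not Flink's implementation; the class and method names are hypothetical.

```python
class BoundedDelayWatermark:
    """Watermark = max event time seen so far minus an allowed delay."""

    def __init__(self, max_delay_ms: int):
        self.max_delay_ms = max_delay_ms
        self.max_seen_ms = None  # no events observed yet

    def current(self):
        """Current watermark in ms, or None before the first event."""
        if self.max_seen_ms is None:
            return None
        return self.max_seen_ms - self.max_delay_ms

    def observe(self, event_time_ms: int) -> bool:
        """Record an event; return True if it is behind the watermark (late)."""
        wm = self.current()
        late = wm is not None and event_time_ms < wm
        if self.max_seen_ms is None or event_time_ms > self.max_seen_ms:
            self.max_seen_ms = event_time_ms
        return late
```

The allowed delay is the trade-off knob: a larger value tolerates more out-of-order data but delays window results, while a smaller value emits results sooner and drops or side-routes more late events.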

Key Takeaways

Shared responsibility between AWS and customers for operating Kafka and Flink at scale
Leverage AWS's managed services to benefit from built-in best practices and resilience
Proactive monitoring and understanding failure modes are critical for running these systems
Optimizing application code is as important as managing the underlying infrastructure
