AWS re:Invent 2025 - Transforming Apache Kafka into a Scalable Message Queue (OPN413)

Transforming Apache Kafka into a Scalable Message Queue

Overview

  • The presentation discusses how the open-source community has extended Apache Kafka to support queue-like capabilities, addressing the limitations of Kafka's native streaming model.
  • Key challenges addressed include handling spikes in ingress traffic, scaling outbound consumption, head-of-line blocking, and the need for queue-like patterns beyond Kafka's default design.

Kafka Limitations and Motivation for Change

  • Kafka's native model is optimized for high-throughput event streaming, but many use cases require patterns for independent work item processing, workload distribution, and application decoupling.
  • Challenges include:
    • Handling spikes in ingress traffic that outpace consumer processing, leading to rising consumer lag
    • Scaling outbound consumption to match high ingress throughput
    • Head-of-line blocking due to Kafka's ordering guarantees within partitions
    • The need to build custom frameworks to force Kafka to behave like a queue

Introducing Queues for Kafka (KIP-932)

  • KIP-932 introduces "share groups," a new type of consumer group that allows multiple consumers to read data from the same partition in parallel.
  • Key capabilities:
    • Parallel consumption without the need for additional partitions
    • Fine-grained control over message processing, including at-least-once delivery and retries
    • Elimination of head-of-line blocking by removing the requirement for strict in-order processing
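A share consumer loop built on the KIP-932 client API might look like the sketch below. The broker address, topic, and group names are illustrative, and the example needs a Kafka 4.x broker with share groups enabled to actually run; `KafkaShareConsumer`, `AcknowledgeType`, and the explicit acknowledgement mode are as described in KIP-932.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.AcknowledgeType;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaShareConsumer;

public class ShareGroupWorker {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("group.id", "billing-workers");           // share group id (illustrative)
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("share.acknowledgement.mode", "explicit"); // per-record acks, per KIP-932

        try (KafkaShareConsumer<String, String> consumer = new KafkaShareConsumer<>(props)) {
            consumer.subscribe(List.of("jobs"));            // illustrative topic name
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    try {
                        process(record.value());
                        consumer.acknowledge(record, AcknowledgeType.ACCEPT);
                    } catch (RuntimeException transientError) {
                        // Release the record so it becomes available for redelivery.
                        consumer.acknowledge(record, AcknowledgeType.RELEASE);
                    }
                }
                consumer.commitSync();                      // deliver the acknowledgements
            }
        }
    }

    static void process(String value) { /* application logic */ }
}
```

Because acknowledgements are per record rather than per offset position, a slow or failed record does not block the records behind it.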

How Share Groups Work

  • Share groups enable multiple consumers to read from the same partition concurrently, improving parallelism compared to the traditional consumer group model.
  • Each partition consumed by a share group is called a "share partition," bounded by a "share partition start offset" (SPSO) and a "share partition end offset" (SPEO) that delimit the in-flight records.
  • The share partition leader controls the number of in-flight records to manage concurrency and prevent overload.
  • Records transition through states: available (eligible for delivery), acquired (locked by a consumer), acknowledged (successfully processed), and archived (no longer deliverable).
  • Ordering is maintained within each consumer's batch, but not across multiple batches, prioritizing scalability over strict sequencing.
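The offset window described above can be sketched as a small simulation. This is illustrative only, not Kafka's implementation: the class name and in-flight cap are invented, and only the SPSO/SPEO bookkeeping idea comes from the talk.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch: a share partition tracks a window of in-flight records between the
// share partition start offset (SPSO) and end offset (SPEO), and the share
// partition leader caps how many records may be in flight at once.
class SharePartitionSketch {
    enum State { AVAILABLE, ACQUIRED, ACKNOWLEDGED, ARCHIVED }

    private final Map<Long, State> inFlight = new TreeMap<>();
    private final int maxInFlight;
    private long spso = 0;   // share partition start offset
    private long speo = 0;   // share partition end offset (next offset to hand out)

    SharePartitionSketch(int maxInFlight) { this.maxInFlight = maxInFlight; }

    // A consumer fetch acquires up to 'count' records, bounded by the in-flight cap.
    List<Long> acquire(int count) {
        List<Long> acquired = new ArrayList<>();
        while (acquired.size() < count && inFlight.size() < maxInFlight) {
            inFlight.put(speo, State.ACQUIRED);
            acquired.add(speo++);
        }
        return acquired;
    }

    // Successful processing: mark acknowledged, then advance the SPSO past any
    // leading records that are fully done (acknowledged or archived).
    void acknowledge(long offset) {
        inFlight.put(offset, State.ACKNOWLEDGED);
        Iterator<Map.Entry<Long, State>> it = inFlight.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Long, State> e = it.next();
            if (e.getKey() == spso
                    && (e.getValue() == State.ACKNOWLEDGED || e.getValue() == State.ARCHIVED)) {
                it.remove();
                spso++;
            } else {
                break;
            }
        }
    }

    long startOffset() { return spso; }
    int inFlightCount() { return inFlight.size(); }
}
```

Note how the SPSO only advances over completed records: records can be acknowledged out of order, which is exactly why cross-batch ordering is not guaranteed.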

Handling Bad Records and Offsets

  • Share groups can deal with bad records to prevent consumer groups from stalling:
    • Transient errors: The consumer can release the record so it becomes available for redelivery, or reject it so it is archived.
    • Deserialization errors: Kafka automatically releases the record and allows processing to continue.
    • Corrupt record batches: Kafka rejects the entire batch to prevent other consumers from picking it up.
  • Offsets can have gaps due to compaction or log retention, but consumers can handle these gracefully.
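The acknowledgement outcomes above can be sketched as a small per-record state machine. The three acknowledge types (accept/release/reject) mirror KIP-932; the class name and the fixed delivery limit are invented stand-ins for Kafka's configurable behavior.

```java
// Sketch: each in-flight record carries a delivery count. A released record
// becomes available again for redelivery; a rejected record, or one that
// exhausts its delivery attempts, is archived so it cannot stall the group.
class InFlightRecord {
    enum AckType { ACCEPT, RELEASE, REJECT }
    enum State { AVAILABLE, ACQUIRED, ACKNOWLEDGED, ARCHIVED }

    static final int MAX_DELIVERY_COUNT = 5; // assumed limit; Kafka's is configurable

    State state = State.AVAILABLE;
    int deliveryCount = 0;

    void acquire() {
        if (state != State.AVAILABLE) throw new IllegalStateException("record not available");
        state = State.ACQUIRED;
        deliveryCount++;
    }

    void acknowledge(AckType type) {
        switch (type) {
            case ACCEPT:
                state = State.ACKNOWLEDGED;               // processed successfully
                break;
            case REJECT:
                state = State.ARCHIVED;                   // permanent error: skip it
                break;
            case RELEASE:
                state = (deliveryCount >= MAX_DELIVERY_COUNT)
                        ? State.ARCHIVED                  // poison pill: give up
                        : State.AVAILABLE;                // transient error: retry
                break;
        }
    }
}
```

The delivery-count cap is what keeps a persistently failing record from being retried forever: after enough failed attempts it is archived and the share group moves on.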

Use Cases for Queues in Kafka

  1. Task Worker Processing: Decoupling the upload latency from heavy message processing, enabling horizontal scaling of worker pools.
  2. Microservice Job Queues: Handling asynchronous workloads like email sending, PDF generation, and billing without strict ordering requirements.
  3. Asynchronous Request Execution: Running operations asynchronously to keep the application responsive, such as fraud detection during checkout.
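The task-worker pattern in item 1 can be illustrated with a plain Java worker pool draining one shared queue, a stand-in for many share consumers reading the same partition (all names here are invented; this is the pattern, not Kafka code):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: N workers pull independently from one logical queue, so the worker
// pool scales horizontally without repartitioning the source.
class TaskWorkerDemo {
    // Returns how many tasks the pool processed in total.
    static int run(int tasks, int workers) {
        BlockingQueue<Integer> queue = new LinkedBlockingQueue<>();
        for (int i = 0; i < tasks; i++) queue.add(i);

        AtomicInteger processed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int w = 0; w < workers; w++) {
            pool.submit(() -> {
                Integer task;
                while ((task = queue.poll()) != null) {
                    processed.incrementAndGet();  // stand-in for heavy per-message work
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return processed.get();
    }
}
```

With classic consumer groups the parallelism ceiling is the partition count; share groups lift that ceiling, which is what makes this pattern practical on a single hot partition.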

Technical Implementation Details

  • Share groups introduce a new internal topic that durably stores share group state, such as acknowledgements and delivery counts.
  • The "share partition leader" is a new component that resides on the same broker as the topic partition leader, managing the share group state and record delivery.
  • The share partition leader handles record locking, automatic release upon timeout, and retries for failed records.
  • Broker and client-side metrics are available to monitor the health and performance of share groups.

Limitations and Roadmap

  • The share group capability is currently in preview in Kafka 4.1 and not recommended for production use yet.
  • Limitations include lack of dead-letter queue support, inability to monitor share group lag, and unsuitability for workloads requiring strict ordering and exactly-once processing.
  • Upcoming roadmap items include making the feature production-ready, adding dead-letter queue support, and improving monitoring and configuration options.

Key Takeaways

  • KIP-932 introduces "queues for Kafka," enabling parallel consumption and queue-like patterns within the Kafka ecosystem.
  • This expands Kafka's capabilities beyond its traditional streaming use cases, addressing common challenges around scaling, head-of-line blocking, and the need for queue-like semantics.
  • Share groups provide a Kafka-native way to handle spikes in ingress traffic, scale outbound consumption, and process asynchronous workloads more efficiently.
  • The feature is currently in preview, with plans to make it production-ready and address remaining limitations in future Kafka releases.