AWS re:Invent 2025 - Transforming Apache Kafka into a Scalable Message Queue (OPN413)

Transforming Apache Kafka into a Scalable Message Queue

Overview

  • The presentation discusses how the open-source community has extended Apache Kafka to support queue-like capabilities, addressing the limitations of Kafka's native streaming model.
  • Key challenges addressed include handling spikes in ingress traffic, scaling outbound consumption, head-of-line blocking, and the need for queue-like patterns beyond Kafka's default design.

Kafka Limitations and Motivation for Change

  • Kafka's native model is optimized for high-throughput event streaming, but many use cases require patterns for independent work item processing, workload distribution, and application decoupling.
  • Challenges include:
    • Handling spikes in ingress traffic that outpace consumer processing, leading to rising consumer lag
    • Scaling outbound consumption to match high ingress throughput
    • Head-of-line blocking due to Kafka's ordering guarantees within partitions
    • The need to build custom frameworks to force Kafka to behave like a queue

Introducing Queues for Kafka (KIP-932)

  • KIP-932 introduces "share groups," a new type of consumer group that allows multiple consumers to read data from the same partition in parallel.
  • Key capabilities:
    • Parallel consumption without the need for additional partitions
    • Fine-grained control over message processing, including at-least-once delivery and retries
    • Elimination of head-of-line blocking by removing the requirement for strict in-order processing
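A share consumer loop built on the KIP-932 client API might look like the sketch below. The broker address, topic, and group names are illustrative, and the example needs a Kafka 4.x broker with share groups enabled to actually run; `KafkaShareConsumer`, `AcknowledgeType`, and the explicit acknowledgement mode are as described in KIP-932.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.AcknowledgeType;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaShareConsumer;

public class ShareGroupWorker {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("group.id", "billing-workers");           // share group id (illustrative)
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("share.acknowledgement.mode", "explicit"); // per-record acks, per KIP-932

        try (KafkaShareConsumer<String, String> consumer = new KafkaShareConsumer<>(props)) {
            consumer.subscribe(List.of("jobs"));            // illustrative topic name
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    try {
                        process(record.value());
                        consumer.acknowledge(record, AcknowledgeType.ACCEPT);
                    } catch (RuntimeException transientError) {
                        // Release the record so it becomes available for redelivery.
                        consumer.acknowledge(record, AcknowledgeType.RELEASE);
                    }
                }
                consumer.commitSync();                      // deliver the acknowledgements
            }
        }
    }

    static void process(String value) { /* application logic */ }
}
```

Because acknowledgements are per record rather than per offset position, a slow or failed record does not block the records behind it.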

How Share Groups Work

  • Share groups enable multiple consumers to read from the same partition concurrently, improving parallelism compared to the traditional consumer group model.
  • Each partition consumed by a share group is called a "share partition," bounded by a "share partition start offset" (SPSO) and a "share partition end offset" (SPEO) that delimit the in-flight records.
  • The share partition leader controls the number of in-flight records to manage concurrency and prevent overload.
  • Records transition through states: available (eligible for delivery), acquired (locked by a consumer), acknowledged (successfully processed), and archived (no longer deliverable).
  • Ordering is maintained within each consumer's batch, but not across multiple batches, prioritizing scalability over strict sequencing.
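The offset window described above can be sketched as a small simulation. This is illustrative only, not Kafka's implementation: the class name and in-flight cap are invented, and only the SPSO/SPEO bookkeeping idea comes from the talk.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch: a share partition tracks a window of in-flight records between the
// share partition start offset (SPSO) and end offset (SPEO), and the share
// partition leader caps how many records may be in flight at once.
class SharePartitionSketch {
    enum State { AVAILABLE, ACQUIRED, ACKNOWLEDGED, ARCHIVED }

    private final Map<Long, State> inFlight = new TreeMap<>();
    private final int maxInFlight;
    private long spso = 0;   // share partition start offset
    private long speo = 0;   // share partition end offset (next offset to hand out)

    SharePartitionSketch(int maxInFlight) { this.maxInFlight = maxInFlight; }

    // A consumer fetch acquires up to 'count' records, bounded by the in-flight cap.
    List<Long> acquire(int count) {
        List<Long> acquired = new ArrayList<>();
        while (acquired.size() < count && inFlight.size() < maxInFlight) {
            inFlight.put(speo, State.ACQUIRED);
            acquired.add(speo++);
        }
        return acquired;
    }

    // Successful processing: mark acknowledged, then advance the SPSO past any
    // leading records that are fully done (acknowledged or archived).
    void acknowledge(long offset) {
        inFlight.put(offset, State.ACKNOWLEDGED);
        Iterator<Map.Entry<Long, State>> it = inFlight.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Long, State> e = it.next();
            if (e.getKey() == spso
                    && (e.getValue() == State.ACKNOWLEDGED || e.getValue() == State.ARCHIVED)) {
                it.remove();
                spso++;
            } else {
                break;
            }
        }
    }

    long startOffset() { return spso; }
    int inFlightCount() { return inFlight.size(); }
}
```

Note how the SPSO only advances over completed records: records can be acknowledged out of order, which is exactly why cross-batch ordering is not guaranteed.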

Handling Bad Records and Offsets

  • Share groups can deal with bad records to prevent consumer groups from stalling:
    • Transient errors: The consumer can release the record so it becomes available for redelivery, or reject it so it is archived.
    • Deserialization errors: Kafka automatically releases the record and allows processing to continue.
    • Corrupt record batches: Kafka rejects the entire batch to prevent other consumers from picking it up.
  • Offsets can have gaps due to compaction or log retention, but consumers can handle these gracefully.
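The acknowledgement outcomes above can be sketched as a small per-record state machine. The three acknowledge types (accept/release/reject) mirror KIP-932; the class name and the fixed delivery limit are invented stand-ins for Kafka's configurable behavior.

```java
// Sketch: each in-flight record carries a delivery count. A released record
// becomes available again for redelivery; a rejected record, or one that
// exhausts its delivery attempts, is archived so it cannot stall the group.
class InFlightRecord {
    enum AckType { ACCEPT, RELEASE, REJECT }
    enum State { AVAILABLE, ACQUIRED, ACKNOWLEDGED, ARCHIVED }

    static final int MAX_DELIVERY_COUNT = 5; // assumed limit; Kafka's is configurable

    State state = State.AVAILABLE;
    int deliveryCount = 0;

    void acquire() {
        if (state != State.AVAILABLE) throw new IllegalStateException("record not available");
        state = State.ACQUIRED;
        deliveryCount++;
    }

    void acknowledge(AckType type) {
        switch (type) {
            case ACCEPT:
                state = State.ACKNOWLEDGED;               // processed successfully
                break;
            case REJECT:
                state = State.ARCHIVED;                   // permanent error: skip it
                break;
            case RELEASE:
                state = (deliveryCount >= MAX_DELIVERY_COUNT)
                        ? State.ARCHIVED                  // poison pill: give up
                        : State.AVAILABLE;                // transient error: retry
                break;
        }
    }
}
```

The delivery-count cap is what keeps a persistently failing record from being retried forever: after enough failed attempts it is archived and the share group moves on.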

Use Cases for Queues in Kafka

  1. Task Worker Processing: Decoupling the upload latency from heavy message processing, enabling horizontal scaling of worker pools.
  2. Microservice Job Queues: Handling asynchronous workloads like email sending, PDF generation, and billing without strict ordering requirements.
  3. Asynchronous Request Execution: Running operations asynchronously to keep the application responsive, such as fraud detection during checkout.
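The task-worker pattern in item 1 can be illustrated with a plain Java worker pool draining one shared queue, a stand-in for many share consumers reading the same partition (all names here are invented; this is the pattern, not Kafka code):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: N workers pull independently from one logical queue, so the worker
// pool scales horizontally without repartitioning the source.
class TaskWorkerDemo {
    // Returns how many tasks the pool processed in total.
    static int run(int tasks, int workers) {
        BlockingQueue<Integer> queue = new LinkedBlockingQueue<>();
        for (int i = 0; i < tasks; i++) queue.add(i);

        AtomicInteger processed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int w = 0; w < workers; w++) {
            pool.submit(() -> {
                Integer task;
                while ((task = queue.poll()) != null) {
                    processed.incrementAndGet();  // stand-in for heavy per-message work
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return processed.get();
    }
}
```

With classic consumer groups the parallelism ceiling is the partition count; share groups lift that ceiling, which is what makes this pattern practical on a single hot partition.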

Technical Implementation Details

  • Share groups introduce a new internal topic that durably stores share group state, such as acknowledgements and delivery counts.
  • The "share partition leader" is a new component that resides on the same broker as the topic partition leader, managing the share group state and record delivery.
  • The share partition leader handles record locking, automatic release upon timeout, and retries for failed records.
  • Broker and client-side metrics are available to monitor the health and performance of share groups.

Limitations and Roadmap

  • The share group capability is currently in preview in Kafka 4.1 and not recommended for production use yet.
  • Limitations include lack of dead-letter queue support, inability to monitor share group lag, and unsuitability for workloads requiring strict ordering and exactly-once processing.
  • Upcoming roadmap items include making the feature production-ready, adding dead-letter queue support, and improving monitoring and configuration options.

Key Takeaways

  • KIP-932 introduces "queues for Kafka," enabling parallel consumption and queue-like patterns within the Kafka ecosystem.
  • This expands Kafka's capabilities beyond its traditional streaming use cases, addressing common challenges around scaling, head-of-line blocking, and the need for queue-like semantics.
  • Share groups provide a Kafka-native way to handle spikes in ingress traffic, scale outbound consumption, and process asynchronous workloads more efficiently.
  • The feature is currently in preview, with plans to make it production-ready and address remaining limitations in future Kafka releases.