Apache Kafka is a powerful platform for creating distributed event streams - it allows you to read, write, store and process events at scale. However, the extent to which Kafka can do these things well depends on how you configure your Kafka cluster. Kafka has a complex architecture, and the efficiency of Kafka data streams can vary significantly based on how individual components of Kafka are set up, as well as how they integrate with each other.

To provide guidance on getting the best performance from Kafka, this article explains best practices for configuring one key Kafka component: Consumers. Below, you'll learn where Consumers fit within the Kafka architecture, how to set up a Consumer and which Consumer best practices can help to supercharge Kafka clusters.

What is Kafka? Kafka cluster components, explained

Before diving into best practices for working with Kafka Consumers, let's discuss the overall Kafka architecture and the main components of a Kafka cluster.

Kafka, as we've said, is a distributed event-streaming platform. Its purpose is to stream data in real time – or as close to real time as possible – so that applications and services that need to send and receive data can do so in an efficient, reliable, continuous and scalable way.

To enable streaming data, Kafka relies on several distinct components:

  • Producers: A Producer is an application or other resource that generates data. In other words, Producers are the source of Kafka data streams.
  • Brokers: A Broker is a server that runs within a Kafka cluster. Its job is to store Topic data, as well as to track the committed offsets of the Consumer Groups that read those Topics. Kafka's distributed architecture means that you can have multiple Brokers – which is a great thing, because the more Brokers you have, the more you can distribute data streaming tasks across them in order to optimize performance and reliability.
  • Topics: Topics are categories of messages. You can organize data streams into Topics to keep track of different types of messages. Every Topic is divided into Partitions, which can be spread across multiple Brokers.
  • Partitions: Partitions are sub-components of Kafka Topics. Partitions make it possible to divide Topics into smaller components, which can then be distributed across multiple Brokers that share in the work required to stream the messages associated with each Topic. Each Partition can be consumed by only a single Consumer within a Consumer Group, which means that the number of Partitions determines the maximum number of Consumers in a group that can read a Topic in parallel.
  • Consumers: Last but not least, a Consumer is an application that consumes messages associated with a Kafka Topic from a Broker. In other words, a Consumer is the component on the receiving end of a Kafka data stream. We'll see later in this article how configuring Consumers can have a dramatic impact on overall Kafka performance.

When you put these components together, you get a Kafka cluster that can stream one or several categories of messages using a distributed architecture.

Key Kafka concepts

In addition to understanding the basic Kafka cluster architecture, it's also useful to understand some key Kafka concepts – Consumer Groups, offsets, lag and replication factor – if you want to optimize the performance of your data stream.

Consumer Groups

A Consumer Group is a collection of Consumers registered under the same Consumer Group name – such as the Consumer Group shown in the diagram below.

[Diagram: Consumers organized into a Consumer Group. Source: confluent.io]

Offsets

Every Consumer Group contains an offset for each Partition. The offset represents the location within the Partition where the Partition's Consumer is currently reading data.

Offsets allow Consumers to resume data consumption at the same spot where they left off, whether a Consumer goes down and restarts or the number of registered Consumers in that Consumer Group changes.

So, offsets are important for ensuring continuity within data streams, as well as for preventing data loss or missed messages during Consumer downtime.

Lag

In Kafka, lag is the number of messages a Consumer still has to read before its offset for a Partition catches up to the end of that Partition. Obviously, you want this number to be as small as possible if you're trying to stream data in real time, but there will always be some amount of lag due to issues like network throughput limitations. That said, as we explain below, being smart about the way you manage Kafka Consumers can help to minimize lag.
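To make lag concrete, here's a minimal sketch, assuming Kafka's Java client and placeholder broker, Topic and Consumer Group names, that compares a Consumer's current position with the end offset of each Partition it has been assigned:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class LagCheck {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder Broker address
        props.put("group.id", "example-group");           // placeholder Consumer Group
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("example-topic"));
            consumer.poll(Duration.ofSeconds(1)); // join the group and receive Partition assignments

            // The end offset is the "latest" position in a Partition;
            // lag is how far behind it the Consumer's current position is.
            Map<TopicPartition, Long> endOffsets = consumer.endOffsets(consumer.assignment());
            for (Map.Entry<TopicPartition, Long> entry : endOffsets.entrySet()) {
                long lag = entry.getValue() - consumer.position(entry.getKey());
                System.out.printf("%s lag=%d%n", entry.getKey(), lag);
            }
        }
    }
}
```

In practice you would typically watch lag through monitoring (the Java Consumer exposes a records-lag-max metric, for example) rather than computing it by hand, but the arithmetic is the same.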

Replication factor

The replication factor is the number of copies of each Partition of a Topic that are kept across Brokers.

For example, if a Topic has a replication factor of 2, each of its Partitions is stored on 2 Brokers. If one of those Brokers were to go down, the Topic would still be available to Consumers because the other Broker would remain up.

The higher your replication factor, then, the greater the reliability of your Kafka data stream.
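The replication factor (along with the Partition count) is set when a Topic is created. As a minimal sketch, assuming Kafka's Java AdminClient, a placeholder Topic name and a cluster with at least 2 Brokers, creating a Topic with 6 Partitions and a replication factor of 2 looks like this:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder Broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // 6 Partitions, each copied to 2 Brokers (replication factor 2).
            NewTopic topic = new NewTopic("example-topic", 6, (short) 2);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```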

Creating a Kafka Consumer

We promised to focus on best practices for working with Kafka Consumers, so let's shift our focus to creating Consumers.

Here's an example of a basic Kafka Consumer – a minimal sketch using Kafka's Java client, with placeholder Broker address, Topic and Consumer Group names:
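```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class BasicConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // 1. The addresses of the Kafka Brokers.
        props.put("bootstrap.servers", "localhost:9092");
        // 3. The Consumer Group name, which is used to manage offsets.
        props.put("group.id", "example-group");
        // Auto-commit is left at its default (enabled), so offsets are committed
        // roughly every 5 seconds (auto.commit.interval.ms).
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // 2. The Topic that determines which messages the Consumer will consume.
            consumer.subscribe(Collections.singletonList("example-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d key=%s value=%s%n",
                            record.offset(), record.key(), record.value());
                }
            }
        }
    }
}
```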

This code does three main things:

  1. It sets the addresses of the Kafka brokers.
  2. It selects the Topic to determine which messages the Consumer will consume.
  3. It selects a Consumer Group name, which can be used to manage offsets.

Using this configuration, the Consumer will iterate over incoming messages, process them and, by default, commit the offset every few seconds.

Kafka Consumer best practices

Kafka's distributed architecture provides some inherent advantages when it comes to data streaming and processing relative to centralized architectures, which are more prone to single points of failure.

But to get the very best performance from Kafka, you should adhere to core best practices – like those described below – when it comes to configuring components like Consumers, Topics and Partitions.

#1. Choose the right number of Partitions

Given that each Partition can be consumed by only a single Consumer within a Consumer Group, selecting the right number of Partitions is a critical consideration. The total number of Partitions limits the total number of Consumers that can read each Topic in parallel. In general, the more Partitions you have, the better network throughput there will be, because you can have more Consumers accessing data from more Brokers, which leads to more efficient use of the network.

On the other hand, if you create too many Partitions and too many Consumers, you may run into higher lag due to increases in the time required for Brokers to handle requests from each Consumer.

The general takeaway here is that, while more Partitions is generally a good thing, you need to monitor Kafka metrics like network throughput and lag carefully to ensure that you strike the right balance. There is such a thing as too many Partitions, but it's hard to know when you have too many if you don't monitor your cluster.

In addition, choosing a total Partition count that is divisible by both 2 and 3 (for example, 6, 12 or 24) is a best practice, because it helps to ensure that each Consumer receives an equal number of Partitions for common Consumer counts, which encourages efficient consumption.

#2. Maintain Consumer count consistency

In addition to selecting the right number of total Consumers, it's generally advantageous to try to keep your total number of Consumers as consistent as possible.

The reason why is that whenever a new Consumer is registered in a Consumer Group, the group has to rebalance: the Brokers temporarily stop delivering messages while Partitions are redistributed among the Consumers. This is another way that you can end up with high lag.

Obviously, if the applications on the receiving end of your data stream change in number, scaling Consumer count up or down may be unavoidable. But if you don't need to scale, keep your Consumer count constant. And when scaling is necessary, consider using "stiff" autoscaling rules in order to minimize the rate at which Consumers are rebalanced.
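If you do use the Java client, one setting that can help here – treat it as an optional aside rather than something the advice above requires – is static group membership: giving each Consumer instance a stable group.instance.id means that a quick restart within session.timeout.ms keeps its Partition assignment instead of triggering a full rebalance. A minimal sketch, with placeholder names:

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class StableMemberConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder Broker address
        props.put("group.id", "example-group");            // placeholder Consumer Group
        // A stable, per-instance ID makes this Consumer a "static" group member:
        // if it restarts within session.timeout.ms, the Brokers keep its Partition
        // assignment instead of rebalancing the whole group.
        props.put("group.instance.id", "consumer-1");
        props.put("session.timeout.ms", "45000");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        // ... subscribe and poll as in the basic example above ...
        consumer.close();
    }
}
```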

#3. Use a replication factor of at least 2

Kafka will let you set a replication factor of 1. But as we explained above, having a replication factor of at least 2 is a best practice because it ensures that if one Broker goes down, you'll have at least one other Broker still running to keep your data stream operating.

Replication factors of 1 only make sense if you simply can't afford redundant Brokers, or if you're dealing with a Topic that is not mission-critical.

#4. Enable idempotence

Idempotence refers to the property of certain operations (an upsert into a database keyed on a unique ID, for example) of having the same effect no matter how many times they are performed.

Kafka can deliver the same message more than once, and in some cases you can use that behavior to your advantage by rewinding messages to your Consumer after a message processing failure. As long as processing those messages is idempotent, reprocessing the duplicates does no harm.
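As a rough sketch of what this can look like on the Consumer side – placeholder names throughout, and upsertIntoDatabase is a hypothetical stand-in for your own idempotent write – commit offsets only after processing succeeds, and rewind to the last committed offset when processing fails, so that redelivered messages are simply applied again with the same effect:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class IdempotentProcessingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder Broker address
        props.put("group.id", "example-group");           // placeholder Consumer Group
        props.put("enable.auto.commit", "false");         // commit only after successful processing
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("example-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                try {
                    for (ConsumerRecord<String, String> record : records) {
                        // An idempotent write: applying the same record twice has the same effect.
                        upsertIntoDatabase(record.key(), record.value());
                    }
                    consumer.commitSync(); // mark messages as consumed only once processing succeeded
                } catch (Exception processingFailure) {
                    // Rewind to the last committed offset so the failed batch is redelivered.
                    Map<TopicPartition, OffsetAndMetadata> committed =
                            consumer.committed(records.partitions());
                    for (TopicPartition tp : records.partitions()) {
                        OffsetAndMetadata offset = committed.get(tp);
                        if (offset != null) {
                            consumer.seek(tp, offset.offset());
                        } else {
                            consumer.seekToBeginning(Collections.singleton(tp));
                        }
                    }
                }
            }
        }
    }

    // Hypothetical stand-in for an idempotent write keyed on the record key.
    private static void upsertIntoDatabase(String key, String value) {
        /* e.g., INSERT ... ON CONFLICT (key) DO UPDATE ... */
    }
}
```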

Get more from your Kafka cluster

The bottom line is that Kafka is a powerful and flexible platform for streaming data. But like most powerful and flexible platforms, the exact value that Kafka delivers depends heavily on how well it's configured. To get the best performance and reliability out of Kafka, it's important to understand the nuances surrounding components like Consumers, Topics and Partitions, and to configure them in ways that prime your cluster for minimal lag, maximum network throughput and as much reliability as you can reasonably afford.
