Question

Scenario: I have a low-volume topic (~150 msgs/sec) for which I would like a low propagation delay from producer to consumer.

I added a timestamp at the producer and read it at the consumer to record the propagation delay. With default configurations, a 20-byte message showed a propagation delay of 1230ms to 1960ms. No network delay is involved, since I tried this with one producer and one simple consumer on the same machine.

When I adjusted the topic flush interval to 20ms, the delay dropped to 980ms-1100ms. I then adjusted the consumer's fetcher.backoff.ms to 10ms, which brought it down to 860ms-1070ms.

Issue: For a 20-byte message, I would like the propagation delay to be as low as possible, and ~950ms is too high.

Question: Is there anything I am missing in the configuration? Comments are welcome, along with the minimum delay you have achieved.

Assumption: Kafka involves disk I/O before the consumer gets the message from the producer, so the delay depends on hard disk RPM and so on.


Update: I tried tuning the Log Flush Policy for durability and latency.
The configuration was as follows:

# The number of messages to accept before forcing a flush of data to disk
log.flush.interval=10
# The maximum amount of time a message can sit in a log before we force a flush
log.default.flush.interval.ms=100
# The interval (in ms) at which logs are checked to see if they need to be 
# flushed to disk.
log.default.flush.scheduler.interval.ms=100

For the same 20-byte message, the delay was 740ms to 880ms.

The configuration file itself spells out the important trade-offs:

  1. Durability: Unflushed data is at greater risk of loss in the event of a crash.
  2. Latency: Data is not made available to consumers until it is flushed (which adds latency).
  3. Throughput: The flush is generally the most expensive operation.

So I believe there is no way to get down to the 150ms-250ms mark without a hardware upgrade.
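
For reference, these flush settings were renamed in later broker versions; a sketch of the equivalent configuration using current property names (same illustrative values as above):

# The number of messages to accept before forcing a flush of data to disk
log.flush.interval.messages=10
# The maximum amount of time a message can sit in a log before we force a flush
log.flush.interval.ms=100
# The interval (in ms) at which logs are checked to see if they need to be flushed to disk
log.flush.scheduler.interval.ms=100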


Solution

I am not trying to dodge the question, but I think Kafka is a poor choice for this use case. While I think Kafka is great (I have been a huge proponent of its use at my workplace), its strength is not low latency. Its strengths are high producer throughput and support for both fast and slow consumers. While it does provide durability and fault tolerance, so do more general-purpose systems like RabbitMQ. RabbitMQ also supports a variety of clients, including Node.js. Where RabbitMQ falls short compared to Kafka is when you are dealing with extremely high volumes (say 150K msg/s). At that point, Rabbit's approach to durability starts to fall apart, and Kafka really stands out. Rabbit's durability and fault-tolerance capabilities are more than adequate at 20K msg/s (in my experience).

Also, to achieve such high throughput, Kafka deals with messages in batches. While the batches are small and their size is configurable, you can't make them too small without incurring a lot of overhead. Unfortunately, message batching makes low latency very difficult. While you can tune various settings in Kafka, I wouldn't use it for anything where latency needed to be consistently less than 1-2 seconds.

Also, Kafka 0.7.2 is not a good choice if you are launching a new application. All of the focus is on 0.8 now, so you will be on your own if you run into problems, and I definitely wouldn't expect any new features. For future stable releases, see the Kafka downloads page.

Again, I think Kafka is great for some very specific, though popular, use cases. At my workplace we use both Rabbit and Kafka. While that may seem gratuitous, they really are complementary.

OTHER TIPS

I know it's been over a year since this question was asked, but I've just built up a Kafka cluster for dev purposes, and we're seeing <1ms latency from producer to consumer. My cluster consists of three VM nodes running on a cloud VM service (Skytap) with SAN storage, so it's far from ideal hardware. I'm using Kafka 0.9.0.0, which is new enough that I'm confident the asker was using something older. I have no experience with older versions, so you might get this performance increase simply from an upgrade.

I'm measuring latency by running a Java producer and consumer I wrote. Both run on the same machine, on a fourth VM in the same Skytap environment (to minimize network latency). The producer records the current time (System.nanoTime()), uses that value as the payload in an Avro message, and sends it (acks=1). The consumer is configured to poll continuously with a 1ms timeout. When it receives a batch of messages, it records the current time (System.nanoTime() again), then subtracts the send time from the receive time to compute latency. When it has 100 messages, it computes the average of all 100 latencies and prints it to stdout. Note that it's important to run the producer and consumer on the same machine so that there is no clock-sync issue in the latency computation.
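
Here is a minimal sketch of that measurement loop, assuming a current Java client (the original used the 0.9 API and an Avro payload; this sketch sends the System.nanoTime() value as a Long instead, and the broker address and topic name are placeholders):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LatencyProbe {
    public static void main(String[] args) {
        String broker = "localhost:9092";  // placeholder broker address
        String topic = "latency-test";     // placeholder topic name

        Properties prod = new Properties();
        prod.put("bootstrap.servers", broker);
        prod.put("acks", "1");             // as in the test described above
        prod.put("key.serializer", "org.apache.kafka.common.serialization.LongSerializer");
        prod.put("value.serializer", "org.apache.kafka.common.serialization.LongSerializer");

        Properties cons = new Properties();
        cons.put("bootstrap.servers", broker);
        cons.put("group.id", "latency-probe");
        cons.put("auto.offset.reset", "latest");
        cons.put("key.deserializer", "org.apache.kafka.common.serialization.LongDeserializer");
        cons.put("value.deserializer", "org.apache.kafka.common.serialization.LongDeserializer");

        try (KafkaProducer<Long, Long> producer = new KafkaProducer<>(prod);
             KafkaConsumer<Long, Long> consumer = new KafkaConsumer<>(cons)) {
            consumer.subscribe(Collections.singletonList(topic));
            long sumNanos = 0;
            int received = 0;
            long nextSend = System.nanoTime();
            while (received < 100) {
                if (System.nanoTime() >= nextSend) {
                    // Payload is the send timestamp; pace at ~150 msgs/sec.
                    producer.send(new ProducerRecord<>(topic, System.nanoTime()));
                    nextSend += 1_000_000_000L / 150;
                }
                // Poll continuously with a 1 ms timeout, as described above.
                ConsumerRecords<Long, Long> records = consumer.poll(Duration.ofMillis(1));
                long now = System.nanoTime();
                for (ConsumerRecord<Long, Long> r : records) {
                    sumNanos += now - r.value();  // receive time minus send time
                    received++;
                }
            }
            System.out.printf("average latency over %d messages: %.3f ms%n",
                    received, sumNanos / (received * 1_000_000.0));
        }
    }
}

Keeping both ends in a single JVM also sidesteps the clock-sync issue entirely, since System.nanoTime() values are only guaranteed to be comparable within the same process.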

I've played quite a bit with the volume of messages generated by the producer. There is definitely a point where there are too many and latency starts to increase, but it's substantially higher than 150/sec. The occasional message takes as much as 20ms to deliver, but the vast majority are between 0.5ms and 1.5ms.

All of this was accomplished with Kafka 0.9's default configurations. I didn't have to do any tweaking. I used batch-size=1 for my initial tests, but I found later that it had no effect at low volume and imposed a significant limit on the peak volume before latencies started to increase.

It's important to note that when I run my producer and consumer on my local machine, the exact same setup reports message latencies in the 100ms range -- the exact same latencies reported if I simply ping my Kafka brokers.

I'll edit this message later with sample code from my producer and consumer along with other details, but I wanted to post something before I forget.

EDIT, four years later:

I just got an upvote on this, which led me to come back and re-read. Unfortunately (but actually fortunately), I no longer work for that company, and no longer have access to the code I promised I'd share. Kafka has also matured several versions since 0.9.

Another thing I've learned in the ensuing time is that Kafka latencies increase when there is not much traffic. This is due to the way the clients use batching and threading to aggregate messages. It's very fast when you have a continuous stream of messages, but any time there is a moment of "silence", the next message will have to pay the cost to get the stream moving again.

It's been some years since I was deep in Kafka tuning. Looking at the latest version (2.5) and its producer configuration docs, I can see that they've decreased linger.ms (the amount of time a producer will wait before sending a message, in hopes of batching up more than just the one) to zero by default, meaning that the aforementioned cost of getting moving again should no longer be a thing. As I recall, in 0.9 it did not default to zero, and there was some trade-off to setting it to such a low value. I'd presume that the producer code has been modified to eliminate, or at least minimize, that trade-off.
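
As a sketch, the latency-relevant producer settings on a current client look like this (the property names are real producer configs; the values are illustrative):

# Default in recent clients; send as soon as the sender thread can.
linger.ms=0
# Batching still happens for records that arrive while a send is in flight.
batch.size=16384
# acks=1 gives lower latency than acks=all, at some durability cost.
acks=1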

Modern versions of Kafka seem to have pretty minimal latency, as published benchmark results show:

2 ms (median), 3 ms (99th percentile), 14 ms (99.9th percentile)

Kafka can achieve latency of around a millisecond by using synchronous messaging. With synchronous messaging, the producer does not collect messages into a batch before sending.

bin/kafka-console-producer.sh --broker-list my_broker_host:9092 --topic test --sync

The following has the same effect:

--batch-size 1 

If you are using librdkafka as the Kafka client library, you must also set socket.nagle.disable=true.
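
Put together, a minimal librdkafka configuration for low latency might look like this (both property names are real librdkafka settings; queue.buffering.max.ms is its equivalent of linger.ms):

# Disable Nagle's algorithm on broker sockets.
socket.nagle.disable=true
# How long to buffer messages before sending (0 = send immediately).
queue.buffering.max.ms=0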

See https://aivarsk.com/2021/11/01/low-latency-kafka-producers/ for some ideas on how to see what is taking those milliseconds.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow