Question

According to the Kafka site:

"Kakfa is used for building real-time data pipelines and streaming apps."

Searching the internet far and wide, I've found the following generally-accepted definition of what "stream data" is:

  • Stream data is data that flows contiguously from a source to a destination over a network; and
  • Stream data is not atomic in nature, meaning any part of a flowing stream of data is meaningful and processable, as opposed to a file whose bytes don't mean anything unless you have all of them; and
  • Stream data can be started/stopped at any time; and
  • Consumers can attach and detach from a stream of data at will, and process just the parts of it that they want

Now then, if anything I've said above is incorrect, incomplete or totally wrong, please begin by correcting me! Assuming I'm more or less on track, then...

Now that I understand what "streaming data" is, then I understand what Kafka and Kinesis mean when they bill themselves as processing/brokering middleware for applications with streaming data. But it has piqued my interest: can/should "stream middleware" like Kafka or Kinesis be used for non-streaming data, like traditional message brokers? And vice versa: can/should traditional MQs like RabbitMQ, ActiveMQ, Apollo, etc. be used for streaming data?

Let's take an example where an application will be sending its backend a constant barrage of JSON messages that need to be processed, and the processing is fairly complex (validation, transforms on the data, filtering, aggregations, etc.):

  • Case #1: The messages are each frames of a movie; that is, one JSON message per video frame containing the frame data and some supporting metadata
  • Case #2: The messages are time-series data, perhaps someone's heartbeat as a function of time. So Message #1 is sent representing my heartbeat at t=1, Message #2 contains my heartbeat at t=2, etc.
  • Case #3: The data is completely disparate and unrelated by time or as part of any "data stream". Perhaps audit/security events that get fired as hundreds of users navigate the application clicking buttons and taking actions

Based on how Kafka/Kinesis are billed and on my understanding of what "streaming data" is, they seem to be obvious candidates for Cases #1 (contiguous video data) and #2 (contiguous time-series data). However I don't see any reason why a traditional message broker like RabbitMQ couldn't efficiently handle both these inputs as well.

And with Case #3, we're only provided with an event that has occurred and we need to process a reaction to that event. So to me this speaks to needing a traditional broker like RabbitMQ. But there's also no reason why you couldn't have Kafka or Kinesis handle the processing of event data either.

So basically, I'm looking to establish a rubric that says: I have X data with Y characteristics. I should use a stream processor like Kafka/Kinesis to handle it. Or, conversely, one that helps me determine: I have W data with Z characteristics. I should use a traditional message broker to handle it.

So I ask: What factors about the data (or otherwise) help steer the decision between stream processor or message broker, since both can handle streaming data, and both can handle (non-streaming) message data?


Solution

Kafka deals in ordered logs of atomic messages. You can view it sort of like the pub/sub mode of message brokers, but with strict ordering and the ability to replay or seek around the stream of messages at any point in the past that's still being retained on disk (which could be forever).
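
To make "replay or seek around the stream" concrete, here is a minimal consumer sketch using the Java kafka-clients library (the broker address, topic name, and group id are placeholder assumptions): the consumer rewinds to the oldest retained offset and re-reads everything from there.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class ReplayFromBeginning {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker
        props.put("group.id", "replay-demo");                // placeholder group id
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events"));           // placeholder topic
            consumer.poll(Duration.ofSeconds(1));            // join the group and receive partition assignments
            consumer.seekToBeginning(consumer.assignment()); // rewind to the oldest offset still retained on disk

            while (true) {                                   // re-read the whole retained log, then keep tailing it
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d key=%s value=%s%n",
                            record.offset(), record.key(), record.value());
                }
            }
        }
    }
}
```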

Kafka's flavor of streaming stands opposed to remote procedure calls like Thrift or HTTP, and to batch processing like in the Hadoop ecosystem. Unlike RPC, components communicate asynchronously: hours or days may pass between when a message is sent and when the recipient wakes up and acts on it. There could be many recipients at different points in time, or maybe no one will ever bother to consume a message. Multiple producers could produce to the same topic without knowledge of the consumers. Kafka does not know whether you are subscribed, or whether a message has been consumed. A message is simply committed to the log, where any interested party can read it.
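
On the producer side the decoupling looks like this; a minimal sketch, again with the Java kafka-clients library and placeholder broker/topic names, in which the producer simply appends to the log and never learns who, if anyone, reads it:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class AppendToLog {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker
        props.put("acks", "all");                            // wait until the append is fully replicated
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Commit a message to the log. Consumers may read it now, in a week, or never;
            // the producer has no way of knowing and does not need to.
            producer.send(new ProducerRecord<>("audit-events", "user-42", "{\"action\":\"click\"}"));
            producer.flush();
        }
    }
}
```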

Unlike batch processing, you're interested in single messages, not just giant collections of messages. (Though it's not uncommon to archive Kafka messages into Parquet files on HDFS and query them as Hive tables).

Case 1: Kafka does not preserve any particular temporal relationship between producer and consumer. It's a poor fit for streaming video because Kafka is allowed to slow down, speed up, move in fits and starts, etc. For streaming media, we want to trade away overall throughput in exchange for low and, more importantly, stable latency (otherwise known as low jitter). Kafka also takes great pains to never lose a message. With streaming video, we typically use UDP and are content to drop a frame here and there to keep the video running. The SLA on a Kafka-backed process is typically seconds to minutes when healthy, hours to days when unhealthy. The SLA on streaming media is in tens of milliseconds.

Netflix could use Kafka to move frames around in an internal system that transcodes terabytes of video per hour and saves it to disk, but not to ship them to your screen.

Case 2: Absolutely. We use Kafka this way at my employer.

Case 3: You can use Kafka for this kind of thing, and we do, but you are paying some unnecessary overhead to preserve ordering. Since you don't care about order, you could probably squeeze some more performance out of another system. If your company already maintains a Kafka cluster, though, probably best to reuse it rather than take on the maintenance burden of another messaging system.

OTHER TIPS

Kafka and Kinesis are modelled as streams. A stream has different properties than messages.

  • Streams have context to them. They have order. You can apply window functions on streams (see the windowed-count sketch after this list). Although each item in a stream is meaningful, it may be more meaningful with the context around it
  • Because streams have order, you can use that to make certain statements about the semantics of processing. E.g. Apache Trident supposedly has exactly-once semantics when consuming from a Kafka stream.
  • You can apply functions to streams. You can transform a stream without actually consuming it. You can lazily consume a stream. You can skip parts of a stream.
  • You can inherently replay streams in Kafka, but you can't (without additional software) replay message queues. This is useful when you don't even know what you want to do with the data yet. It's also useful for training AI.
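
As an illustration of a window function over a keyed stream, here is a rough sketch using the Kafka Streams library (not mentioned above, but the same idea as the bullets): it counts heartbeat readings per source over tumbling one-minute windows. The topic names, application id, and broker address are assumptions.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;

import java.time.Duration;
import java.util.Properties;

public class HeartbeatWindows {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "heartbeat-windows"); // placeholder app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> heartbeats =
                builder.stream("heartbeats", Consumed.with(Serdes.String(), Serdes.String())); // keyed by source id

        heartbeats
                .groupByKey()                                                     // context: group readings by source
                .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1))) // tumbling one-minute windows
                .count()                                                          // aggregate within each window
                .toStream((window, count) ->                                      // flatten the windowed key for output
                        window.key() + "@" + window.window().start())
                .to("heartbeats-per-minute", Produced.with(Serdes.String(), Serdes.Long()));

        new KafkaStreams(builder.build(), props).start();
    }
}
```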

Generally, use Kafka for offline stream processing, use message queues for real-time client-server messages.

Example use cases from Pivotal:

Kafka: Website Activity Tracking, Metrics, Log Aggregation, Stream Processing, Event Sourcing and Commit logs

RabbitMQ: general purpose messaging..., often used to allow web servers to respond to requests quickly instead of being forced to perform resource-heavy procedures while the user waits for the result. Use when you need to use existing protocols like AMQP 0-9-1, STOMP, MQTT, AMQP 1.0

It may sometimes be useful to use both! For example, in Use Case #2, if this were a stream of data from a pacemaker, say, I would have the pacemaker transmit heartbeat data to a RabbitMQ message queue (using a cool protocol like MQTT) where it is immediately processed to see if the source's heart is still beating. This could power a dashboard and an emergency response system. The message queue would also deposit the time-series data into Kafka so that we could analyse the heartbeat data over time. For example, we might implement an algorithm to detect heart disease by noticing trends in the heartbeat stream.
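
A hedged sketch of that hybrid in Java (com.rabbitmq:amqp-client plus kafka-clients): each heartbeat message is handled immediately off the RabbitMQ queue for alerting, then appended to a Kafka topic for later analysis. The queue/topic names, hosts, and the parsing/alerting helpers are all hypothetical placeholders.

```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DeliverCallback;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class HeartbeatBridge {
    public static void main(String[] args) throws Exception {
        // Kafka producer: durable, replayable log of every reading.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);

        // RabbitMQ consumer: low-latency path feeding the dashboard/alerts.
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");                        // placeholder host
        Connection connection = factory.newConnection();
        Channel channel = connection.createChannel();
        channel.queueDeclare("heartbeats", true, false, false, null); // placeholder queue

        DeliverCallback onMessage = (consumerTag, delivery) -> {
            String json = new String(delivery.getBody(), StandardCharsets.UTF_8);

            // 1. Immediate reaction: raise an alarm if the reading looks bad.
            if (looksFlatlined(json)) {
                System.err.println("ALERT: " + json);
            }

            // 2. Archive: append the reading to Kafka, keyed by patient id, for trend analysis.
            producer.send(new ProducerRecord<>("heartbeat-history", patientId(json), json));
        };
        channel.basicConsume("heartbeats", true, onMessage, consumerTag -> { });
    }

    // Hypothetical helpers; real JSON parsing and alerting logic would go here.
    private static boolean looksFlatlined(String json) { return json.contains("\"bpm\":0"); }
    private static String patientId(String json)       { return "patient-1"; }
}
```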

Licensed under: CC-BY-SA with attribution