Question

In my project we are trying to build a kind-of-a-lambda, Storm-based architecture. The component would be responsible for indexing site usage events, so we expect quite a massive, random load. The solution for real-time processing of the messages seems fine, but in parallel to the "speed layer" we want to back up the messages in raw form (just as they come down from the queue) in Amazon S3. Since writing a file per message is obviously out of the question, we need to somehow buffer/aggregate the messages before posting them to S3, and this is where the problems begin. We have two competing approaches and neither seems perfect:

  1. We used Redis as a buffer. Basically, messages coming down from a queue (RabbitMQ) are buffered in Redis, and once some preconfigured batch size (say 1000) is reached, the buffer is flushed and stored to S3. The whole message processing cycle is not transactional: once a message is stored in Redis, it is acknowledged in the queue. This means that if Redis dies, the whole batch gets lost, and this is not acceptable. We could set up a Redis cluster to make the thing more bulletproof, but... doesn't it seem like going in the wrong direction? (See the sketch after this list.)
  2. The second approach would be to use a Trident topology. Changing the input queue to Kafka makes things look more or less straightforward, and the message processing cycle can be transactional. HOWEVER, there is this one annoying keyword that keeps being repeated in the Trident documentation - small batches - Trident assumes small batches. For example, the TransactionalTridentKafkaSpout batches the messages into 2-second chunks. This is much too small for us. I'm not sure why the batches should be small, but if that is really required, maybe the first approach is better?
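For reference, here is a minimal sketch of the first approach, assuming the `pika`, `redis`, and `boto3` Python clients; the queue name, Redis key, and S3 bucket are hypothetical placeholders, not part of the original setup. It also shows exactly where the non-transactional gap sits: the message is acknowledged as soon as it reaches Redis, before it ever hits S3.

```python
import uuid

import boto3   # S3 client
import pika    # RabbitMQ client
import redis   # buffer store

BATCH_SIZE = 1000
BUFFER_KEY = "raw-events"        # hypothetical Redis list key
BUCKET = "my-raw-events-bucket"  # hypothetical S3 bucket

r = redis.Redis()
s3 = boto3.client("s3")

def flush_buffer():
    """Pop the whole batch from Redis and store it as a single S3 object."""
    pipe = r.pipeline()
    pipe.lrange(BUFFER_KEY, 0, -1)
    pipe.delete(BUFFER_KEY)
    messages, _ = pipe.execute()
    if messages:
        s3.put_object(
            Bucket=BUCKET,
            Key=f"raw/batch-{uuid.uuid4()}.jsonl",
            Body=b"\n".join(messages),
        )

def on_message(channel, method, properties, body):
    # The message is acked as soon as it lands in Redis -- this is the
    # non-transactional gap: if Redis dies, the buffered batch is lost.
    size = r.rpush(BUFFER_KEY, body)
    channel.basic_ack(delivery_tag=method.delivery_tag)
    if size >= BATCH_SIZE:
        flush_buffer()

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="site-events", durable=True)
channel.basic_consume(queue="site-events", on_message_callback=on_message)
channel.start_consuming()
```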

Which one of the above is better? Or maybe someone can come up with a "third" idea?


Solution

My confusion was actually caused by not understanding the RabbitMQ publish/subscribe concept well enough. The solution is to create two queues in RabbitMQ (bound to one exchange). One queue is for the real-time processing, the second one for the batch processing. The batch layer should not be Storm at all; instead it should be some scheduled process that just wakes up periodically and collects all the messages that got queued in the meantime.
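A minimal sketch of that setup, assuming a fanout exchange and the `pika` and `boto3` clients; the exchange, queue, and bucket names are hypothetical. The fanout exchange delivers every message to both queues; the batch job below drains its own queue on a schedule and uploads whatever it has collected to S3, acknowledging only after the upload succeeds.

```python
import uuid

import boto3
import pika

EXCHANGE = "site-events"            # hypothetical fanout exchange
REALTIME_QUEUE = "events-realtime"  # consumed by the speed layer (Storm)
BATCH_QUEUE = "events-batch"        # drained periodically by this job
BUCKET = "my-raw-events-bucket"     # hypothetical S3 bucket

def setup(channel):
    """One fanout exchange feeds both the real-time and the batch queue."""
    channel.exchange_declare(exchange=EXCHANGE, exchange_type="fanout", durable=True)
    for queue in (REALTIME_QUEUE, BATCH_QUEUE):
        channel.queue_declare(queue=queue, durable=True)
        channel.queue_bind(exchange=EXCHANGE, queue=queue)

def drain_batch_queue():
    """Scheduled job (e.g. cron): collect everything queued since the last run."""
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    setup(channel)

    messages = []
    last_delivery_tag = None
    while True:
        method, properties, body = channel.basic_get(queue=BATCH_QUEUE)
        if method is None:  # queue is empty, stop collecting
            break
        messages.append(body)
        last_delivery_tag = method.delivery_tag

    if messages:
        boto3.client("s3").put_object(
            Bucket=BUCKET,
            Key=f"raw/batch-{uuid.uuid4()}.jsonl",
            Body=b"\n".join(messages),
        )
        # Ack everything up to the last delivery only after the S3 write succeeds,
        # so an unacked batch is redelivered on the next run.
        channel.basic_ack(delivery_tag=last_delivery_tag, multiple=True)

    connection.close()

if __name__ == "__main__":
    drain_batch_queue()
```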

Licensed under: CC-BY-SA with attribution