Question

I am designing a prototype realtime monitor for processing fairly large amounts (>30G/day) of streaming numeric data. I would like to write this in Clojure, as the language seems to be well suited to the kind of "Observer + state machine" system that this will probably end up as.

The two main candidates I have found for a framework are Lamina and Storm. There is also Riemann and Pulse, but the former seems to be more of a full solution rather than a framework, and I'd rather not commit to a final design yet; Pulse's repo looks a little unmaintained?

What I would like to know is; what kinds of data- and work flow are these two projects optimised for? Storm seems to be more mature, but Lamina seems more composable and "Clojureic" (my background is Python, so I tend to rate this highly).

What I've found from reading online:

  • Storm seems to be Big Data(stream) focussed, the core is straight Java with a Clojure DSL. It appears to have pre=built handlers for a number of existing data sources.

  • Lamina is more a lightweight, reusable component that does the Clojure thing of coding to abstractions, meaning it can be reused as a base for other eventing systems. The data sources need to be handled in code.

  • Both have a useful set of aggregation/splitting/computation library functions out of the box. Lamina's graphviz integration is a nice touch.

Was it helpful?

Solution 2

Storm incorporates cluster management and handling of failed nodes in the flow because it was designed to be sort of "like Hadoop but for streaming", which from what I understand of your requirements seems to be closer to your use case.

OTHER TIPS

Storm probably isn't a bad choice, but "over 30GB per day" of numeric data isn't big data, it is tiny data. Any semi-modern computer can handle that much data easily on one node with lamina. You might want to go with Storm anyway, so that once you do get into a realm where you need more servers you can scale easily, but I imagine there's some initial friction to getting Storm set up (and some ongoing friction in maintaining the cluster), which will be wasted if you never have to scale up.

Lamina seems like an okay choice, but it appears to be totally lacking the killer feature of Storm--cluster computing management. A Storm cluster will take care of most of the dirty work of distributing your computation across a cluster of nodes, leaving you to just focus on your business logic so long as you fit it within the Storm framework. Lamina, from what I can see, provides a nice way to organize your computation, but then you'll have to take care of all the details of scaling that out if that's something you need.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top