Question

We are working on a platform that models the flow of entities across a graph. The system has to answer questions such as: how many entities with a given set of properties are sitting at a given node, what is the inflow into a node, what is the outflow from a node, and so on. Flow data is fed to the system as a stream. We are thinking of breaking the flow data into time buckets (say, 5 minutes), pre-computing various aggregates against different properties, and storing the aggregates in DynamoDB to serve queries.
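The bucketing-and-aggregation idea described above can be sketched as follows. This is a minimal illustration, not the asker's actual system; the event fields (`ts`, `node`, `prop`) and the counting scheme are assumptions.

```python
from collections import defaultdict

BUCKET_SECONDS = 5 * 60  # 5-minute buckets, as in the question

def bucket_of(ts: int) -> int:
    """Map an epoch timestamp to the start of its 5-minute bucket."""
    return ts - ts % BUCKET_SECONDS

# (bucket_start, node, property) -> entity count
aggregates = defaultdict(int)

def ingest(event: dict) -> None:
    """Fold one flow event into the pre-computed aggregates."""
    key = (bucket_of(event["ts"]), event["node"], event["prop"])
    aggregates[key] += 1

# Hypothetical stream of flow events
for e in [
    {"ts": 1000, "node": "A", "prop": "red"},
    {"ts": 1100, "node": "A", "prop": "red"},
    {"ts": 1400, "node": "B", "prop": "blue"},
]:
    ingest(e)

# Each aggregate row would then be written to DynamoDB,
# keyed on (bucket_start, node, property), to serve queries.
```

A query like "how many `red` entities were at node `A` in a given 5-minute window" then becomes a single key lookup against the pre-computed table.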

With regards to this we are evaluating the following options:

  • EMR: put the flow data into AWS S3/DynamoDB and run a MapReduce/Hive job over it.

  • RDS: put recent data into AWS RDS and compute the aggregates via SQL.

  • Akka: a framework for building distributed applications via actors and message passing.

    If anyone has worked on a similar use case or has used any of the above technologies, please let me know which approach would best fit our use case.
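For the RDS option in particular, the aggregates fall out of a plain `GROUP BY`. Here is a small sketch using an in-memory SQLite database as a stand-in for RDS; the table layout and column names are hypothetical.

```python
import sqlite3

# In-memory stand-in for the RDS option: recent flow events in a SQL
# table, aggregates computed with plain SQL.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE flow (
        ts   INTEGER,  -- epoch seconds
        node TEXT,     -- graph node the entity entered or left
        dir  TEXT,     -- 'in' or 'out'
        prop TEXT      -- entity property used for filtering
    )
""")
conn.executemany(
    "INSERT INTO flow VALUES (?, ?, ?, ?)",
    [
        (1000, "A", "in",  "red"),
        (1100, "A", "in",  "red"),
        (1150, "A", "out", "red"),
        (1400, "B", "in",  "blue"),
    ],
)

# Inflow/outflow per node per 5-minute (300 s) bucket.
rows = conn.execute("""
    SELECT (ts / 300) * 300 AS bucket,
           node,
           SUM(dir = 'in')  AS inflow,
           SUM(dir = 'out') AS outflow
    FROM flow
    GROUP BY bucket, node
    ORDER BY bucket, node
""").fetchall()
```

The same query shape works on any RDS engine (boolean-sum syntax varies; e.g. on PostgreSQL you would write `SUM(CASE WHEN dir = 'in' THEN 1 ELSE 0 END)`).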


Solution 2

The final solution employed AWS Redshift; the driving reason was the requirement for high-speed data ingestion, which Redshift provides via the COPY command.
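For reference, a Redshift COPY bulk-loads files from S3 in parallel rather than inserting row by row. A minimal sketch, in which the table name, bucket path, and IAM role ARN are all placeholders:

```sql
COPY flow_events
FROM 's3://my-bucket/flow/2014-05-01/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
GZIP
DELIMITER '|';
```

The stream would be staged as delimited files in S3 (e.g. one batch per time bucket) and COPYed in on a schedule.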

Hadoop is built to store data efficiently; however, it does not guarantee a sub-second SLA for ingestion, nor does it provide an SLA for when the data will be available to MapReduce jobs. This was the main reason we did not go with EMR, or Hadoop in general.

OTHER TIPS

I have used EMR to process data in S3... it works pretty well. And the best part is that you can spin up Hadoop clusters of various sizes to fit the workload.

You may want to look into Storm for stream processing.

I am also collecting a list of big-data tools here: http://hadoopilluminated.com/hadoop_book/Bigdata_Ecosystem.html

Licensed under: CC-BY-SA with attribution