Question

I have the following scenario:

  • Around 70 million devices each send a signal to the server every 3 to 5 minutes, reporting their ID, status (online or offline), IP, location (latitude and longitude), parent node and some other information.

  • The other information might not be in a standard format (so there is no fixed schema), but I still need to query it.

  • Devices might disappear for some time (or forever) and stop sending signals. So I need a way to "forget" devices that have not sent a signal in the last X days. New devices might also come online at any time.

  • I need to query all this data, for example to find out how many devices are offline in a specific region or within an IP range. There won't be many queries running at the same time.

  • Some of the queries need to run fast (less than 3 minutes per query) while the database is being updated. So I need indexes on the main attributes (ID, status, IP, location and parent node). The query results do not need to be 100% accurate; eventual consistency is fine as long as updates do not take too long (more than 20 minutes on average) to appear in the query results.

  • I don't need persistence at all; if the power goes out, it's okay to lose everything.
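To make the query requirement concrete, here is a minimal in-memory sketch of the two example queries (offline count within an IP range and within a region). The record layout, the field names (`status`, `ip`, `lat`, `lon`) and the function names are assumptions made for illustration only. It scans linearly, so it shows the shape of the queries rather than an approach that would scale to 70 million records.

```python
import ipaddress

# Hypothetical sample records; the field names are assumptions for illustration.
records = [
    {"id": 1, "status": "online", "ip": "10.0.0.5", "lat": -23.5, "lon": -46.6},
    {"id": 2, "status": "offline", "ip": "10.0.1.9", "lat": -22.9, "lon": -43.2},
    {"id": 3, "status": "offline", "ip": "192.168.0.7", "lat": 40.7, "lon": -74.0},
]


def count_offline_in_network(records, cidr):
    """Count offline records whose IP falls inside the given CIDR range."""
    network = ipaddress.ip_network(cidr)
    return sum(
        1
        for r in records
        if r["status"] == "offline" and ipaddress.ip_address(r["ip"]) in network
    )


def count_offline_in_box(records, lat_min, lat_max, lon_min, lon_max):
    """Count offline records inside a latitude/longitude bounding box."""
    return sum(
        1
        for r in records
        if r["status"] == "offline"
        and lat_min <= r["lat"] <= lat_max
        and lon_min <= r["lon"] <= lon_max
    )
```

A real deployment would answer the same questions from indexes on `status`, `ip` and location instead of scanning every record.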

Given all this, I thought of using a NoSQL approach, maybe MongoDB or CouchDB, since I have experience with MapReduce and JavaScript. But I don't know which one is better for my problem (I'm leaning towards CouchDB), or whether either of them can handle this massive workload at all. I don't even know if I actually need a "traditional" database, since I don't need persistence to disk (maybe a main-memory approach would be better?), but I do need a way to build custom queries easily.

The main problems I see are the following:

  • I need to insert/update lots of tuples really fast, and I don't know beforehand whether the device behind a signal is already in the database. Almost all signals will report the same state as last time, so maybe I should look the tuple up by ID, do nothing if it has not changed, and update it if it has?

  • Forgetting offline devices. A batch job that runs during the night and removes expired tuples would solve this problem.

  • There won't be many queries running at the same time, but they need to run fast. So I guess I need a cluster that executes a single query across multiple nodes (does CouchDB MapReduce split the workload across multiple nodes of the cluster?). I'm not entirely sure I need a cluster, though; could a single, more expensive machine handle all the load?

  • I have never used a NoSQL system before, but I have theoretical knowledge of the subject.
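The "update only if changed" and "forget after X days" ideas above can be sketched as a small in-memory store. This is only an illustration under stated assumptions: the class name `SignalStore`, its methods and the `ttl_seconds` parameter are hypothetical, and the store is a single process backed by plain dictionaries, not a recommendation for any particular database.

```python
import time


class SignalStore:
    """Hypothetical in-memory store: upsert only on change, expire by last-seen time."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds
        self._state = {}      # device id -> last known attributes
        self._last_seen = {}  # device id -> timestamp of the last signal

    def ingest(self, device_id, attributes, now=None):
        """Record a signal; return True only if the stored state was written."""
        now = time.time() if now is None else now
        # Always refresh the last-seen time, even when nothing else changed.
        self._last_seen[device_id] = now
        if self._state.get(device_id) == attributes:
            return False  # same state as last time: skip the write
        self._state[device_id] = dict(attributes)
        return True

    def expire(self, now=None):
        """Drop devices not seen within the TTL; return how many were removed."""
        now = time.time() if now is None else now
        cutoff = now - self.ttl_seconds
        stale = [d for d, seen in self._last_seen.items() if seen < cutoff]
        for device_id in stale:
            del self._state[device_id]
            del self._last_seen[device_id]
        return len(stale)
```

The last-seen timestamp is kept apart from the state so that an unchanged signal still counts as a sign of life without rewriting (and re-indexing) the full record. `expire` plays the role of the nightly batch job.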

Solution

Does this make sense?

Apache Flume for collecting the signals.

It is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store. It is easy to configure and scale. Use Flume to store the data in HDFS as files.

Hive for batch queries.

Map the data files in HDFS as external tables in the Hive warehouse. Write SQL-like queries using HiveQL whenever you need offline batch processing.

HBase for random real-time reads/writes.

Since HDFS, being a file system, lacks random read/write capability, you need a database to serve that purpose. Looking at your use case, HBase seems like a good fit to me. I would not suggest MongoDB or CouchDB, since you are not dealing with documents here and both of these are document-oriented databases.

Impala for fast, interactive queries.

Impala allows you to run fast, interactive SQL queries directly on your data stored in HDFS or HBase. Unlike Hive, it does not use MapReduce. Instead it relies on massively parallel processing (MPP), so it is well suited to near-real-time queries. It is also easy to adopt, since it uses the same metadata, SQL syntax (HiveQL), ODBC driver, etc. as Hive.

HTH

Other tips

Depending on the type of analysis, CouchDB, HBase or Flume may all be good choices. For strictly numeric "write-once" metrics data, Graphite is a very popular open-source solution.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow