Question

I am working on a proof-of-concept (POC) task: implementing a feature of our product using Hadoop technology.

The feature is quite simple: we have a UI that lets you enter details about a "Network Issue". All details about such an issue are captured and inserted into a table in an Oracle DB. We then process the data in this table and calculate a Health Score.

I have to use Hadoop instead of a traditional DB, so my question is: what should I go for? Impala on HDFS, Impala on HBase, or HBase alone?

I am using a Cloudera VM for the POC implementation.

As per my understanding, HBase is a NoSQL distributed database, which is actually a layer on top of HDFS and provides Java APIs to access data. Impala is a tool that provides JDBC access to data stored in HBase or directly in HDFS. I am very new to Hadoop; can someone please help?

Solution

Well, it depends on several things, such as the kind of processing you are going to perform, the desired response time, etc. But going by what you have written here, HBase seems to be fine. I don't see any need for Impala as of now. The HBase API is good and will serve most of your needs.

IMHO, it's better to keep things simple initially and add a tool only if it is really required. The same holds here. If you reach a point where the HBase API can no longer serve the purpose, you can always add Impala to your stack.

That being said, there is one thing you should keep in mind. HBase is a NoSQL DB and doesn't follow RDBMS conventions and terminology, so you might find it a bit strange initially. Keep this in mind before you proceed, because you have to design the schema in a way that is totally different from RDBMS-style schema design.
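
To make the schema-design point a bit more concrete: logically, an HBase table behaves like a map sorted by row key, where each row holds its own set of "family:qualifier" columns. The row key therefore does much of the work that indexes and WHERE clauses do in an RDBMS. Below is a rough, plain-Python sketch of that idea. It does not talk to HBase at all; `ToyTable`, the row key layout (`<device>#<reversed timestamp>`), the column family `d`, and field names such as `severity` are all assumptions made up for illustration, not something taken from the question.

```python
from bisect import bisect_left

MAX_TS = 10**13  # assumed upper bound for positive millisecond timestamps


def make_row_key(device_id, ts_millis):
    """Compose '<device>#<reversed ts>' so the newest issue of a device sorts first."""
    return f"{device_id}#{MAX_TS - ts_millis:013d}"


class ToyTable:
    """In-memory stand-in for an HBase table: rows are kept sorted by row key."""

    def __init__(self):
        self._keys = []   # sorted list of row keys
        self._rows = {}   # row key -> {"family:qualifier": value}

    def put(self, row_key, columns):
        # Insert the key at its sorted position the first time the row is seen.
        if row_key not in self._rows:
            self._keys.insert(bisect_left(self._keys, row_key), row_key)
            self._rows[row_key] = {}
        self._rows[row_key].update(columns)

    def scan_prefix(self, prefix):
        """Yield (row_key, columns) for keys starting with prefix, in key order."""
        i = bisect_left(self._keys, prefix)
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            key = self._keys[i]
            yield key, self._rows[key]
            i += 1


table = ToyTable()
# Rows need not share the same columns; only the second row has a note.
table.put(make_row_key("router-1", 1000), {"d:severity": "low"})
table.put(make_row_key("router-1", 2000), {"d:severity": "high", "d:note": "link flap"})
table.put(make_row_key("router-2", 1500), {"d:severity": "medium"})

# All issues for one device, newest first, without visiting other devices' rows.
recent = [cols["d:severity"] for _, cols in table.scan_prefix("router-1#")]
print(recent)
```

The point of the sketch is that the access pattern ("latest issues for a given device") is decided first and the row key is built around it, whereas in an RDBMS you would normally normalize the table first and add indexes later.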

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow