Question

I'm currently trying to implement a solution with Hadoop, using Hive on CDH 5.0 with YARN. My architecture is: 1 NameNode, 3 DataNodes. I'm querying ~123 million rows with 21 columns.

My nodes are virtualized with 2 vCPUs @ 2.27 GHz and 8 GB of RAM.

I ran some queries and got results, and then ran the same queries against a basic MySQL instance with the same dataset in order to compare.

MySQL turned out to be much faster than Hive, and I'm trying to understand why. I know part of the poor performance comes from my hosts. My main question is: is my cluster sized correctly?

Do I need to add more DataNodes for this amount of data (which is not very large in my opinion)?

If anyone has run similar queries on approximately the same architecture, you are welcome to share your results.

Thanks!

Solution

I'm querying ~123 million rows with 21 columns [...] which is not very large in my opinion

That's exactly the problem: it's not enormous. Hive is a big data solution and is not designed to run on small datasets like the one you're using. It's like trying to use a forklift to take out your kitchen trash. Sure, it will work, but it's probably faster to just take it out by hand.

Now, having said all that, you have a couple of options if you want real-time performance closer to that of a traditional RDBMS.

  • Hive 0.13+, which uses Tez, ORC, and a number of other optimizations that greatly improve response time (see the sketch after this list)
  • Impala (part of CDH distributions), which bypasses MapReduce altogether but is more limited in file format support.
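
As a rough illustration of the first option, here is a minimal HiveQL sketch, assuming Tez is installed on the cluster and using hypothetical table names (raw_events, events_orc) and a hypothetical column (col1):

    -- Run queries through Tez instead of classic MapReduce
    SET hive.execution.engine=tez;

    -- Rewrite the data into an ORC-backed table; columnar storage and
    -- built-in compression typically reduce scan time considerably
    CREATE TABLE events_orc STORED AS ORC
    AS SELECT * FROM raw_events;

    -- Point subsequent queries at the ORC table
    SELECT col1, COUNT(*) AS cnt
    FROM events_orc
    GROUP BY col1;

The SET statement only affects the current session; the execution engine can also be configured cluster-wide in hive-site.xml.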

Edit:

I'm saying that with 2 DataNodes I get the same performance as with 3

That's not surprising at all. Since Hive uses MapReduce to handle query operators (join, group by, ...), it incurs all the overhead that comes with MapReduce. This cost is more or less constant regardless of the size of the data and the number of DataNodes.

Let's say you have a dataset with 100 rows in it. You might see 98% of your processing time in MapReduce initialization and 2% in actual data processing. As the size of your data increases, the cost associated with MapReduce becomes negligible compared to the total time taken.
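
One way to observe (and partially sidestep) that fixed cost on small inputs is Hive's local mode, which runs sufficiently small jobs in a single local process instead of launching a full distributed job. A minimal sketch, assuming hive.exec.mode.local.auto is available in your Hive version and using a hypothetical table some_small_table:

    -- Let Hive automatically run small jobs in local mode,
    -- skipping the MapReduce job-launch overhead
    SET hive.exec.mode.local.auto=true;

    -- On a tiny table this completes without the usual cluster job
    -- startup delay; on the full 123M-row table Hive still falls back
    -- to a normal distributed job
    SELECT COUNT(*) FROM some_small_table;

This does not speed up the large dataset itself, but it makes the fixed-overhead effect described above easy to see.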

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow