I'm querying ~123 millions rows with 21 columns [...] which is not very enormous in my opinion
That's exactly the problem, it's not enormous. Hive is a big data solution and is not designed to run on small data-sets like the one your using. It's like trying to use a forklift to take out your kitchen trash. Sure, it will work, but it's probably faster to just take it out by hand.
Now, having said all that, you have a couple of options if you want realtime performance closer to that of a traditional RDBMS.
- Hive 0.13+ which uses TEZ, ORC and a number of other optimizations that greatly improve response time
- Impala (part of CDH distributions) which bypasses MapReduce altogether, but is more limited in file format support.
Edit:
I'm saying that with 2 datanodes i get the same performance than with 3
That's not surprising at all. Since Hive uses MapReduce to handle query operators (join, group by, ...) it incurs all the cost that comes with MapReduce. This cost is more or less constant regardless of the size of data and number of datanodes.
Let's say you have a dataset with 100 rows in it. You might see 98% of your processing time in MapReduce initialization and 2% in actual data processing. As the size of your data increases, the cost associated with MapReduce becomes negligible compared to the total time taken.