Question

I have found significant performance differences (in both wall-clock runtime and CPU time) between Pig and Hive, and I am looking for ways to get to the bottom of them. I have used both languages' explain features (Hive: the EXPLAIN keyword; Pig: pig -e 'explain -script explain.pig') to compare the generated syntax trees and the logical, physical and map-reduce plans. However, both seem to do the same things. The job tracker does show a difference in the number of map and reduce tasks launched, but after I made sure that both use the same number of map and reduce tasks, the performance difference remained. My question therefore is: in what other ways can I analyze what is going on (possibly at a lower level, such as the bytecode level)?

EDIT: I am running the TPC-H benchmark from the TPC (available at https://issues.apache.org/jira/browse/PIG-2397 and https://issues.apache.org/jira/browse/HIVE-600 ). However, even simpler scripts show a quite large performance difference. For example:

SELECT (dataset.age * dataset.gpa + 3) AS F1,
  (dataset.age/dataset.gpa - 1.5) AS F2 
  FROM  dataset
  WHERE dataset.gpa > 0;

I still need to fully evaluate the TPC-H benchmarks (I will update later); however, the results for the simpler scripts are detailed in this document: https://www.dropbox.com/s/16u3kx852nu6waw/output.pdf

(jpg: http://i.imgur.com/1j1rCWS.jpg )


Solution

I have read some of the Pig and Hive source code before, so I can share some opinions.

Because I was focusing on the join implementation, I can provide some details about it for both Pig and Hive. Hive's join implementation is less efficient than Pig's. I do not know why Hive needs to create so many objects in its join implementation; such allocations are slow and should have been avoided. I think that is why Hive performs joins more slowly than Pig. If you are interested, you can check the CommonJoinOperator code yourself. So my guess is that Pig is usually more efficient because of its higher-quality code.
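To make the object-creation point concrete, here is a minimal Java sketch of the general pattern. It is not taken from Hive's CommonJoinOperator or from Pig; the class name RowAllocationSketch and both methods are hypothetical and only contrast building a new container per joined row with reusing one buffer. A naive timing loop like this is not a rigorous benchmark (the JIT may optimize some allocations away), so treat any numbers it prints as a rough indication only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: this is NOT Hive's or Pig's join code.
public class RowAllocationSketch {

    // Builds a fresh list (plus boxed values) for every "joined" row.
    static long sumAllocating(int rows) {
        long checksum = 0;
        for (int i = 0; i < rows; i++) {
            List<Object> joined = new ArrayList<>(2); // new container per row
            joined.add(Integer.valueOf(i));           // left-side value, boxed
            joined.add(Integer.valueOf(i * 2));       // right-side value, boxed
            checksum += (Integer) joined.get(0) + (Integer) joined.get(1);
        }
        return checksum;
    }

    // Reuses a single preallocated buffer for every "joined" row.
    static long sumReusing(int rows) {
        long checksum = 0;
        int[] joined = new int[2]; // allocated once, overwritten per row
        for (int i = 0; i < rows; i++) {
            joined[0] = i;
            joined[1] = i * 2;
            checksum += joined[0] + joined[1];
        }
        return checksum;
    }

    public static void main(String[] args) {
        int rows = 5_000_000; // arbitrary size chosen for the sketch

        long t0 = System.nanoTime();
        long a = sumAllocating(rows);
        long t1 = System.nanoTime();
        long b = sumReusing(rows);
        long t2 = System.nanoTime();

        System.out.printf("allocating: %d ms (checksum %d)%n", (t1 - t0) / 1_000_000, a);
        System.out.printf("reusing:    %d ms (checksum %d)%n", (t2 - t1) / 1_000_000, b);
    }
}
```

Both methods compute the same checksum, so any difference in runtime comes from allocation and boxing rather than from the arithmetic. For a trustworthy measurement, a benchmark harness such as JMH or a profiler attached to the actual map and reduce tasks would be a better tool than this loop.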

License: CC-BY-SA with attribution
Not affiliated with StackOverflow