Question

I have read this related question:

How to make hive run mapreduce jobs concurrently?

My question is how to set this "hive.exec.parallel.thread.number" option in an Amazon EMR cluster on startup?

Also, is setting this option equivalent to doing something like the following?

cat hive_script.hql | parallel --gnu hive -e '{}' 

The statements in my Hive script can run in any order: the script simply launches a bunch of jobs, one for each new (time-based) partition of an existing table, to create the time-based partitions of a derivative table.

If they are not equivalent in my case, will one of these strategies yield better performance?


Solution

My question is how to set this "hive.exec.parallel.thread.number" option in an Amazon EMR cluster on startup?

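Add the configuration to hive-site.xml (in my case the file path is ./.versions/hive-0.11.0/conf/hive-site.xml). Note that hive.exec.parallel.thread.number only takes effect once hive.exec.parallel is enabled: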

<property>
  <name>hive.exec.parallel</name>
  <value>true</value>
  <description>Whether to execute jobs in parallel</description>
</property>
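
The question asks specifically about hive.exec.parallel.thread.number, so here is a minimal sketch of the companion property, assuming it sits next to the one above inside the existing configuration element of hive-site.xml. The value 16 is only an illustration, not a recommendation; Hive's default is 8.

```xml
<!-- Assumed placement: alongside hive.exec.parallel in hive-site.xml -->
<property>
  <name>hive.exec.parallel.thread.number</name>
  <!-- 16 is an illustrative value; Hive's default is 8 -->
  <value>16</value>
  <description>Maximum number of jobs that can run in parallel</description>
</property>
```

This setting is an upper bound: a query with fewer independent stages than the limit will not use all the slots.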

If they are not equivalent in my case, will one of these strategies yield better performance?

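They are not equivalent. This property lets independent stages of a single Hive query run in parallel, whereas the parallel command starts a separate hive process for each line of input. Which one performs better depends on the specific query: the property only helps when a query has stages that do not depend on each other.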

Licensed under: CC-BY-SA with attribution