I reviewed here,

How to make hive run mapreduce jobs concurrently?

My question is how to set this "hive.exec.parallel.thread.number" option in an Amazon EMR cluster on startup?

Also, is setting this option equivalent to doing something like the following?

cat hive_script.hql | parallel --gnu hive -e '{}' 

My hive script can run in any order, as it is simply launching a bunch of jobs for each new (time based) partition of an existing table to create time based partitions of a derivative table.

If they are not equivalent in my case, will one of these strategies yield better performance?

有帮助吗?

解决方案

My question is how to set this "hive.exec.parallel.thread.number" option in an Amazon EMR cluster on startup?

add configuration into hive-site.xml (In my case, file path is ./.versions/hive-0.11.0/conf/hive-site.xml)

<property>
  <name>hive.exec.parallel</name>
  <value>true</value>
  <description>Whether to execute jobs in parallel</description>
</property>

If they are not equivalent in my case, will one of these strategies yield better performance?

It's different. This property controls different stages paralle in one hive job, so performance depends on specific hive query.

许可以下: CC-BY-SA归因
不隶属于 StackOverflow
scroll top