Question

I am using Hive (version 0.11.0) and trying to join two tables: one has 26,286,629 records and the other has 931 records.

This is the query I am trying to run.

SELECT *
FROM table_1 hapmap a, tabl_2 b
WHERE a.chrom = b.chrom
  AND a.start_pos >= b.start_pos
  AND a.end_pos <= b.end_pos
LIMIT 10;

It looks fine for the first few minutes, but once both the map and reduce tasks reach 100%, the reduce phase starts over from 0%.

2014-02-20 20:38:35,531 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 399.23 sec
2014-02-20 20:38:36,533 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 399.23 sec
2014-02-20 20:38:37,536 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 399.23 sec
2014-02-20 20:38:38,539 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 399.23 sec
2014-02-20 20:38:39,541 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 192.49 sec
2014-02-20 20:38:40,544 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 192.49 sec
2014-02-20 20:38:41,547 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 192.49 sec
2014-02-20 20:38:42,550 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 192.49 sec
2014-02-20 20:38:43,554 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 192.49 sec

Here are the last 4 KB of the map and reduce task logs.

Map Task Log (last 4 KB)

.hadoop.mapred.MapTask: bufstart = 99180857; bufend = 16035989; bufvoid = 99614720
2014-02-20 19:57:03,008 INFO org.apache.hadoop.mapred.MapTask: kvstart = 196599; kvend = 131062; length = 327680
2014-02-20 19:57:03,180 INFO org.apache.hadoop.mapred.MapTask: Finished spill 12
2014-02-20 19:57:04,244 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true
2014-02-20 19:57:04,244 INFO org.apache.hadoop.mapred.MapTask: bufstart = 16035989; bufend = 32544041; bufvoid = 99614720
2014-02-20 19:57:04,244 INFO org.apache.hadoop.mapred.MapTask: kvstart = 131062; kvend = 65525; length = 327680
2014-02-20 19:57:04,399 INFO org.apache.hadoop.mapred.MapTask: Finished spill 13
2014-02-20 19:57:05,440 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true
2014-02-20 19:57:05,440 INFO org.apache.hadoop.mapred.MapTask: bufstart = 32544041; bufend = 48952648; bufvoid = 99614720
2014-02-20 19:57:05,440 INFO org.apache.hadoop.mapred.MapTask: kvstart = 65525; kvend = 327669; length = 327680
2014-02-20 19:57:05,598 INFO org.apache.hadoop.mapred.MapTask: Finished spill 14
2014-02-20 19:57:05,767 INFO org.apache.hadoop.hive.ql.exec.MapOperator: 9 forwarding 4000000 rows
2014-02-20 19:57:05,767 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 0 forwarding 4000000 rows
2014-02-20 19:57:05,767 INFO ExecMapper: ExecMapper: processing 4000000 rows: used memory = 123701072
2014-02-20 19:57:06,562 INFO org.apache.hadoop.hive.ql.exec.MapOperator: 9 finished. closing... 
2014-02-20 19:57:06,574 INFO org.apache.hadoop.hive.ql.exec.MapOperator: 9 forwarded 4182243 rows
2014-02-20 19:57:06,575 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 2 finished. closing... 
2014-02-20 19:57:06,575 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 2 forwarded 0 rows
2014-02-20 19:57:06,575 INFO org.apache.hadoop.hive.ql.exec.ReduceSinkOperator: 3 finished. closing... 
2014-02-20 19:57:06,575 INFO org.apache.hadoop.hive.ql.exec.ReduceSinkOperator: 3 forwarded 0 rows
2014-02-20 19:57:06,575 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 2 Close done
2014-02-20 19:57:06,575 INFO org.apache.hadoop.hive.ql.exec.MapOperator: DESERIALIZE_ERRORS:0
2014-02-20 19:57:06,575 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 0 finished. closing... 
2014-02-20 19:57:06,575 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 0 forwarded 4182243 rows
2014-02-20 19:57:06,575 INFO org.apache.hadoop.hive.ql.exec.ReduceSinkOperator: 1 finished. closing... 
2014-02-20 19:57:06,575 INFO org.apache.hadoop.hive.ql.exec.ReduceSinkOperator: 1 forwarded 0 rows
2014-02-20 19:57:06,575 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 0 Close done
2014-02-20 19:57:06,575 INFO org.apache.hadoop.hive.ql.exec.MapOperator: 9 Close done
2014-02-20 19:57:06,575 INFO ExecMapper: ExecMapper: processed 4182243 rows: used memory = 128772720
2014-02-20 19:57:06,577 INFO org.apache.hadoop.mapred.MapTask: Starting flush of map output
2014-02-20 19:57:06,713 INFO org.apache.hadoop.mapred.MapTask: Finished spill 15
2014-02-20 19:57:06,720 INFO org.apache.hadoop.mapred.Merger: Merging 16 sorted segments
2014-02-20 19:57:06,730 INFO org.apache.hadoop.mapred.Merger: Merging 7 intermediate segments out of a total of 16
2014-02-20 19:57:08,308 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 10 segments left of total size: 272242546 bytes
2014-02-20 19:57:11,762 INFO org.apache.hadoop.mapred.Task: Task:attempt_201402201604_0005_m_000000_0 is done. And is in the process of commiting
2014-02-20 19:57:11,834 INFO org.apache.hadoop.mapred.Task: Task 'attempt_201402201604_0005_m_000000_0' done.
2014-02-20 19:57:11,837 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
2014-02-20 19:57:11,868 INFO org.apache.hadoop.io.nativeio.NativeIO: Initialized cache for UID to User mapping with a cache timeout of 14400 seconds.
2014-02-20 19:57:11,869 INFO org.apache.hadoop.io.nativeio.NativeIO: Got UserName root for UID 0 from the native implementation

Reduce Task Log (last 4 KB)

x.security.auth.Subject.doAs(Unknown Source)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1426)

    at org.apache.hadoop.ipc.Client.call(Client.java:1113)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
    at com.sun.proxy.$Proxy2.addBlock(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:62)
    at com.sun.proxy.$Proxy2.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:3720)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:3580)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2600(DFSClient.java:2783)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:3023)

2014-02-20 20:11:32,743 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for null bad datanode[0] nodes == null
2014-02-20 20:11:32,743 WARN org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/tmp/hive-root/hive_2014-02-20_19-56-40_541_484791785779427461/_task_tmp.-ext-10001/_tmp.000000_0" - Aborting...
2014-02-20 20:11:32,744 ERROR org.apache.hadoop.hdfs.DFSClient: Failed to close file /tmp/hive-root/hive_2014-02-20_19-56-40_541_484791785779427461/_task_tmp.-ext-10001/_tmp.000000_0
org.apache.hadoop.ipc.RemoteException:org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /tmp/hive-root/hive_2014-02-20_19-56-40_541_484791785779427461/_task_tmp.-ext-10001/_tmp.000000_0 File does not exist. Holder DFSClient_attempt_201402201604_0005_r_000000_0_636140892_1 does not have any open files
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1999)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1990)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1899)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:783)
    at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:587)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1432)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1428)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Unknown Source)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1426)

    at org.apache.hadoop.ipc.Client.call(Client.java:1113)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
    at com.sun.proxy.$Proxy2.addBlock(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:62)
    at com.sun.proxy.$Proxy2.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:3720)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:3580)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2600(DFSClient.java:2783)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:3023)

NameNode log around the same time (2014-02-20 20:11:32)

2014-02-20 20:01:28,769 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 93 Total time for transactions(ms): 2 Number of transactions batched in Syncs: 0 Number of syncs: 46 SyncTimes(ms): 29
2014-02-20 20:11:32,984 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 94 Total time for transactions(ms): 2 Number of transactions batched in Syncs: 0 Number of syncs: 47 SyncTimes(ms): 30
2014-02-20 20:11:32,987 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:root cause:org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /tmp/hive-root/hive_2014-02-20_19-56-40_541_484791785779427461/_task_tmp.-ext-10001/_tmp.000000_0 File does not exist. Holder DFSClient_attempt_201402201604_0005_r_000000_0_636140892_1 does not have any open files
2014-02-20 20:11:32,987 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 9000, call addBlock(/tmp/hive-root/hive_2014-02-20_19-56-40_541_484791785779427461/_task_tmp.-ext-10001/_tmp.000000_0, DFSClient_attempt_201402201604_0005_r_000000_0_636140892_1, null) from 172.27.250.92:55640: error: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /tmp/hive-root/hive_2014-02-20_19-56-40_541_484791785779427461/_task_tmp.-ext-10001/_tmp.000000_0 File does not exist. Holder DFSClient_attempt_201402201604_0005_r_000000_0_636140892_1 does not have any open files
org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /tmp/hive-root/hive_2014-02-20_19-56-40_541_484791785779427461/_task_tmp.-ext-10001/_tmp.000000_0 File does not exist. Holder DFSClient_attempt_201402201604_0005_r_000000_0_636140892_1 does not have any open files
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1999)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1990)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1899)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:783)
        at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:587)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1432)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1428)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Unknown Source)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1426)
2014-02-20 20:14:48,995 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 96 Total time for transactions(ms): 2 Number of transactions batched in Syncs: 0 Number of syncs: 48 SyncTimes(ms): 31
2014-02-20 20:18:28,063 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll Edit Log from 172.27.114.218
2014-02-20 20:18:28,064 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 96 Total time for transactions(ms): 2 Number of transactions batched in Syncs: 0 Number of syncs: 49 SyncTimes(ms): 32

Can anyone help me?

Thanks in advance.


Solution

Hive 0.11 is not able to infer JOIN conditions from the WHERE clause. As written, the query is executed as a cross (Cartesian) join of the two tables, with the WHERE conditions applied afterwards as a filter. That is extremely expensive; use an explicit JOIN ... ON instead:

SELECT *
FROM table_1 hapmap a JOIN tabl_2 b ON a.chrom = b.chrom
WHERE a.start_pos >= b.start_pos
  AND a.end_pos <= b.end_pos
LIMIT 10;
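You can check how Hive plans a query before running it by prefixing it with EXPLAIN. In the plan for the rewritten query, the Join Operator should list chrom as a join key; for the original WHERE-only version, the key list is typically empty, which indicates a cross product. (Table and column names below are the ones from your query.)

```sql
-- Inspect the query plan without launching the MapReduce job.
EXPLAIN
SELECT *
FROM table_1 hapmap a JOIN tabl_2 b ON a.chrom = b.chrom
WHERE a.start_pos >= b.start_pos
  AND a.end_pos <= b.end_pos
LIMIT 10;
```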

Given the sizes you describe (the second table has only 931 rows), this should be executed as a map join, which will be much faster.
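If Hive does not convert the join to a map join on its own, you can encourage it. The sketch below uses standard Hive settings and the MAPJOIN hint; the threshold value is illustrative, not tuned for your cluster:

```sql
-- Let Hive broadcast the small table to every mapper instead of shuffling both sides.
SET hive.auto.convert.join=true;                -- allow automatic conversion to a map join
SET hive.mapjoin.smalltable.filesize=25000000;  -- small-table size threshold in bytes (illustrative)

-- Alternatively, hint the small table explicitly:
SELECT /*+ MAPJOIN(b) */ *
FROM table_1 hapmap a JOIN tabl_2 b ON a.chrom = b.chrom
WHERE a.start_pos >= b.start_pos
  AND a.end_pos <= b.end_pos
LIMIT 10;
```

A map join loads the 931-row table into memory on each mapper, so the join completes in the map phase with no reduce step at all.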

Licensed under: CC-BY-SA with attribution