Question

I am using Hive (version 0.11.0) and trying to join two tables. One has 26,286,629 records and the other one has 931 records.

This is the query I am trying to run.

SELECT * 
FROM table_1 hapmap a tabl_2 b 
WHERE a.chrom = b.chrom 
  AND a.start_pos >= b.start_pos 
  AND a.end_pos <= b.end_pos
LIMIT 10
;

It looks fine at the first few minutes but if the map and reduce task reached to 100% both, then it starts to reduce again from 0%.

2014-02-20 20:38:35,531 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 399.23 sec
2014-02-20 20:38:36,533 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 399.23 sec
2014-02-20 20:38:37,536 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 399.23 sec
2014-02-20 20:38:38,539 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 399.23 sec
2014-02-20 20:38:39,541 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 192.49 sec
2014-02-20 20:38:40,544 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 192.49 sec
2014-02-20 20:38:41,547 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 192.49 sec
2014-02-20 20:38:42,550 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 192.49 sec
2014-02-20 20:38:43,554 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 192.49 sec

Here are the last 4KB logs of map and reduce task.

Map Task Log (Last 4KB)

.hadoop.mapred.MapTask: bufstart = 99180857; bufend = 16035989; bufvoid = 99614720
2014-02-20 19:57:03,008 INFO org.apache.hadoop.mapred.MapTask: kvstart = 196599; kvend = 131062; length = 327680
2014-02-20 19:57:03,180 INFO org.apache.hadoop.mapred.MapTask: Finished spill 12
2014-02-20 19:57:04,244 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true
2014-02-20 19:57:04,244 INFO org.apache.hadoop.mapred.MapTask: bufstart = 16035989; bufend = 32544041; bufvoid = 99614720
2014-02-20 19:57:04,244 INFO org.apache.hadoop.mapred.MapTask: kvstart = 131062; kvend = 65525; length = 327680
2014-02-20 19:57:04,399 INFO org.apache.hadoop.mapred.MapTask: Finished spill 13
2014-02-20 19:57:05,440 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true
2014-02-20 19:57:05,440 INFO org.apache.hadoop.mapred.MapTask: bufstart = 32544041; bufend = 48952648; bufvoid = 99614720
2014-02-20 19:57:05,440 INFO org.apache.hadoop.mapred.MapTask: kvstart = 65525; kvend = 327669; length = 327680
2014-02-20 19:57:05,598 INFO org.apache.hadoop.mapred.MapTask: Finished spill 14
2014-02-20 19:57:05,767 INFO org.apache.hadoop.hive.ql.exec.MapOperator: 9 forwarding 4000000 rows
2014-02-20 19:57:05,767 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 0 forwarding 4000000 rows
2014-02-20 19:57:05,767 INFO ExecMapper: ExecMapper: processing 4000000 rows: used memory = 123701072
2014-02-20 19:57:06,562 INFO org.apache.hadoop.hive.ql.exec.MapOperator: 9 finished. closing... 
2014-02-20 19:57:06,574 INFO org.apache.hadoop.hive.ql.exec.MapOperator: 9 forwarded 4182243 rows
2014-02-20 19:57:06,575 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 2 finished. closing... 
2014-02-20 19:57:06,575 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 2 forwarded 0 rows
2014-02-20 19:57:06,575 INFO org.apache.hadoop.hive.ql.exec.ReduceSinkOperator: 3 finished. closing... 
2014-02-20 19:57:06,575 INFO org.apache.hadoop.hive.ql.exec.ReduceSinkOperator: 3 forwarded 0 rows
2014-02-20 19:57:06,575 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 2 Close done
2014-02-20 19:57:06,575 INFO org.apache.hadoop.hive.ql.exec.MapOperator: DESERIALIZE_ERRORS:0
2014-02-20 19:57:06,575 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 0 finished. closing... 
2014-02-20 19:57:06,575 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 0 forwarded 4182243 rows
2014-02-20 19:57:06,575 INFO org.apache.hadoop.hive.ql.exec.ReduceSinkOperator: 1 finished. closing... 
2014-02-20 19:57:06,575 INFO org.apache.hadoop.hive.ql.exec.ReduceSinkOperator: 1 forwarded 0 rows
2014-02-20 19:57:06,575 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 0 Close done
2014-02-20 19:57:06,575 INFO org.apache.hadoop.hive.ql.exec.MapOperator: 9 Close done
2014-02-20 19:57:06,575 INFO ExecMapper: ExecMapper: processed 4182243 rows: used memory = 128772720
2014-02-20 19:57:06,577 INFO org.apache.hadoop.mapred.MapTask: Starting flush of map output
2014-02-20 19:57:06,713 INFO org.apache.hadoop.mapred.MapTask: Finished spill 15
2014-02-20 19:57:06,720 INFO org.apache.hadoop.mapred.Merger: Merging 16 sorted segments
2014-02-20 19:57:06,730 INFO org.apache.hadoop.mapred.Merger: Merging 7 intermediate segments out of a total of 16
2014-02-20 19:57:08,308 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 10 segments left of total size: 272242546 bytes
2014-02-20 19:57:11,762 INFO org.apache.hadoop.mapred.Task: Task:attempt_201402201604_0005_m_000000_0 is done. And is in the process of commiting
2014-02-20 19:57:11,834 INFO org.apache.hadoop.mapred.Task: Task 'attempt_201402201604_0005_m_000000_0' done.
2014-02-20 19:57:11,837 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
2014-02-20 19:57:11,868 INFO org.apache.hadoop.io.nativeio.NativeIO: Initialized cache for UID to User mapping with a cache timeout of 14400 seconds.
2014-02-20 19:57:11,869 INFO org.apache.hadoop.io.nativeio.NativeIO: Got UserName root for UID 0 from the native implementation

Reduce Task Log (Last 4KB)

x.security.auth.Subject.doAs(Unknown Source)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1426)

    at org.apache.hadoop.ipc.Client.call(Client.java:1113)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
    at com.sun.proxy.$Proxy2.addBlock(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:62)
    at com.sun.proxy.$Proxy2.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:3720)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:3580)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2600(DFSClient.java:2783)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:3023)

2014-02-20 20:11:32,743 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for null bad datanode[0] nodes == null
2014-02-20 20:11:32,743 WARN org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/tmp/hive-root/hive_2014-02-20_19-56-40_541_484791785779427461/_task_tmp.-ext-10001/_tmp.000000_0" - Aborting...
2014-02-20 20:11:32,744 ERROR org.apache.hadoop.hdfs.DFSClient: Failed to close file /tmp/hive-root/hive_2014-02-20_19-56-40_541_484791785779427461/_task_tmp.-ext-10001/_tmp.000000_0
org.apache.hadoop.ipc.RemoteException:org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /tmp/hive-root/hive_2014-02-20_19-56-40_541_484791785779427461/_task_tmp.-ext-10001/_tmp.000000_0 File does not exist. Holder DFSClient_attempt_201402201604_0005_r_000000_0_636140892_1 does not have any open files
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1999)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1990)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1899)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:783)
    at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:587)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1432)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1428)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Unknown Source)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1426)

    at org.apache.hadoop.ipc.Client.call(Client.java:1113)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
    at com.sun.proxy.$Proxy2.addBlock(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:62)
    at com.sun.proxy.$Proxy2.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:3720)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:3580)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2600(DFSClient.java:2783)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:3023)

Namenode log at (2014-02-20 20:11:32)

2014-02-20 20:01:28,769 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 93 Total time for transactions(ms): 2 Number of transactions batched in Syncs: 0 Number of syncs: 46 SyncTimes(ms): 29
2014-02-20 20:11:32,984 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 94 Total time for transactions(ms): 2 Number of transactions batched in Syncs: 0 Number of syncs: 47 SyncTimes(ms): 30
2014-02-20 20:11:32,987 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:root cause:org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /tmp/hive-root/hive_2014-02-20_19-56-40_541_484791785779427461/_task_tmp.-ext-10001/_tmp.000000_0 File does not exist. Holder DFSClient_attempt_201402201604_0005_r_000000_0_636140892_1 does not have any open files
2014-02-20 20:11:32,987 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 9000, call addBlock(/tmp/hive-root/hive_2014-02-20_19-56-40_541_484791785779427461/_task_tmp.-ext-10001/_tmp.000000_0, DFSClient_attempt_201402201604_0005_r_000000_0_636140892_1, null) from 172.27.250.92:55640: error: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /tmp/hive-root/hive_2014-02-20_19-56-40_541_484791785779427461/_task_tmp.-ext-10001/_tmp.000000_0 File does not exist. Holder DFSClient_attempt_201402201604_0005_r_000000_0_636140892_1 does not have any open files
org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /tmp/hive-root/hive_2014-02-20_19-56-40_541_484791785779427461/_task_tmp.-ext-10001/_tmp.000000_0 File does not exist. Holder DFSClient_attempt_201402201604_0005_r_000000_0_636140892_1 does not have any open files
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1999)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1990)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1899)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:783)
        at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:587)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1432)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1428)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Unknown Source)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1426)
2014-02-20 20:14:48,995 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 96 Total time for transactions(ms): 2 Number of transactions batched in Syncs: 0 Number of syncs: 48 SyncTimes(ms): 31
2014-02-20 20:18:28,063 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll Edit Log from 172.27.114.218
2014-02-20 20:18:28,064 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 96 Total time for transactions(ms): 2 Number of transactions batched in Syncs: 0 Number of syncs: 49 SyncTimes(ms): 32

Anyone can help me?

Thanks in advance.

Était-ce utile?

La solution

Hive 11 is not able to infer JOIN conditions from the WHERE clause. The query as you wrote it will be executed as a cross product join and then filtered based on the where conditions. This is extremely expensive and you should use this instead:

SELECT * 
FROM table_1 hapmap a JOIN tabl_2 b ON a.chrom = b.chrom
WHERE
      a.start_pos >= b.start_pos 
  AND a.end_pos <= b.end_pos
LIMIT 10;

Based on what you've said, this should be executed as a mapjoin and will be much faster.

Licencié sous: CC-BY-SA avec attribution
Non affilié à StackOverflow
scroll top