Question

I am trying to run MapReduce operations through HiveQL against MongoDB. It works for a plain select query, but it throws an exception for aggregate and filter operations. I have added the mongo-hadoop jars in the appropriate places. Please help me resolve this.

hive> select * from users;
OK
1	Tom	28
2	Alice	18
3	Bob	29

hive> select * from users where age>=20;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator

Kill Command = /home/administrator/hadoop-2.2.0//bin/hadoop job  -kill job_1398687508122_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2014-05-05 12:08:41,195 Stage-1 map = 0%,  reduce = 0%
2014-05-05 12:08:57,723 Stage-1 map = 100%,  reduce = 0%
Ended Job = job_1398687508122_0002 with errors
Error during job, obtaining debugging information...
Examining task ID: task_1398687508122_0002_m_000000 (and more) from job job_1398687508122_0002

Task with the most failures(4): 
-----
Task ID:
  task_1398687508122_0002_m_000000
-----
Diagnostic Messages for this Task:
Error: java.io.IOException: java.io.IOException: Couldn't get next key/value from mongodb: 
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
    at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:276)
    at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:79)
    at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:33)
    at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:108)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:197)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:183)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:52)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:429)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
Caused by: java.io.IOException: Couldn't get next key/value from mongodb: 
    at com.mongodb.hadoop.mapred.input.MongoRecordReader.nextKeyValue(MongoRecordReader.java:93)
    at com.mongodb.hadoop.mapred.input.MongoRecordReader.next(MongoRecordReader.java:98)
    at com.mongodb.hadoop.mapred.input.MongoRecordReader.next(MongoRecordReader.java:27)
    at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:274)
    ... 13 more
Caused by: com.mongodb.MongoException$Network: Read operation to server localhost/127.0.0.1:12345 failed on database test
    at com.mongodb.DBTCPConnector.innerCall(DBTCPConnector.java:253)
    at com.mongodb.DBTCPConnector.call(DBTCPConnector.java:216)
    at com.mongodb.DBApiLayer$MyCollection.__find(DBApiLayer.java:288)
    at com.mongodb.DBApiLayer$MyCollection.__find(DBApiLayer.java:273)
    at com.mongodb.DBCursor._check(DBCursor.java:368)
    at com.mongodb.DBCursor._hasNext(DBCursor.java:459)
    at com.mongodb.DBCursor.hasNext(DBCursor.java:484)
    at com.mongodb.hadoop.mapred.input.MongoRecordReader.nextKeyValue(MongoRecordReader.java:80)
    ... 16 more
Caused by: java.net.ConnectException: Connection refused
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:579)
    at com.mongodb.DBPort._open(DBPort.java:223)
    at com.mongodb.DBPort.go(DBPort.java:125)
    at com.mongodb.DBPort.call(DBPort.java:92)
    at com.mongodb.DBTCPConnector.innerCall(DBTCPConnector.java:244)
    ... 23 more
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched: 
Job 0: Map: 1   HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec

Solution

In Hive, "select * from table" operates in a different mode than any more complicated query: it runs entirely within the Hive client, in a single JVM. The logic is that the query will eventually have to print everything to the console from a single thread anyway, so doing everything from that thread is no worse. Everything else, including even a simple filter, runs as one or more MapReduce jobs.
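You can see the difference with EXPLAIN. This is only a sketch against your users table, and the exact plan text varies by Hive version:

hive> EXPLAIN SELECT * FROM users;
-- No MapReduce stage: a simple fetch from the table's InputFormat,
-- executed inside the Hive client JVM.

hive> EXPLAIN SELECT * FROM users WHERE age >= 20;
-- The plan contains a Map Reduce stage: the filter runs in mapper
-- tasks on the cluster's task nodes, not in the client.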

When you run the query without the filter, I'm guessing you're doing so on the same machine that MongoDB is running on, so it can connect to localhost:12345. But when you run a MapReduce job, it's a different machine trying to connect: a task node. The mapper tries to connect to "localhost:12345" to get data from Mongo, but can't do so. Maybe Mongo isn't running on that machine, or maybe it's running on a different port. I don't know how your cluster is configured.
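You can confirm this by trying to reach MongoDB from one of the task nodes directly. A quick check, assuming the mongo shell is installed there (the host, port, and collection come from your log and are placeholders for your actual setup):

$ mongo localhost:12345/test --eval "db.users.count()"
# "Connection refused" here means the task node cannot see MongoDB
# at that address, which matches the exception in your job log.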

Regardless, you should specify the location of the MongoDB instance in a way that every machine in your cluster can reach. A reasonably static local IP address will work, but a hostname resolved through DNS is better.
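With the mongo-hadoop Hive connector, that address lives in the table definition. A minimal sketch, assuming you are using MongoStorageHandler and its mongo.uri table property (mongo-db-host, the port, and the column mapping are placeholders for your environment):

-- Recreate the table with a URI every cluster node can resolve,
-- instead of localhost.
CREATE TABLE users (id INT, name STRING, age INT)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES('mongo.columns.mapping'='{"id":"_id","name":"name","age":"age"}')
TBLPROPERTIES('mongo.uri'='mongodb://mongo-db-host:27017/test.users');

After that, the same filter query should let each mapper open its own connection to MongoDB from whichever node it lands on.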
