Question

I am trying to read a file from HDFS in the Spark shell and getting the error below. Creating the RDD works fine, but when I try to run count on it, it throws a connection error. I have a single-node HDFS setup, and Spark is running on the same machine. Please help. When I run the "jps" command on the same box to check whether the Hadoop cluster is working as expected, everything looks fine and I see the output below.

[hadoop@idcrebalancedev ~]$ jps
23606 DataNode
28245 Jps
23982 TaskTracker
26537 Main
23738 SecondaryNameNode
23858 JobTracker
23488 NameNode

Below is the output for RDD creation and error on count.

scala> val hdfsFile = sc.textFile("hdfs://idcrebalancedev.bxc.is-teledata.com:23488/user/hadoop/reegal/4300.txt")
14/04/08 12:25:15 INFO MemoryStore: ensureFreeSpace(784) called with curMem=35456, maxMem=308713881
14/04/08 12:25:15 INFO MemoryStore: Block broadcast_1 stored as values to memory (estimated size 784.0 B, free 294.4 MB)
hdfsFile: org.apache.spark.rdd.RDD[String] = MappedRDD[5] at textFile at <console>:12

scala> hdfsFile.count()
14/04/08 12:25:22 INFO Client: Retrying connect to server: idcrebalancedev.bxc.is-teledata.com/172.29.253.4:23488. Already tried 0 time(s).
14/04/08 12:25:23 INFO Client: Retrying connect to server: idcrebalancedev.bxc.is-teledata.com/172.29.253.4:23488. Already tried 1 time(s).
14/04/08 12:25:24 INFO Client: Retrying connect to server: idcrebalancedev.bxc.is-teledata.com/172.29.253.4:23488. Already tried 2 time(s).
14/04/08 12:25:25 INFO Client: Retrying connect to server: idcrebalancedev.bxc.is-teledata.com/172.29.253.4:23488. Already tried 3 time(s).
14/04/08 12:25:26 INFO Client: Retrying connect to server: idcrebalancedev.bxc.is-teledata.com/172.29.253.4:23488. Already tried 4 time(s).
14/04/08 12:25:27 INFO Client: Retrying connect to server: idcrebalancedev.bxc.is-teledata.com/172.29.253.4:23488. Already tried 5 time(s).
14/04/08 12:25:28 INFO Client: Retrying connect to server: idcrebalancedev.bxc.is-teledata.com/172.29.253.4:23488. Already tried 6 time(s).
14/04/08 12:25:29 INFO Client: Retrying connect to server: idcrebalancedev.bxc.is-teledata.com/172.29.253.4:23488. Already tried 7 time(s).
14/04/08 12:25:30 INFO Client: Retrying connect to server: idcrebalancedev.bxc.is-teledata.com/172.29.253.4:23488. Already tried 8 time(s).
14/04/08 12:25:31 INFO Client: Retrying connect to server: idcrebalancedev.bxc.is-teledata.com/172.29.253.4:23488. Already tried 9 time(s).
java.net.ConnectException: Call to idcrebalancedev.bxc.is-teledata.com/172.29.253.4:23488 failed on connection exception: java.net.ConnectException: Connection refused
        at org.apache.hadoop.ipc.Client.wrapException(Client.java:1099)
        at org.apache.hadoop.ipc.Client.call(Client.java:1075)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
        at com.sun.proxy.$Proxy9.getProtocolVersion(Unknown Source)
        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)
        at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:119)
        at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:238)
        at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:203)
        at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:176)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
        at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:140)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
        at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:898)
        at org.apache.spark.rdd.RDD.count(RDD.scala:720)
        at $iwC$$iwC$$iwC$$iwC.<init>(<console>:15)
        at $iwC$$iwC$$iwC.<init>(<console>:20)
        at $iwC$$iwC.<init>(<console>:22)
        at $iwC.<init>(<console>:24)
        at <init>(<console>:26)
        at .<init>(<console>:30)
        at .<clinit>(<console>)
        at .<init>(<console>:7)
        at .<clinit>(<console>)
        at $print(<console>)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:622)
        at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:772)
        at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1040)
        at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:609)
        at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:640)
        at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:604)
        at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:788)
        at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:833)
        at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:745)
        at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:593)
        at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:600)
        at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:603)
        at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:926)
        at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:876)
        at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:876)
        at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
        at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:876)
        at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:968)
        at org.apache.spark.repl.Main$.main(Main.scala:31)
        at org.apache.spark.repl.Main.main(Main.scala)
Caused by: java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:601)
        at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:489)
        at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:434)
        at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:560)
        at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:184)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:1206)
        at org.apache.hadoop.ipc.Client.call(Client.java:1050)
        ... 60 more


scala>

Output of the lsof command on the box, to check whether the listening ports are working as expected:

[hadoop@idcrebalancedev ~]$ lsof -n|grep LIST
java      23488    hadoop   57u     IPv4              91020      0t0     TCP *:59730 (LISTEN)
java      23488    hadoop   66u     IPv4              91176      0t0     TCP 127.0.0.1:cslistener (LISTEN)
java      23488    hadoop   77u     IPv4              91321      0t0     TCP *:50070 (LISTEN)
java      23606    hadoop   57u     IPv4              91167      0t0     TCP *:32866 (LISTEN)
java      23606    hadoop   67u     IPv4              91567      0t0     TCP *:50010 (LISTEN)
java      23606    hadoop   68u     IPv4              91569      0t0     TCP *:50075 (LISTEN)
java      23606    hadoop   74u     IPv4              91599      0t0     TCP *:50020 (LISTEN)
java      23738    hadoop   57u     IPv4              91493      0t0     TCP *:49940 (LISTEN)
java      23738    hadoop   67u     IPv4              91642      0t0     TCP *:50090 (LISTEN)
java      23858    hadoop   57u     IPv4              91660      0t0     TCP *:46014 (LISTEN)
java      23858    hadoop   63u     IPv4              91778      0t0     TCP 127.0.0.1:etlservicemgr (LISTEN)
java      23858    hadoop   73u     IPv4              91806      0t0     TCP *:50030 (LISTEN)
java      23982    hadoop   61u     IPv4              91909      0t0     TCP 127.0.0.1:55097 (LISTEN)
java      23982    hadoop   78u     IPv4              92170      0t0     TCP *:50060 (LISTEN)
java      26537    hadoop   10u     IPv6            1805728      0t0     TCP *:40865 (LISTEN)
java      26537    hadoop   38u     IPv6            1805807      0t0     TCP 172.29.253.4:47852 (LISTEN)
java      26537    hadoop   42u     IPv6            1805810      0t0     TCP *:44402 (LISTEN)
java      26537    hadoop   43u     IPv6            1805812      0t0     TCP *:32796 (LISTEN)
java      26537    hadoop   44u     IPv6            1805816      0t0     TCP *:46234 (LISTEN)
java      26537    hadoop   45u     IPv6            1805818      0t0     TCP *:yo-main (LISTEN)

Solution 2

There are a couple of issues here that I was able to find:

1: We should not use the port of the web UI. I was using that initially, so it was not working.
2: All requests should go to the NameNode, not to anything else.
3: After putting localhost:9000 in the request above, it started working fine (see the sketch below).
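For reference, this is roughly what the corrected calls look like in the Spark shell. The port 9000 here is an assumption about what fs.default.name in core-site.xml points to, and the file path is the one from the question, so adjust both to your setup:

    // read the file through the NameNode RPC port (assumed 9000 here), not the web UI port
    val hdfsFile = sc.textFile("hdfs://localhost:9000/user/hadoop/reegal/4300.txt")
    hdfsFile.count()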

One more question I have based on this: how can I make it work with the domain name rather than localhost and the port? Maybe the answer is that you need to change that in the core-site.xml file and specify the proper address instead of localhost there?
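If that guess is correct, the change would be a core-site.xml entry along these lines; the hostname below is just the machine name from the question and the port is an assumption, not something verified against this cluster:

 <property>
    <name>fs.default.name</name>
    <value>hdfs://idcrebalancedev.bxc.is-teledata.com:9000</value>
 </property>

The NameNode would then need a restart so it binds to that address, and the same host:port goes into the hdfs:// URL passed to sc.textFile.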

OTHER TIPS

I faced a similar issue while trying to access an HDFS file through a Spark deployment (Scala shell).

It is important to mention here that when configuring a Hadoop cluster, core-site.xml is the file that contains the name of the file system and its URI scheme.

We should refer to this file and create a Spark RDD like:

val test = sc.textFile("hdfs://hostname:port/hdfs_file_path")

Example: My core-site.xml content:

 <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
 </property>

My HDFS file:

/user/hadoopuser/test_file.dat

Accessing through Scala

val textFile=sc.textFile("hdfs://localhost:54310/user/hadoopuser/test_file.dat")
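As a quick sanity check (assuming the file actually exists at that HDFS path), an action such as count should now return a result instead of retrying the connection:

textFile.count()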