Question

I'm writing a Hadoop job that crawls pages. The crawler library I am using stores its crawl data on the file system while it runs. I was sure the library would have to be modified to use HDFS, since a completely different set of classes is needed to interface with HDFS, whereas the library uses java.io.

However, when a colleague used hdfs://localhost/path/to/storage as the path to the storage folder, the crawler worked and was able to write to the file system. I am trying to understand why this works. Is there anything different about the Hadoop-based JVMs that causes them to resolve hdfs:// prefixed paths to a location on HDFS?
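
To make clear what I mean by a different set of classes, here is a rough sketch (the paths are just placeholders, not the crawler's actual code): writing to local disk goes through java.io, while writing to HDFS goes through Hadoop's FileSystem/Path API.

```java
import java.io.File;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TwoApis {
    public static void main(String[] args) throws Exception {
        // Local disk via java.io -- roughly what the crawler library uses internally
        File localDir = new File("/path/to/storage"); // placeholder path
        localDir.mkdirs();

        // HDFS via Hadoop's own classes -- what I expected the library would need
        Configuration conf = new Configuration();
        Path hdfsDir = new Path("hdfs://localhost/path/to/storage"); // placeholder path
        FileSystem fs = hdfsDir.getFileSystem(conf); // resolves to the HDFS implementation
        fs.mkdirs(hdfsDir);
    }
}
```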


Solution

I don't know which HDFS interface you are using. Hadoop provides a generic file system layer. If you don't specify the NameNode address in your Hadoop configuration file (HADOOP_HOME/conf/core-site.xml, property "fs.default.name"), all your "hadoop fs ..." commands will default to the local file system. So, if you don't know what the Hadoop configuration is, including "hdfs://namenode:port/" as a prefix is a good idea.
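
To illustrate the generic layer, the sketch below shows how a path's scheme decides which FileSystem implementation you get; the NameNode address hdfs://localhost:9000 and the paths are only assumptions for the example:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SchemeResolution {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml from the classpath

        // No scheme: the path is resolved against fs.default.name; if that property
        // is not set, the default is file:///, i.e. the local file system.
        FileSystem defaultFs = FileSystem.get(conf);
        System.out.println(defaultFs.getUri());

        // Explicit hdfs:// scheme: the factory returns the HDFS implementation
        // no matter what fs.default.name says.
        FileSystem hdfs = FileSystem.get(URI.create("hdfs://localhost:9000/"), conf);
        System.out.println(hdfs.getUri()); // hdfs://localhost:9000

        // Either way, the same generic API is used for reading and writing.
        try (java.io.OutputStream out = hdfs.create(new Path("/tmp/crawl-data"))) { // assumed path
            out.write("hello".getBytes());
        }
    }
}
```

So if the crawler hands its storage path to this layer, an hdfs:// prefix is enough to make it write to HDFS; without the prefix, the behaviour depends entirely on what fs.default.name is set to in core-site.xml.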

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow