Can we access HDFS file system and YARN scheduler in Apache Spark?

https://datascience.stackexchange.com/questions/4995

16-10-2019

Question

In Apache Hadoop we can access the HDFS file system and the YARN scheduler. Spark, however, offers a higher-level programming model. Is it possible to access HDFS and YARN from Apache Spark too?

Thanks

Solution

Yes.

There are examples in the official Spark documentation (https://spark.apache.org/examples.html). Just use your HDFS file URI as the input file path, as below (Scala syntax).

// `spark` here is a SparkContext
val file = spark.textFile("hdfs://train_data")

OTHER TIPS

HDFS

Spark was built as an alternative to MapReduce and thus supports most of its functionality. In particular, this means that "Spark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, Amazon S3, etc." For the most common data sources (like HDFS or S3), Spark automatically recognizes the URI scheme, e.g.:

val sc = new SparkContext(...)
val localRDD = sc.textFile("file://...")
val hdfsRDD  = sc.textFile("hdfs://...")
val s3RDD    = sc.textFile("s3://...")
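To make the snippet above concrete, here is a minimal sketch of a complete application that counts the lines of a text file given by URI. The object name, the helper countLines, and the NameNode host and port in the default path are all hypothetical placeholders, not part of the original answer. A fully qualified HDFS URI names the NameNode (hdfs://host:port/path), while hdfs:///path falls back to the fs.defaultFS setting of the Hadoop configuration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ReadByScheme {
  // Counts the lines of a text file; the URI scheme (file, hdfs, ...) selects the file system.
  def countLines(sc: SparkContext, uri: String): Long =
    sc.textFile(uri).count()

  def main(args: Array[String]): Unit = {
    // Hypothetical default: replace host, port and path with your own.
    val uri = args.headOption
      .getOrElse("hdfs://namenode.example.com:8020/data/train_data")

    // Fall back to local mode only when no master was supplied (e.g. by spark-submit).
    val conf = new SparkConf()
      .setAppName("ReadByScheme")
      .setIfMissing("spark.master", "local[*]")

    val sc = new SparkContext(conf)
    try {
      println(s"$uri has ${countLines(sc, uri)} lines")
    } finally {
      sc.stop()
    }
  }
}
```

The same code should work unchanged with a file:// URI, which is convenient for trying it out without a cluster.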

For more complicated cases you may need to work with lower-level functions like newAPIHadoopFile or newAPIHadoopRDD:

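import org.apache.hadoop.hbase.mapreduce.TableInputFormat

// conf is a Hadoop Configuration that names the HBase table to read
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
      classOf[org.apache.hadoop.hbase.client.Result])
// the MyCustom* classes stand in for your own InputFormat and its key/value types
val customRDD = sc.newAPIHadoopRDD(conf, classOf[MyCustomInputFormat],
      classOf[MyCustomKeyClass],
      classOf[MyCustomValueClass])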

As a general rule, if a data source is available for MapReduce, it can easily be reused in Spark.
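As an illustration of reusing a MapReduce input format, the sketch below reads a plain text file through Hadoop's own TextInputFormat with newAPIHadoopFile. It is only one possible way to do this; the object name and the helper readWithOffsets are hypothetical, and the input path is taken from the command line.

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object HadoopInputFormatExample {
  // Reads (byte offset, line) pairs using the same InputFormat a MapReduce job would use.
  def readWithOffsets(sc: SparkContext, path: String): RDD[(Long, String)] =
    sc.newAPIHadoopFile(
        path,
        classOf[TextInputFormat],
        classOf[LongWritable],
        classOf[Text],
        sc.hadoopConfiguration)
      // Hadoop reuses Writable instances, so copy them into plain values right away.
      .map { case (offset, line) => (offset.get, line.toString) }

  def main(args: Array[String]): Unit = {
    val path = args(0) // e.g. an hdfs:// or file:// URI
    val conf = new SparkConf()
      .setAppName("HadoopInputFormatExample")
      .setIfMissing("spark.master", "local[*]")
    val sc = new SparkContext(conf)
    try {
      readWithOffsets(sc, path).take(5).foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Swapping TextInputFormat and the key/value classes for those of another format follows the same pattern as the HBase call above.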

YARN

Currently Spark supports 3 cluster managers / modes:

Standalone
Mesos
YARN

Standalone mode uses Spark's own master server and works for Spark only, while the YARN and Mesos modes aim to share the same set of system resources between several frameworks (e.g. Spark, MapReduce, Impala, etc.). A comparison of YARN and Mesos may be found here, and a detailed description of Spark on YARN here.

And, in the best traditions of Spark, you can switch between the different modes simply by changing the master URL.
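A minimal sketch of what changing the master URL looks like in code. The object name, the mode labels, and the host names and ports are hypothetical placeholders. Note that recent Spark versions use the plain master "yarn" (older releases used yarn-client and yarn-cluster), and that in YARN mode the cluster location is taken from the Hadoop configuration (HADOOP_CONF_DIR or YARN_CONF_DIR) rather than from the URL.

```scala
import org.apache.spark.SparkConf

object MasterUrls {
  // One master URL per mode; hosts and ports are placeholders.
  val masters: Map[String, String] = Map(
    "local"      -> "local[*]",
    "standalone" -> "spark://master.example.com:7077",
    "mesos"      -> "mesos://master.example.com:5050",
    "yarn"       -> "yarn"
  )

  // Builds a configuration for the given mode; nothing else in the application changes.
  def confFor(mode: String): SparkConf =
    new SparkConf(false) // false: do not load spark.* system properties
      .setAppName("ModeSwitch")
      .setMaster(masters(mode))

  def main(args: Array[String]): Unit = {
    masters.keys.toSeq.sorted.foreach { mode =>
      println(s"$mode -> ${confFor(mode).get("spark.master")}")
    }
  }
}
```

In practice the master is often left out of the code entirely and passed with the --master option of spark-submit, so the same jar can run in any mode.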

Licensed under: CC-BY-SA with attribution
