Question

I want to create a Spark Streaming application written in Scala. I want my application to:

  • read from an HDFS text file line by line
  • analyze every line as a String and, if needed, modify it
  • keep the state needed for the analysis in some kind of data structure (probably hashes)
  • output everything to text files (of any kind)

I've had no problems with the first step:

val lines = ssc.textFileStream("hdfs://localhost:9000/path/")

My analysis consists of searching the hashes for a match on some fields of the analyzed string; that's why I need to maintain state and process the lines iteratively. The data in those hashes is also extracted from the analyzed strings.

What can I do for the next steps?

Solution

Since you just have to read one HDFS text file line by line, you probably do not need Spark Streaming for that. Plain Spark will do.

val lines = sparkContext.textFile("...")
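
For completeness, here is a minimal sketch of the batch setup, reusing the HDFS path from the question; the app name and master are placeholder assumptions:

import org.apache.spark.{SparkConf, SparkContext}

// Minimal batch setup; the app name and master are placeholders.
val conf = new SparkConf().setAppName("LineAnalyzer").setMaster("local[*]")
val sparkContext = new SparkContext(conf)

// The same HDFS path as in the streaming version, read once as an RDD[String].
val lines = sparkContext.textFile("hdfs://localhost:9000/path/")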

Then you can use mapPartitions to process the whole partitioned file in a distributed fashion.

val processedLines = lines.mapPartitions { partitionAsIterator => 
  processPartitionAndReturnNewIterator(partitionAsIterator) 
}

In that function, you can walk through the lines of the partition, store state in a hash map, and so on, and finally return another iterator of output records corresponding to that partition.
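
For illustration, a hypothetical implementation of processPartitionAndReturnNewIterator; the tab-separated line format and the choice of the first field as lookup key are assumptions, not something given in the question:

import scala.collection.mutable

// Hypothetical per-partition processing: lines are assumed to be
// tab-separated, with the lookup key in the first field.
def processPartitionAndReturnNewIterator(
    partitionAsIterator: Iterator[String]): Iterator[String] = {
  val state = mutable.HashMap.empty[String, String]  // per-partition state
  partitionAsIterator.map { line =>
    val key = line.split("\t")(0)
    state.get(key) match {
      case Some(previous) =>
        s"$line\tmatched:$previous"  // a match was found: modify the line
      case None =>
        state(key) = line            // remember this line for later matches
        line
    }
  }
}

Because Iterator.map is lazy, the hash map fills up as the partition is consumed, so each line is matched against everything seen earlier in the same partition.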

Now if you want to share state across partitions, you will probably have to do further aggregations such as groupByKey() or reduceByKey() on the processedLines dataset.
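
As a sketch of that, assuming each processed line can be keyed by its first field: reduceByKey merges the per-key state across partitions, and saveAsTextFile covers the text-file output asked for in the question. The keying scheme, merge function, and output path are illustrative assumptions:

// Key each processed line by its first field so state can be merged globally.
val byKey = processedLines.map(line => (line.split("\t")(0), line))

// Merge values per key across all partitions; here we keep the longest line.
val merged = byKey.reduceByKey((a, b) => if (a.length >= b.length) a else b)

// Write the results back out as plain text files (the path is a placeholder).
merged.values.saveAsTextFile("hdfs://localhost:9000/output/")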

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow