Question

Sorry if the question seems a bit naive.

I'm currently reading tutorials about Kafka & Spark and there's something I can't figure out: how to exploit / expose the data Spark receives.

Here's what I'm trying to understand :

A lot of events <=> Kafka broker <=> Spark receiver <=> Map/Reduce/Transform/Aggregate/MLearning <=> Storage ?? <=> access by end-users ??

I understand the left part of the workflow: you have some stream of events, distributed by a broker, then consumed by Spark receivers.

I've read about a lot of Spark's features; it can transform RDDs into other RDDs (basically), using in-memory storage (which can also be persisted or cached). But then what?

I don't have a specific use case in mind, but imagine I want to:
- keep an event log of the stream of events "for the record"
- aggregate the data (a simple count, for example)
- apply some machine learning (let's say regression)
- keep some value of the last event that happened, for fast operational access

In my mind, this involves different data storage systems, say Hadoop for the logs, Redis for the last event, etc.

Then I'd like users to be able to query all of that persisted data:
- a simple REST API to get the latest event value
- a complicated query-like system to fetch the event log
- some reporting API to get the prediction of the ML algorithm

How would this be achieved with Spark? Is Spark designed for such use? Does Spark offer such database-persistent storage? Or should these be different Kafka consumers of the same events?

Thanks for the help, I'm a bit confused.


Solution

Spark programs are just Java, Scala or Python code, so they can write data to all the same places any program can. In fact, Spark does not actually do anything unless you write the end result somewhere with an output operation.
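As a minimal sketch (in Scala, not from the original answer), the job below stays entirely lazy until the final output action; the HDFS path is a placeholder:

```scala
import org.apache.spark.sql.SparkSession

object MinimalJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("minimal-job").getOrCreate()
    val sc = spark.sparkContext

    val numbers = sc.parallelize(1 to 1000) // an RDD: no work has happened yet
    val squares = numbers.map(n => n * n)   // still lazy, just an execution plan

    // Only this output action triggers execution and writes the result somewhere.
    squares.saveAsTextFile("hdfs:///tmp/squares") // placeholder output path
    spark.stop()
  }
}
```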

If the end result of a Spark job is small, it can be written to a relational database, a web service, or something of the sort by collecting it back from the executors to the driver (rdd.collect) and writing it from there.
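A hedged sketch of that pattern: it assumes an RDD of a hypothetical `Event` case class, collects a small per-type count back to the driver, and writes it over plain JDBC. The connection URL, credentials and table name are all placeholders.

```scala
import java.sql.DriverManager
import org.apache.spark.rdd.RDD

case class Event(eventType: String) // hypothetical event type

def writeCounts(events: RDD[Event]): Unit = {
  // The aggregate is tiny, so collect it back to the driver...
  val countsByType: Array[(String, Long)] =
    events.map(e => (e.eventType, 1L)).reduceByKey(_ + _).collect()

  // ...and write it from the driver over JDBC (placeholder URL and credentials).
  val conn = DriverManager.getConnection("jdbc:postgresql://db-host/metrics", "user", "secret")
  try {
    val stmt = conn.prepareStatement("INSERT INTO event_counts (event_type, total) VALUES (?, ?)")
    for ((eventType, total) <- countsByType) {
      stmt.setString(1, eventType)
      stmt.setLong(2, total)
      stmt.executeUpdate()
    }
  } finally {
    conn.close()
  }
}
```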

If the end result of the Spark job is large, it can be written to a distributed store, like HDFS, HBase or Cassandra, directly from the executors. If you look around you can find code examples for doing this that scale well across Spark executors.
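A hedged sketch of the large-result case, assuming an RDD of already-aggregated records. The HDFS path is a placeholder, and `ExternalStoreClient` is a made-up stand-in for whatever HBase or Cassandra client you actually use; the point is the per-partition connection pattern, not a specific client API.

```scala
import org.apache.spark.rdd.RDD

def writeLargeResult(aggregated: RDD[String]): Unit = {
  // Pattern 1: a built-in output operation writes straight to a distributed file system.
  aggregated.saveAsTextFile("hdfs:///data/events/aggregated") // placeholder path

  // Pattern 2: open one connection per partition on the executors and write each record.
  aggregated.foreachPartition { partition =>
    val client = ExternalStoreClient.connect("store-host:9042") // hypothetical client
    partition.foreach(record => client.insert(record))
    client.close()
  }
}
```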

There are also ways of having a Spark job act as a data source. For example, Hive can be configured so that it executes its queries using a Spark job that it launches itself, collecting the query results back into the Hive process. Similarly, R can launch a Spark job and use it for fetching and processing data before collecting it back into the R process.

Both streaming and batch jobs do the same thing: they read a data source in batches onto executors, where it is transformed according to the RDD operations you have defined, until each batch reaches an output or collect operation, at which point it leaves the executors. The difference is that a batch job reads from a finite data source, while a streaming job reads from an infinite one.
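To make that concrete, here is a hedged sketch of a DStream-style streaming job using the Kafka direct stream API: it reads micro-batches from a Kafka topic, transforms each batch with ordinary RDD-style operations, and finishes with an output operation. The broker address, topic and output prefix are placeholders.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-stream-sketch")
    val ssc = new StreamingContext(conf, Seconds(5)) // 5-second micro-batches

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "kafka-host:9092", // placeholder broker address
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "spark-consumer"
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

    // Per batch: transform like any RDD, then finish with an output operation.
    stream.map(record => record.value)
          .countByValue()
          .saveAsTextFiles("hdfs:///data/event-counts") // placeholder prefix
    ssc.start()
    ssc.awaitTermination()
  }
}
```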

A concrete example: the system I'm building reads sensor data from a Kafka stream (where it is sent by various web service endpoints) and then does these things: (A) normalizes the sensor data representation, (B) aggregates it onto higher levels, (C) writes aggregated sensor data to a live event stream in Kafka for realtime monitoring, and (D) writes normalized and aggregated data to HBase for later querying.
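Purely as an illustration of step (C), and not the author's actual code, a foreachRDD block can publish aggregated records back to a Kafka topic from the executors. This assumes a DStream of (sensorId, value) pairs; the topic name and broker address are placeholders.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.streaming.dstream.DStream

// Illustration only: publish aggregated (sensorId, value) pairs to a Kafka topic
// from the executors. Broker address and topic name are placeholders.
def publishAggregates(aggregatedStream: DStream[(String, Double)]): Unit = {
  aggregatedStream.foreachRDD { rdd =>
    rdd.foreachPartition { partition =>
      val props = new Properties()
      props.put("bootstrap.servers", "kafka-host:9092")
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      val producer = new KafkaProducer[String, String](props)
      partition.foreach { case (sensorId, value) =>
        producer.send(new ProducerRecord("live-aggregates", sensorId, value.toString))
      }
      producer.close()
    }
  }
}
```

Creating a producer per partition per batch keeps the sketch simple; in practice a lazily initialized, reusable producer per executor is a common optimization.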

Licensed under: CC-BY-SA with attribution