Question

Both the Apache Spark and Apache Flink projects claim largely similar capabilities.

What is the difference between these projects? Is there any advantage to using either Spark or Flink?

Thanks


Solution

Flink is the Apache renaming of the Stratosphere project from several universities in Berlin. It doesn't have the same industrial foothold and momentum that the Spark project has, but it seems nice, and more mature than, say, Dryad. I'd say it's worth investigating, at least for personal or academic use, but for industrial deployment I'd still prefer Spark, which at this point is battle tested. For a more technical discussion, see this Quora post by committers on both projects.

OTHER TIPS

A feature-wise comparison of Spark vs. Flink:

  1. Data Processing

    Spark: Apache Spark is part of the Hadoop ecosystem. It is a batch-processing system at heart, but it also supports stream processing.

    Flink: Apache Flink provides a single runtime for both streaming and batch processing.

  2. Streaming Engine

    Spark: Spark Streaming processes data streams in micro-batches; each batch contains the events that arrived over the batch interval. This may not be enough for use cases that need to process large streams of live data and deliver results in real time.

    Flink: Apache Flink is a true streaming engine. It uses streams for all workloads: streaming, SQL, micro-batch, and batch. A batch is simply treated as a finite stream.

  3. Data Flow

    Spark: Although many machine learning algorithms have a cyclic data flow, Spark represents computation as a directed acyclic graph (DAG).

    Flink: Flink takes a different approach: it supports controlled cyclic dependency graphs at runtime, which lets iterative machine learning algorithms be represented very efficiently.

  4. Computation Model

    Spark: Spark has adopted micro-batching, essentially a “collect, then process” model of computation.

    Flink: Flink has adopted a continuous-flow, operator-based streaming model: a continuous-flow operator processes each record as it arrives, without waiting to collect a batch first (see the word-count sketches after this list).

  5. Performance

    Spark: Although Apache Spark has an excellent community background and is now considered to have the most mature community, its stream processing is less efficient than Apache Flink's because it relies on micro-batch processing.

    Flink: Apache Flink's performance is excellent compared to other data-processing systems. Its native closed-loop iteration operators make machine learning and graph processing faster when comparing Hadoop vs. Spark vs. Flink.

  6. Memory management

    Spark: It provides configurable memory management. Since the 1.6 release, Spark has moved toward automated (unified) memory management (see the configuration sketch after this list).

    Flink: It provides automatic memory management. It has its own memory management system, separate from Java’s garbage collector.

  7. Fault tolerance

    Spark: Spark Streaming recovers lost work and, with no extra code or configuration, delivers exactly-once semantics out of the box. Read more in the Spark fault-tolerance documentation.

    Flink: Apache Flink's fault-tolerance mechanism is based on Chandy-Lamport distributed snapshots. The mechanism is lightweight, so it maintains high throughput while providing strong consistency guarantees (see the checkpointing sketch after this list).

  8. Scalability

    Spark: It is highly scalable; nodes can keep being added to the cluster. A large known Spark cluster is around 8,000 nodes.

    Flink: Apache Flink is also highly scalable; nodes can keep being added to the cluster. A large known Flink cluster is in the thousands of nodes.

  9. Iterative Processing

    Spark: It iterates over its data in batches; each iteration has to be scheduled and executed as a separate job.

    Flink: It iterates over data using its streaming architecture and native iteration operators. Flink can be instructed to process only the parts of the data that have actually changed, which significantly improves job performance (see the iteration sketches after this list).

  10. Language Support

    Spark: It supports Java, Scala, Python, and R. Spark itself is implemented in Scala and provides APIs in Java, Python, and R.

    Flink: It supports Java, Scala, and Python. Flink is implemented in Java and also provides a Scala API.
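
To make the micro-batch vs. continuous-flow distinction in points 2 and 4 concrete, here is a minimal Spark Structured Streaming word count in Scala. It is only a sketch: the socket source on localhost:9999 and the 5-second trigger interval are illustrative assumptions, not something from the original answer.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.Trigger

    object SparkMicroBatchWordCount {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder
          .appName("MicroBatchWordCount")
          .master("local[*]")
          .getOrCreate()
        import spark.implicits._

        // Records arriving on the socket are buffered by the engine and
        // handed to the query one micro-batch at a time.
        val lines = spark.readStream
          .format("socket")
          .option("host", "localhost")
          .option("port", 9999)
          .load()

        val counts = lines.as[String]
          .flatMap(_.split("\\s+"))
          .groupBy("value")
          .count()

        // "Collect, then process": every 5 seconds the records gathered so
        // far are processed together as one small batch.
        val query = counts.writeStream
          .outputMode("complete")
          .format("console")
          .trigger(Trigger.ProcessingTime("5 seconds"))
          .start()

        query.awaitTermination()
      }
    }

The equivalent Flink job is a single continuous dataflow: each record flows through the operators as soon as it arrives, and there is no batch interval to configure (again a sketch, assuming the same hypothetical socket source):

    import org.apache.flink.streaming.api.scala._

    object FlinkContinuousWordCount {
      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment

        // Each incoming line is split, keyed, and counted per record;
        // the running count is updated continuously, not per batch.
        val counts = env
          .socketTextStream("localhost", 9999)
          .flatMap(_.split("\\s+"))
          .map((_, 1))
          .keyBy(_._1)
          .sum(1)

        counts.print()
        env.execute("ContinuousWordCount")
      }
    }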
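
On point 6, a short sketch of what "configurable" looks like on the Spark side: since Spark 1.6 the unified memory manager shares a single fraction of the heap between execution and storage, and the knobs below are usually left at their defaults. The values shown are illustrative, not recommendations.

    import org.apache.spark.sql.SparkSession

    object MemoryConfigExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder
          .appName("MemoryConfigExample")
          .master("local[*]")
          // Fraction of (heap - 300 MB) shared by execution and storage;
          // the unified memory manager moves memory between the two as needed.
          .config("spark.memory.fraction", "0.6")
          // Portion of the above that is protected from eviction for storage.
          .config("spark.memory.storageFraction", "0.5")
          .getOrCreate()

        println(spark.conf.get("spark.memory.fraction"))
        spark.stop()
      }
    }

On the Flink side, managed memory is configured per cluster (for example via taskmanager.memory.managed.fraction in flink-conf.yaml in recent releases) rather than tuned in job code, which is what "automatic memory management" refers to above.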
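
For point 7, Flink's snapshot-based fault tolerance is switched on per job by enabling checkpointing. This is a sketch only: the 10-second interval and the socket source are assumptions, and a production job would also configure a durable state backend.

    import org.apache.flink.streaming.api.CheckpointingMode
    import org.apache.flink.streaming.api.scala._

    object CheckpointedJob {
      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment

        // Inject a Chandy-Lamport-style barrier into the stream every 10 s;
        // operators snapshot their state as the barrier passes through them.
        env.enableCheckpointing(10000L, CheckpointingMode.EXACTLY_ONCE)

        env
          .socketTextStream("localhost", 9999)
          .map(_.toUpperCase)
          .print()

        env.execute("CheckpointedJob")
      }
    }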
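
Points 3 and 9 are easiest to see in code. In Spark, an iterative algorithm is typically a driver-side loop that schedules work pass by pass, whereas Flink's batch (DataSet) API, deprecated in recent releases but still the clearest illustration, has a native iterate operator that keeps the loop inside one dataflow. Both snippets are sketches with a made-up update step (scaling values by 0.9 for ten passes).

    import org.apache.spark.sql.SparkSession

    object SparkDriverLoop {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder
          .appName("SparkDriverLoop")
          .master("local[*]")
          .getOrCreate()

        var current = spark.sparkContext.parallelize(1 to 1000000).map(_.toDouble)
        for (_ <- 1 to 10) {
          // Each pass extends the lineage/DAG; the work is scheduled and
          // executed as ordinary batch stages driven from this loop.
          current = current.map(x => x * 0.9)
        }
        println(current.sum())
        spark.stop()
      }
    }

The Flink version expresses the same loop as a single operator in the job graph:

    import org.apache.flink.api.scala._

    object FlinkNativeIteration {
      def main(args: Array[String]): Unit = {
        val env = ExecutionEnvironment.getExecutionEnvironment
        val initial = env.fromCollection(1 to 1000000).map(_.toDouble)

        // The loop lives inside the dataflow: the runtime executes the step
        // function 10 times without scheduling a separate job per pass.
        val result = initial.iterate(10) { current =>
          current.map(x => x * 0.9)
        }

        result.reduce(_ + _).print()
      }
    }

Flink additionally offers delta iterations (iterateDelta in the DataSet API), which is what the remark in point 9 about processing only the data that has actually changed refers to.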
