Question

Both the Apache Spark and Apache Flink projects claim largely similar capabilities.

What is the difference between these projects? Is there any advantage to using either Spark or Flink?

Thanks


Solution

Flink is the Apache renaming of the Stratosphere project from several universities in Berlin. It doesn't have the same industrial foothold and momentum that the Spark project has, but it seems nice, and more mature than, say, Dryad. I'd say it's worth investigating, at least for personal or academic use, but for industrial deployment I'd still prefer Spark, which at this point is battle tested. For a more technical discussion, see this Quora post by committers on both projects.

OTHER TIPS

A feature-wise comparison of Spark vs. Flink:

  1. Data Processing

    Spark: Apache Spark is part of the Hadoop ecosystem. It is a batch-processing system at heart, but it also supports stream processing.

    Flink: Apache Flink provides a single runtime for both streaming and batch processing.

  2. Streaming Engine

    Spark: Spark Streaming processes data streams in micro-batches; each batch contains the events that arrived over the batch interval. This may not be enough for use cases that need to process large streams of live data and deliver results in real time.

    Flink: Apache Flink is a true streaming engine. It uses streams for all workloads: streaming, SQL, micro-batch, and batch. A batch is simply treated as a finite stream.

  3. Data Flow

    Spark: Although many machine learning algorithms have a cyclic data flow, Spark represents computation as a directed acyclic graph (DAG).

    Flink: Flink takes a different approach: it supports controlled cyclic dependency graphs at runtime, which lets iterative machine learning algorithms be represented very efficiently.

  4. Computation Model

    Spark: Spark has adopted micro-batching, essentially a “collect, then process” model of computation.

    Flink: Flink has adopted a continuous-flow, operator-based streaming model: a continuous-flow operator processes each record as it arrives, without waiting to collect a batch first (see the word-count sketches after this list).

  5. Performance

    Spark: Although Apache Spark has an excellent community background and is now considered to have the most mature community, its stream processing is less efficient than Apache Flink's because it relies on micro-batch processing.

    Flink: Apache Flink's performance is excellent compared to other data-processing systems. Its native closed-loop iteration operators make machine learning and graph processing faster when comparing Hadoop vs. Spark vs. Flink.

  6. Memory management

    Spark: It provides configurable memory management. Since the 1.6 release, Spark has moved toward automated (unified) memory management (see the configuration sketch after this list).

    Flink: It provides automatic memory management. It has its own memory management system, separate from Java’s garbage collector.

  7. Fault tolerance

    Spark: Spark Streaming recovers lost work and, with no extra code or configuration, delivers exactly-once semantics out of the box. Read more in the Spark fault-tolerance documentation.

    Flink: Apache Flink's fault-tolerance mechanism is based on Chandy-Lamport distributed snapshots. The mechanism is lightweight, so it maintains high throughput while providing strong consistency guarantees (see the checkpointing sketch after this list).

  8. Scalability

    Spark: It is highly scalable; nodes can keep being added to the cluster. A large known Spark cluster is around 8,000 nodes.

    Flink: Apache Flink is also highly scalable; nodes can keep being added to the cluster. A large known Flink cluster is in the thousands of nodes.

  9. Iterative Processing

    Spark: It iterates over its data in batches; each iteration has to be scheduled and executed as a separate job.

    Flink: It iterates over data using its streaming architecture and native iteration operators. Flink can be instructed to process only the parts of the data that have actually changed, which significantly improves job performance (see the iteration sketches after this list).

  10. Language Support

    Spark: It supports Java, Scala, Python, and R. Spark itself is implemented in Scala and provides APIs in Java, Python, and R.

    Flink: It supports Java, Scala, and Python. Flink is implemented in Java and also provides a Scala API.
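
To make the micro-batch vs. continuous-flow distinction in points 2 and 4 concrete, here is a minimal Spark Structured Streaming word count in Scala. It is only a sketch: the socket source on localhost:9999 and the 5-second trigger interval are illustrative assumptions, not something from the original answer.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.Trigger

    object SparkMicroBatchWordCount {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder
          .appName("MicroBatchWordCount")
          .master("local[*]")
          .getOrCreate()
        import spark.implicits._

        // Records arriving on the socket are buffered by the engine and
        // handed to the query one micro-batch at a time.
        val lines = spark.readStream
          .format("socket")
          .option("host", "localhost")
          .option("port", 9999)
          .load()

        val counts = lines.as[String]
          .flatMap(_.split("\\s+"))
          .groupBy("value")
          .count()

        // "Collect, then process": every 5 seconds the records gathered so
        // far are processed together as one small batch.
        val query = counts.writeStream
          .outputMode("complete")
          .format("console")
          .trigger(Trigger.ProcessingTime("5 seconds"))
          .start()

        query.awaitTermination()
      }
    }

The equivalent Flink job is a single continuous dataflow: each record flows through the operators as soon as it arrives, and there is no batch interval to configure (again a sketch, assuming the same hypothetical socket source):

    import org.apache.flink.streaming.api.scala._

    object FlinkContinuousWordCount {
      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment

        // Each incoming line is split, keyed, and counted per record;
        // the running count is updated continuously, not per batch.
        val counts = env
          .socketTextStream("localhost", 9999)
          .flatMap(_.split("\\s+"))
          .map((_, 1))
          .keyBy(_._1)
          .sum(1)

        counts.print()
        env.execute("ContinuousWordCount")
      }
    }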
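
On point 6, a short sketch of what "configurable" looks like on the Spark side: since Spark 1.6 the unified memory manager shares a single fraction of the heap between execution and storage, and the knobs below are usually left at their defaults. The values shown are illustrative, not recommendations.

    import org.apache.spark.sql.SparkSession

    object MemoryConfigExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder
          .appName("MemoryConfigExample")
          .master("local[*]")
          // Fraction of (heap - 300 MB) shared by execution and storage;
          // the unified memory manager moves memory between the two as needed.
          .config("spark.memory.fraction", "0.6")
          // Portion of the above that is protected from eviction for storage.
          .config("spark.memory.storageFraction", "0.5")
          .getOrCreate()

        println(spark.conf.get("spark.memory.fraction"))
        spark.stop()
      }
    }

On the Flink side, managed memory is configured per cluster (for example via taskmanager.memory.managed.fraction in flink-conf.yaml in recent releases) rather than tuned in job code, which is what "automatic memory management" refers to above.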
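
For point 7, Flink's snapshot-based fault tolerance is switched on per job by enabling checkpointing. This is a sketch only: the 10-second interval and the socket source are assumptions, and a production job would also configure a durable state backend.

    import org.apache.flink.streaming.api.CheckpointingMode
    import org.apache.flink.streaming.api.scala._

    object CheckpointedJob {
      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment

        // Inject a Chandy-Lamport-style barrier into the stream every 10 s;
        // operators snapshot their state as the barrier passes through them.
        env.enableCheckpointing(10000L, CheckpointingMode.EXACTLY_ONCE)

        env
          .socketTextStream("localhost", 9999)
          .map(_.toUpperCase)
          .print()

        env.execute("CheckpointedJob")
      }
    }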
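
Points 3 and 9 are easiest to see in code. In Spark, an iterative algorithm is typically a driver-side loop that schedules work pass by pass, whereas Flink's batch (DataSet) API, deprecated in recent releases but still the clearest illustration, has a native iterate operator that keeps the loop inside one dataflow. Both snippets are sketches with a made-up update step (scaling values by 0.9 for ten passes).

    import org.apache.spark.sql.SparkSession

    object SparkDriverLoop {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder
          .appName("SparkDriverLoop")
          .master("local[*]")
          .getOrCreate()

        var current = spark.sparkContext.parallelize(1 to 1000000).map(_.toDouble)
        for (_ <- 1 to 10) {
          // Each pass extends the lineage/DAG; the work is scheduled and
          // executed as ordinary batch stages driven from this loop.
          current = current.map(x => x * 0.9)
        }
        println(current.sum())
        spark.stop()
      }
    }

The Flink version expresses the same loop as a single operator in the job graph:

    import org.apache.flink.api.scala._

    object FlinkNativeIteration {
      def main(args: Array[String]): Unit = {
        val env = ExecutionEnvironment.getExecutionEnvironment
        val initial = env.fromCollection(1 to 1000000).map(_.toDouble)

        // The loop lives inside the dataflow: the runtime executes the step
        // function 10 times without scheduling a separate job per pass.
        val result = initial.iterate(10) { current =>
          current.map(x => x * 0.9)
        }

        result.reduce(_ + _).print()
      }
    }

Flink additionally offers delta iterations (iterateDelta in the DataSet API), which is what the remark in point 9 about processing only the data that has actually changed refers to.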
