Question

I am trying to parse files with Stanford NLP inside a mapper function in Spark. How do I set the number of mappers in Apache Spark? Please help me.


Solution

Spark determines the number of "mappers" automatically from the number of partitions your data is in. You can call getNumPartitions on your data source (an RDD, or the underlying RDD of a DataFrame) to see how many partitions there are, then use repartition to scale this up or coalesce to scale this down (repartition can also reduce the count, but it is slower because it performs a full shuffle). Repartitioning is expensive, however, and should be avoided when unnecessary.
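
As a minimal sketch (Scala; the input path, the app name and the target partition counts below are placeholders), inspecting and changing the partition count looks like this:

    // Minimal sketch: inspect and change the number of partitions of an RDD.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("PartitionDemo").getOrCreate()
    val rdd = spark.sparkContext.textFile("hdfs:///data/input")  // placeholder path

    println(rdd.getNumPartitions)         // current partition count = number of parallel map tasks

    val scaledUp   = rdd.repartition(100) // full shuffle; can increase or decrease the count
    val scaledDown = rdd.coalesce(10)     // avoids a full shuffle; can only decrease the count

For a DataFrame, df.rdd.getNumPartitions gives the same information.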

OTHER TIPS

In Spark there is no notion of "mappers" or "reducers". Each task you run in Spark is executed by an executor (a JVM with allocated resources). An executor can also use multiple cores to run several tasks in parallel.

In order to improve performance with Spark in cluster mode, you can tune the following parameters:

  • The number of executors - the more you have, the more tasks you can run at the same time. Keep in mind that the number of executors you can create is limited by your hardware.
  • The number of cores per executor - an executor with 5 cores can run 5 simple tasks at the same time.
  • The amount of memory - for each executor and the driver.
  • The number of partitions in your dataset.

To give you an example, suppose you have a dataset with 30 partitions. Each stage of a Spark job on that dataset therefore spawns 30 tasks. You have to choose the number of executors, the number of cores and the amount of memory. Keep in mind that each core of each executor can run one task at a time.

An example configuration could be (see the sketch after this list):

  • 6 executors.
  • 5 cores per executor.
  • 2G of memory per executor.
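
As a rough Scala sketch of applying that configuration (the app name is a placeholder; these settings can equally be passed as spark-submit options such as --num-executors, --executor-cores and --executor-memory):

    // Hypothetical sketch matching the 6 executors / 5 cores / 2G example above.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("NlpJob")                        // placeholder app name
      .config("spark.executor.instances", "6")  // number of executors
      .config("spark.executor.cores", "5")      // cores per executor
      .config("spark.executor.memory", "2g")    // memory per executor
      .getOrCreate()

    // Driver memory is best set on spark-submit (--driver-memory), since the
    // driver JVM is already running by the time this code executes.

With 6 executors of 5 cores each, up to 30 tasks can run concurrently, which matches the 30 partitions in the example above.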

If you set too few executors and cores, you will see high computation latencies. However, if you allocate too many executors and cores, some of them will sit idle because there will be too few partitions to keep every core busy.

Here are some links on how to tune your Spark context:

https://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/
https://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange