Question

I'm wondering how other developers are setting up their local environments for working on Spark projects. Do you configure a 'local' cluster using a tool like Vagrant? Or, is it most common to SSH into a cloud environment, such as a cluster on AWS? Perhaps there are many tasks where a single-node cluster is adequate, and can be run locally more easily.

Solution

Spark is intended to be pointed at large distributed data sets, so, as you suggest, the most typical use cases involve connecting to some sort of cloud system like AWS.

In fact, if the data set you aim to analyze fits on your local system, you'll usually find that you can analyze it just as simply in plain Python. If you try to build a cluster out of a series of local VMs, they all share the host's memory, so you're going to run out pretty quickly and jobs will either fail or grind to a halt.

That said, a local instance of Spark is very useful for development.

One workflow that works for me: if I have a directory in HDFS with many files, I pull a single file over, develop against it locally, then port my Spark script to my cloud system for execution. If you're using AWS, this really helps avoid big fees while you're developing.
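A minimal sketch of that workflow in PySpark, assuming the only things that differ between your laptop and the cluster are the master URL and the input path. The flag names, the default path `data/sample.txt`, and the line-count job are placeholders for illustration, not part of the original answer.

```python
import argparse


def parse_args(argv=None):
    """Read the two settings that change between local and cluster runs."""
    parser = argparse.ArgumentParser(description="Count lines in a text data set")
    # "local[*]" runs Spark in-process using all available cores
    parser.add_argument("--master", default="local[*]")
    # Hypothetical default: one file copied out of the full HDFS directory
    parser.add_argument("--input", default="data/sample.txt")
    return parser.parse_args(argv)


def run(master, input_path):
    """Run the same job logic regardless of where Spark is running."""
    # Imported here so argument parsing can be used without Spark installed
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master(master)
        .appName("dev-sketch")
        .getOrCreate()
    )
    try:
        # Stand-in for the real analysis: count the lines of the input
        return spark.read.text(input_path).count()
    finally:
        spark.stop()
```

Locally you would call `run("local[*]", "data/sample.txt")`; on the cluster the same function can be pointed at the whole directory, for example an `hdfs://` path, with the master your cluster manager expects.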

Other tips

Personally, I don't use Vagrant with local provisioning. I have installed a Spark cluster locally without HDFS, which lets me experiment and develop easily without the overhead of a virtual machine.

HDFS is not a requirement for local clusters, and it is something of a system administration nightmare if you only need it for local testing.

Spark works fine with local file systems. Of course, you'll have to change those paths when you deploy to your cloud.
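One way to keep that porting step small is to build every path from a single base URI per environment. This is only a sketch: the environment names, the `/tmp/spark-dev` directory, and `example-bucket` are made up, and the right URI scheme for the cloud side depends on how your cluster is set up.

```python
# Hypothetical base locations; replace with your own directory and bucket
BASE_URIS = {
    "local": "file:///tmp/spark-dev",
    "cloud": "s3a://example-bucket/spark",
}


def data_path(env, relative):
    """Join the base URI for `env` with a path relative to the data root."""
    base = BASE_URIS[env].rstrip("/")
    return f"{base}/{relative.lstrip('/')}"
```

The job code then asks for `data_path(env, "input/events.csv")` everywhere, and switching environments means changing one value instead of editing each read and write.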

That said, you can use Vagrant with AWS provisioning to create a cluster for heavier testing.

Note: on AWS, data is typically kept in S3 rather than HDFS. HDFS on an AWS cluster is ephemeral: if you shut the cluster down, you'll lose all your computations. For persistence, you'll need to write back to S3.
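As a rough illustration of writing back to S3, the sketch below saves a DataFrame as Parquet under a per-run prefix. The bucket, job name, and `run=` layout are hypothetical, and the `s3a://` scheme assumes the Hadoop S3A connector is available on the cluster; some AWS setups use a different scheme.

```python
def output_uri(bucket, job_name, run_id):
    """Build an S3 URI for one run's results (all names are hypothetical)."""
    return f"s3a://{bucket}/{job_name}/run={run_id}"


def save_results(df, uri):
    """Persist a Spark DataFrame so it survives cluster shutdown."""
    # Overwrite any earlier output at this URI and store the data as Parquet
    df.write.mode("overwrite").parquet(uri)
    return uri
```

Calling `save_results(df, output_uri("example-bucket", "wordcount", "2024-01-01"))` as the last step of a job means the results remain available after the cluster and its HDFS are gone.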

License: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange