Question

I'm trying to apply classification algorithms to the KDD Cup 2012 track 2 data (http://www.kddcup2012.org/c/kddcup2012-track2) using R.

It doesn't seem possible to work with this 10 GB training set on my local system, which has only 4 GB of RAM. Can anyone work on this data with such a machine, or is using a cluster the norm?
It would be great if anyone could point me to guidance on how to get started with a cluster and the type of cluster normally used for such tasks.


Solution

I think you have at least the following major options for your data analysis scenario:

  1. Use big data-enabling R packages on your local system. You can find most of them via the corresponding CRAN Task View that I reference in this answer (see point #3). A minimal sketch of this approach appears after this list.

  2. Use the same packages on a public cloud infrastructure, such as Amazon Web Services (AWS) EC2. If your analysis is non-critical and tolerant of potential restarts, consider using AWS Spot Instances, whose pricing allows for significant savings.

  3. Use the above-mentioned public cloud option with the standard R platform, but on more powerful instances (for example, on AWS you can opt for memory-optimized EC2 instances or general-purpose on-demand instances with more memory).
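To make option 1 concrete, here is a minimal sketch of chunked, out-of-core model fitting with the biglm package. The column names and tab-separated layout of training.txt are assumptions (check the official data description), and a simple linear model on depth and position stands in for whatever classifier you ultimately choose:

```r
library(biglm)

## Assumed (hypothetical) column names for training.txt -- verify
## against the official KDD Cup 2012 track 2 data description.
cols <- c("click", "impression", "display_url", "ad_id", "advertiser_id",
          "depth", "position", "query_id", "keyword_id", "title_id",
          "description_id", "user_id")
chunk_size <- 500000L

con <- file("training.txt", open = "r")

## Return the next chunk of rows, or NULL at end of file
## (read.table signals an error when no lines remain).
read_chunk <- function() {
  tryCatch(
    read.table(con, nrows = chunk_size, sep = "\t",
               col.names = cols, colClasses = "numeric"),
    error = function(e) NULL)
}

## Fit on the first chunk, then fold each remaining chunk into the model.
fit <- biglm(click ~ depth + position, data = read_chunk())
repeat {
  chunk <- read_chunk()
  if (is.null(chunk) || nrow(chunk) == 0L) break
  fit <- update(fit, chunk)
}
close(con)
summary(fit)
```

Because biglm accumulates the model through an incremental QR decomposition, memory use depends on the number of predictors rather than the number of rows; bigglm from the same package offers the analogous interface for logistic regression, which is a more natural fit for click prediction.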

In some cases, it is possible to tune a local system (or a cloud on-demand instance) so that R can work with big(ger) data sets, for example by keeping the data on disk rather than in RAM, as sketched below. For some help in this regard, see my relevant answer.
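As one illustration, here is a sketch using the ff and ffbase packages, which keep the data in memory-mapped files on disk so that only small chunks enter RAM at a time. The file name and column names are the same assumptions as in the previous sketch:

```r
library(ff)
library(ffbase)

## Same assumed (hypothetical) column layout as above.
cols <- c("click", "impression", "display_url", "ad_id", "advertiser_id",
          "depth", "position", "query_id", "keyword_id", "title_id",
          "description_id", "user_id")

## The data are stored in ff files on disk, not loaded into RAM.
dat <- read.table.ffdf(file = "training.txt", sep = "\t", header = FALSE,
                       col.names = cols, colClasses = "numeric")

## ffbase supplies chunked implementations of common summaries,
## so this aggregates the full file without holding it in memory.
overall_ctr <- sum(dat$click) / sum(dat$impression)
```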

For both of the above-mentioned cloud (AWS) options, you may find it more convenient to use pre-built, R-focused VM images. See my relevant answer for details. You may also find this excellent comprehensive list of big data frameworks useful.

Licensed under: CC-BY-SA with attribution