Question

R has many libraries aimed at data analysis (e.g. JAGS, BUGS, arules, etc.), and it is featured in popular textbooks such as J. Kruschke, Doing Bayesian Data Analysis, and B. Lantz, "Machine Learning with R".

I've seen a guideline of 5 TB for a dataset to be considered Big Data.

My question is: is R suitable for the amount of data typically seen in Big Data problems? Are there strategies to employ when using R with datasets of this size?


Solution

Actually, this is coming around. In the book R in a Nutshell there is even a section on using R with Hadoop for big data processing. There are some workarounds needed because R does all its work in memory, so you are essentially limited by the amount of RAM available to you.

A mature project for R and Hadoop is RHadoop.

RHadoop has been divided into several sub-projects: rhdfs, rhbase, rmr2, plyrmr, and quickcheck (wiki).
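To give a feel for the programming model, here is a minimal word-count sketch with rmr2; it is only an illustration and uses the local backend, so no cluster is needed to try it (the input strings are made up for the example).

```r
# Word count with rmr2 (RHadoop). Hedged sketch: the input lines are
# illustrative; on a real cluster the data would already live in HDFS.
library(rmr2)
rmr.options(backend = "local")   # switch to "hadoop" when a cluster is available

lines <- to.dfs(c("big data with r", "r and hadoop", "big data"))

counts <- mapreduce(
  input  = lines,
  map    = function(k, v) keyval(unlist(strsplit(v, " ")), 1),
  reduce = function(word, ones) keyval(word, sum(ones))
)

from.dfs(counts)
```

The same map and reduce functions run unchanged against HDFS once the backend is switched to "hadoop".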

OTHER TIPS

The main problem with using R for large data sets is the RAM constraint. The reason for keeping all the data in RAM is that it provides much faster access and manipulation than storing it on disk would. If you are willing to take a hit on performance, then yes, it is quite practical to work with large datasets in R.

  • RODBC package: allows connecting to an external database from R to retrieve and handle data. Only the data you actually pull back is constrained by your RAM; the overall data set can be much larger.
  • The ff package allows using larger-than-RAM data sets by utilising memory-mapped pages.
  • biglm: builds generalized linear models on big data by loading the data into memory in chunks (see the sketch after this list).
  • bigmemory: an R package which allows powerful and memory-efficient parallel analyses and data mining of massive data sets. It permits storing large objects (matrices, etc.) in memory (in RAM) using external pointer objects to refer to them.
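As an illustration of the chunked approach, here is a minimal biglm sketch; the file name, formula, and chunk size are assumptions made up for the example.

```r
# Fit a linear model in chunks with biglm so the whole CSV never has to
# sit in RAM at once. File, formula, and chunk size are illustrative.
library(biglm)

chunk_size <- 100000
con <- file("very_large_file.csv", open = "r")

# Fit on the first chunk, then fold the remaining chunks into the model.
first <- read.csv(con, nrows = chunk_size)
fit <- biglm(y ~ x1 + x2, data = first)

repeat {
  chunk <- tryCatch(
    read.csv(con, nrows = chunk_size, header = FALSE, col.names = names(first)),
    error = function(e) NULL   # read.csv errors once the file is exhausted
  )
  if (is.null(chunk)) break
  fit <- update(fit, chunk)    # update() folds a new chunk into the existing fit
}
close(con)

summary(fit)
```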

Some good answers here. I would like to join the discussion by adding the following three notes:

  1. The question's emphasis on the volume of data while referring to Big Data is certainly understandable and valid, especially considering that data volume growth is outpacing the exponential growth of technological capacity described by Moore's Law (http://en.wikipedia.org/wiki/Moore%27s_law).

  2. Having said that, it is important to remember the other aspects of the big data concept. Based on Gartner's definition (emphasis mine - AB): "Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization." (usually referred to as the "3Vs model"). I mention this because it forces data scientists and other analysts to look for and use R packages that focus on aspects of big data other than volume (enabled by the richness of the enormous R ecosystem).

  3. While existing answers mention some R packages related to big data, for more comprehensive coverage I'd recommend referring to the CRAN Task View "High-Performance and Parallel Computing with R" (http://cran.r-project.org/web/views/HighPerformanceComputing.html), in particular the sections "Parallel computing: Hadoop" and "Large memory and out-of-memory data".

R is great for "big data"! However, you need a workflow, since R is limited (with some simplification) by the amount of RAM in the operating system. The approach I take is to interact with a relational database (see the RSQLite package for creating and interacting with a SQLite database), run SQL-style queries to understand the structure of the data, and then extract particular subsets of the data for computationally intensive statistical analysis.
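A minimal sketch of that workflow might look like the following; the database file, table, and query are assumptions made up for the example.

```r
# SQLite-backed workflow: keep the full data on disk, query its structure,
# and pull only the subset needed for memory-intensive analysis.
# The file name, table, and columns are illustrative assumptions.
library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), "measurements.sqlite")

dbListTables(con)                                       # what is in the database?
dbGetQuery(con, "SELECT COUNT(*) AS n FROM readings")   # how big is it?

# Extract only the rows and columns the analysis actually needs
subset_df <- dbGetQuery(con, "
  SELECT sensor_id, value
    FROM readings
   WHERE sensor_id IN (1, 2, 3) AND value IS NOT NULL
")

fit <- lm(value ~ factor(sensor_id), data = subset_df)
summary(fit)

dbDisconnect(con)
```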

This is just one approach, however: there are packages that allow you to interact with other databases (e.g., MonetDB) or run analyses in R with fewer memory limitations (e.g., see pbdR).

Considering another criterion, I think that in some cases using Python may be far superior to R for Big Data. I know R is widely used in data science educational materials and that good data analysis libraries are available for it, but sometimes it just depends on the team.

In my experience, for people already familiar with programming, Python provides much more flexibility and a greater productivity boost than a language like R, which is not as well designed or as powerful as Python considered purely as a programming language. As evidence, in a data mining course at my university, the best final project was written in Python, although the other students had access to R's rich data analysis libraries. That is, sometimes the overall productivity (considering learning materials, documentation, etc.) with Python may be better than with R even in the absence of special-purpose data analysis libraries for Python. Also, there are some good articles explaining the fast pace of Python in data science: Python Displacing R and Rich Scientific Data Structures in Python, which may soon fill the gap in libraries available relative to R.

Another important reason for not using R is that when working with real-world Big Data problems, as opposed to purely academic problems, there is much need for other tools and techniques, such as data parsing, cleaning, visualization, web scraping, and many others, which are much easier in a general-purpose programming language. This may be why the default language used in many Hadoop courses (including Udacity's online course) is Python.

Edit:

Recently, DARPA also invested $3 million to help fund Python's data processing and visualization capabilities for big data jobs, which is clearly a sign of Python's future in Big Data. (details)

R is great for a lot of analysis. As mentioned above, there are newer adaptations for big data like MapR, RHadoop, and scalable versions of RStudio.

However, if your concern is libraries, keep your eye on Spark. Spark was created for big data and is MUCH faster than Hadoop alone. It has rapidly growing machine learning, SQL, streaming, and graph libraries, allowing much if not all of the analysis to be done within one framework (with multiple language APIs; I prefer Scala) without having to shuffle between languages and tools.

As other answers have noted, R can be used along with Hadoop and other distributed computing platforms to scale it up to the "Big Data" level. However, if you're not wedded to R specifically, but are willing to use an "R-like" environment, Incanter is a project that might work well for you, as it is native to the JVM (based on Clojure) and doesn't have the "impedance mismatch" between itself and Hadoop that R has. That is to say, from Incanter, you can invoke Java-native Hadoop/HDFS APIs without needing to go through a JNI bridge or anything.

I am far from an expert, but my understanding of the subject tells me that R (superb for statistics) and e.g. Python (superb at several of the things where R is lacking) complement each other quite well (as pointed out by previous posts).

I think that there is actually a plethora of tools for working with big data in R. sparklyr will be a great player in that field. sparklyr is an R interface to Apache Spark that allows connecting to local and remote clusters, providing a dplyr back-end. One can also rely on Apache Spark's machine learning libraries. Furthermore, parallel processing is possible with several packages such as Rmpi and snow (user controlled) or doMC/foreach (system based). A minimal sparklyr sketch is shown below.
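The sketch assumes a local Spark installation and uses the built-in mtcars data purely for illustration.

```r
# sparklyr: dplyr verbs are translated to Spark SQL and executed on the
# cluster; only small results are collected back into R. The local master
# and the mtcars data are illustrative assumptions.
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# In real use the data would be read from HDFS/S3/Parquet rather than copied
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()                     # bring only the aggregated result into R

# Spark MLlib through sparklyr
fit <- ml_linear_regression(mtcars_tbl, mpg ~ wt + cyl)
summary(fit)

spark_disconnect(sc)
```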

Licensed under: CC-BY-SA with attribution