Question

I'm using R to pull in data through an API and merge all of it into a single table, which I then write to a CSV file. To graph it properly in Tableau, however, I first have to run it through their reformatting tool for Excel, which converts it from a cross-tabulated format to one where each line contains only one piece of data. For example, taking something from the format:

ID,Gender,School,Math,English,Science
1,M,West,90,80,70
2,F,South,50,50,50

To:

ID,Gender,School,Subject,Score
1,M,West,Math,90
1,M,West,English,80
1,M,West,Science,70
2,F,South,Math,50
2,F,South,English,50
2,F,South,Science,50

Are there any existing tools in R or in an R library that would allow me to do this, or that would provide a starting point? I am trying to automate the preparation of data for Tableau so that I just need to run a single script to get it formatted properly, and would like to remove the manual Excel step if possible.

Solution

In R and several other programs, this process is referred to as "reshaping" data. In fact, the Tableau page that you originally linked to speaks of their "Excel Reshaper plugin".

In base R, there are a few functions to reshape data, such as the (notorious) reshape() function which takes panel data from a wide form to a long form, and stack() which creates skinny stacks of your data.

The "reshape2" package seems to be much more popular for such data transformations, though. Here's an example of "melting" your sample data, which I've stored in a data.frame named "mydf":

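library(reshape2)

# Sample data from the question, stored as "mydf"
mydf <- data.frame(ID = 1:2, Gender = c("M", "F"), School = c("West", "South"),
                   Math = c(90, 50), English = c(80, 50), Science = c(70, 50))

# Keep the identifier columns fixed and melt the subject columns into rows
melt(mydf, id.vars = c("ID", "Gender", "School"),
     value.name = "Score", variable.name = "Subject")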
#   ID Gender School Subject Score
# 1  1      M   West    Math    90
# 2  2      F  South    Math    50
# 3  1      M   West English    80
# 4  2      F  South English    50
# 5  1      M   West Science    70
# 6  2      F  South Science    50

For this example, base R's reshape() isn't really appropriate, but stack() is. Here, I've stacked just the last three columns:

stack(mydf[4:6])
#   values     ind
# 1     90    Math
# 2     50    Math
# 3     80 English
# 4     50 English
# 5     70 Science
# 6     50 Science

To get the data.frame you are looking for, you would cbind the first three columns with the above output.
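
As a minimal sketch of that cbind step, using base R only: the construction of "mydf" below is my assumption about how the sample data is stored, and the object name "long" is only illustrative. The renaming and reordering are there to match the column names and row order shown in the question.

```r
# Sample data from the question (assumed construction of "mydf")
mydf <- data.frame(
  ID = 1:2,
  Gender = c("M", "F"),
  School = c("West", "South"),
  Math = c(90, 50),
  English = c(80, 50),
  Science = c(70, 50)
)

# Stack the three score columns and bind the identifier columns alongside.
# The two identifier rows are recycled to line up with the six stacked rows.
long <- cbind(mydf[1:3], stack(mydf[4:6]))

# stack() names its columns "values" and "ind"; rename them to match the
# target layout from the question.
names(long)[names(long) == "values"] <- "Score"
names(long)[names(long) == "ind"] <- "Subject"

# Group the rows by ID and put the columns in the requested order.
long <- long[order(long$ID), c("ID", "Gender", "School", "Subject", "Score")]
rownames(long) <- NULL

long
```

From there, something like write.csv(long, "scores_long.csv", row.names = FALSE) would write the reshaped table out for Tableau; the file name is just a placeholder.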


For reference, Hadley Wickham's Tidy Data paper is a good entry point into thinking about how the structure of your data might facilitate further processing and visualization.

Licensed under: CC-BY-SA with attribution