Maybe dplyr::tbl_cube?
Following on from @BrodieG's excellent answer, I think you may find it useful to look at the new functionality available in dplyr::tbl_cube. This is essentially a multidimensional object that you can easily create from a list of arrays (as you're currently using). It has some really good functions for subsetting, filtering and summarizing which (importantly, I think) are used consistently across the "cube" view and the "tabular" view of the data.
require(dplyr)
Couple of caveats:
It's an early release, with all the issues that go along with that
It's recommended for this version to unload plyr when dplyr is loaded
Loading arrays into cubes
Here's an example using arr as defined in the other answer:
# using arr from previous example
# we can convert it simply into a tbl_cube
arr.cube<-as.tbl_cube(arr)
arr.cube
#Source: local array [24 x 3]
#D: ser [chr, 3]
#D: smp [chr, 2]
#D: tr [chr, 4]
#M: arr [dbl[3,2,4]]
Note that D means dimension and M means measure, and you can have as many of each as you like.
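If you don't have the other answer to hand, something of the right shape can be sketched as below. The values are random and the seed is arbitrary, so the numbers won't match the output shown here; it's a stand-in, not the original arr. As far as I can tell, the important part is that the dimnames are *named*, because those names become the D entries of the cube.

```r
# Hypothetical stand-in for `arr`: 3 series x 2 samples x 4 trials.
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
arr <- array(
  runif(3 * 2 * 4),
  dim = c(3, 2, 4),
  dimnames = list(
    ser = paste("ser", 1:3),
    smp = paste("smp", 1:2),
    tr  = paste("tr", 1:4)
  )
)

dim(arr)              # one length per dimension
names(dimnames(arr))  # the names the cube would report as D: ser, smp, tr
```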
Easy conversion from multi-dimensional to flat
You can easily make the data tabular by returning it as a data.frame (which you can simply convert to a data.table if you need the functionality and performance benefits later):
head(as.data.frame(arr.cube))
# ser smp tr arr
#1 ser 1 smp 1 tr 1 0.6656456
#2 ser 2 smp 1 tr 1 0.6181301
#3 ser 3 smp 1 tr 1 0.7335676
#4 ser 1 smp 2 tr 1 0.9444435
#5 ser 2 smp 2 tr 1 0.8977054
#6 ser 3 smp 2 tr 1 0.9361929
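For comparison, base R can do a similar flattening on a plain named array with no package at all. This is only a rough equivalent for illustration (I'm not claiming it's what tbl_cube does internally), and the array is a made-up stand-in for arr:

```r
# Hypothetical stand-in for `arr`, with easy-to-check values.
arr <- array(
  seq_len(24) / 24,
  dim = c(3, 2, 4),
  dimnames = list(
    ser = paste("ser", 1:3),
    smp = paste("smp", 1:2),
    tr  = paste("tr", 1:4)
  )
)

# as.table() + as.data.frame() gives one row per cell, with the first
# dimension (ser) varying fastest, like the cube output above.
flat <- as.data.frame(as.table(arr), responseName = "arr")
head(flat)
```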
Subsetting
You could obviously flatten all data for every operation, but that has many implications for performance and utility. I think the real benefit of this package is that you can "pre-mine" the cube for the data that you require before converting it into a tabular format that is ggplot-friendly, e.g. simple filtering to return only series 1:
arr.cube.filtered<-filter(arr.cube,ser=="ser 1")
as.data.frame(arr.cube.filtered)
# ser smp tr arr
#1 ser 1 smp 1 tr 1 0.6656456
#2 ser 1 smp 2 tr 1 0.9444435
#3 ser 1 smp 1 tr 2 0.4331116
#4 ser 1 smp 2 tr 2 0.3916376
#5 ser 1 smp 1 tr 3 0.4669228
#6 ser 1 smp 2 tr 3 0.8942300
#7 ser 1 smp 1 tr 4 0.2054326
#8 ser 1 smp 2 tr 4 0.1006973
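The "subset first, flatten second" idea can also be sketched on a plain array in base R, which may help to show what the cube filter is doing conceptually. Again the array is a hypothetical stand-in, and this is an analogy rather than the dplyr implementation:

```r
# Hypothetical stand-in for `arr`.
arr <- array(
  seq_len(24),
  dim = c(3, 2, 4),
  dimnames = list(
    ser = paste("ser", 1:3),
    smp = paste("smp", 1:2),
    tr  = paste("tr", 1:4)
  )
)

# Subset while the data is still an array; drop = FALSE keeps all three
# dimensions, so the result is a 1 x 2 x 4 array rather than a matrix.
ser1 <- arr["ser 1", , , drop = FALSE]

# Only the 8 cells that survived the subset get flattened.
flat <- as.data.frame(as.table(ser1), responseName = "arr")
nrow(flat)
```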
tbl_cube currently works with the dplyr functions summarise(), select(), group_by() and filter(). Usefully, you can chain these together with the %.% operator (later dplyr releases replaced it with %>%).
For the rest of the examples, I'm going to use the inbuilt nasa tbl_cube object, which has a bunch of meteorological data (and demonstrates multiple dimensions and measures):
Grouping and summary measures
nasa
#Source: local array [41,472 x 4]
#D: lat [dbl, 24]
#D: long [dbl, 24]
#D: month [int, 12]
#D: year [int, 6]
#M: cloudhigh [dbl[24,24,12,6]]
#M: cloudlow [dbl[24,24,12,6]]
#M: cloudmid [dbl[24,24,12,6]]
#M: ozone [dbl[24,24,12,6]]
#M: pressure [dbl[24,24,12,6]]
#M: surftemp [dbl[24,24,12,6]]
#M: temperature [dbl[24,24,12,6]]
So here is an example showing how easy it is to pull back a subset of modified data from the cube, and then flatten it so that it's appropriate for plotting:
plot_data<-as.data.frame( # as.data.frame so we can see the data
filter(nasa,long<(-70)) %.% # filter long < (-70) (arbitrary!)
group_by(lat,long) %.% # group by lat/long combo
summarise(p.max=max(pressure), # create summary measures for each group
o.avg=mean(ozone),
c.all=(cloudhigh+cloudlow+cloudmid)/3)
)
head(plot_data)
# lat long p.max o.avg c.all
#1 36.20000 -113.8 975 310.7778 22.66667
#2 33.70435 -113.8 975 307.0833 21.33333
#3 31.20870 -113.8 990 300.3056 19.50000
#4 28.71304 -113.8 1000 290.3056 16.00000
#5 26.21739 -113.8 1000 282.4167 14.66667
#6 23.72174 -113.8 1000 275.6111 15.83333
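To make it clearer what grouping by lat/long does to a cube, here's a rough base R analogy on a toy array: the grouped dimensions are kept and everything else (month and year) is collapsed by the summary function. The array, its values and its dimension labels are all invented for illustration; it isn't the nasa data and it isn't how dplyr implements summarise().

```r
# Hypothetical toy measure: 2 lats x 3 longs x 4 months x 2 years.
pressure <- array(
  seq_len(2 * 3 * 4 * 2),
  dim = c(2, 3, 4, 2),
  dimnames = list(
    lat   = c("10", "20"),
    long  = c("-80", "-75", "-60"),
    month = as.character(1:4),
    year  = c("1995", "1996")
  )
)

# "filter(long < -70)": keep only the matching longitude slices.
keep <- as.numeric(dimnames(pressure)$long) < -70
sub  <- pressure[, keep, , , drop = FALSE]

# "group_by(lat, long) + summarise(max)": keep lat and long,
# collapse month and year with max().
p.max <- apply(sub, c("lat", "long"), max)

# Flatten the 2-d result for plotting.
as.data.frame(as.table(p.max), responseName = "p.max")
```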
Consistent notation for n-d and 2-d data structures
Sadly the mutate() function isn't yet implemented for tbl_cube, but it looks like that will just be a matter of (not much) time. You can use it (and all the other functions that work on the cube) on the tabular result, though, with exactly the same notation. For example:
plot_data.mod<-filter(plot_data,lat>25) %.% # filter out lat <=25
mutate(arb.meas=o.avg/p.max) # make a new column
head(plot_data.mod)
# lat long p.max o.avg c.all arb.meas
#1 36.20000 -113.8000 975 310.7778 22.66667 0.3187464
#2 33.70435 -113.8000 975 307.0833 21.33333 0.3149573
#3 31.20870 -113.8000 990 300.3056 19.50000 0.3033389
#4 28.71304 -113.8000 1000 290.3056 16.00000 0.2903056
#5 26.21739 -113.8000 1000 282.4167 14.66667 0.2824167
#6 36.20000 -111.2957 930 313.9722 20.66667 0.3376045
Plotting - as an example of R functionality that "likes" flat data
Then you can plot with ggplot(), using the benefits of flattened data:
# plot as you like (ggplot() lives in ggplot2, so load that first):
library(ggplot2)
ggplot(plot_data.mod) +
geom_point(aes(lat,long,size=c.all,color=c.all,shape=cut(p.max,6))) +
facet_grid( lat ~ long ) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Using data.table on the resulting flat data
I'm not going to expand on the use of data.table here, as it's done well in the previous answer. Obviously there are many good reasons to use data.table; for any situation here you can return one by a simple conversion of the data.frame:
data.table(as.data.frame(your_cube_name))
Working dynamically with your cube
Another thing I think is great is the ability to add measures (slices / scenarios / shifts, whatever you want to call them) to your cube. I think this will fit well with the method of analysis described in the question. Here's a simple example with arr.cube, adding an additional measure which is itself an (admittedly simple) function of the previous measure. You access/update measures through the syntax yourcube$mets[$...]:
head(as.data.frame(arr.cube))
# ser smp tr arr
#1 ser 1 smp 1 tr 1 0.6656456
#2 ser 2 smp 1 tr 1 0.6181301
#3 ser 3 smp 1 tr 1 0.7335676
#4 ser 1 smp 2 tr 1 0.9444435
#5 ser 2 smp 2 tr 1 0.8977054
#6 ser 3 smp 2 tr 1 0.9361929
arr.cube$mets$arr.bump<-arr.cube$mets$arr*1.1 #arb modification!
head(as.data.frame(arr.cube))
# ser smp tr arr arr.bump
#1 ser 1 smp 1 tr 1 0.6656456 0.7322102
#2 ser 2 smp 1 tr 1 0.6181301 0.6799431
#3 ser 3 smp 1 tr 1 0.7335676 0.8069244
#4 ser 1 smp 2 tr 1 0.9444435 1.0388878
#5 ser 2 smp 2 tr 1 0.8977054 0.9874759
#6 ser 3 smp 2 tr 1 0.9361929 1.0298122
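One thing to watch when adding measures this way is that a new measure needs the same shape as the existing dimensions. My understanding is that the cube is essentially a list with dims and mets components, so the idea can be sketched with a plain list. To be clear, the object below is a mock-up and add_measure() is a hypothetical helper of mine, not part of dplyr:

```r
# Plain-list mock-up of the cube layout (NOT a real tbl_cube).
dims <- list(
  ser = paste("ser", 1:3),
  smp = paste("smp", 1:2),
  tr  = paste("tr", 1:4)
)
shape <- unname(lengths(dims))  # 3, 2, 4
cube  <- list(
  dims = dims,
  mets = list(arr = array(seq_len(24) / 24, dim = shape))
)

# Hypothetical helper: refuse measures that don't match the dimensions.
add_measure <- function(cube, name, values) {
  expected <- unname(lengths(cube$dims))
  ok <- length(dim(values)) == length(expected) &&
    all(dim(values) == expected)
  if (!ok) stop("measure must have one extent per dimension, in order")
  cube$mets[[name]] <- values
  cube
}

cube <- add_measure(cube, "arr.bump", cube$mets$arr * 1.1)
names(cube$mets)
```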
Dimensions - or not ...
I've played a little with trying to dynamically add entirely new dimensions (effectively scaling up an existing cube with additional dimensions and cloning or modifying the original data using yourcube$dims[$...]) but have found the behaviour to be a little inconsistent. Probably best to avoid this anyway, and structure your cube first before manipulating it. Will keep you posted if I get anywhere.
Persistence
Obviously one of the main issues with having interpreter access to a multidimensional database is the potential to accidentally bugger it with an ill-timed keystroke. So I guess just persist early and often:
tempfilename<-gsub("[ :-]","",paste0("DBX",(Sys.time()),".cub"))
# save:
save(arr.cube,file=tempfilename)
# load:
load(file=tempfilename)
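One thing to be aware of with save()/load() is that load() restores the object under its original name, silently replacing anything in your workspace with that name. If that worries you, base R's saveRDS()/readRDS() let you pick the name on the way back in. A minimal sketch, using a plain list as a stand-in for the cube and a temp file as the destination:

```r
# Stand-in object; any R object (including a cube) could go here.
obj <- list(
  dims = list(ser = paste("ser", 1:3)),
  mets = list(arr = array(c(0.1, 0.2, 0.3), dim = 3))
)

path <- tempfile(fileext = ".rds")  # hypothetical location

saveRDS(obj, path)          # writes a single object, without its name
restored <- readRDS(path)   # you choose the name when reading it back

identical(obj, restored)
unlink(path)                # tidy up the temp file
```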
Hope that helps!