Question

I have a huge data frame like this

 head(newdata)
      V1 V2 V3 V4    V5    V6      V7      V8
1     a 1941 2 14 -73.90 38.60 US009239     4
2     b 1941 2 14 -74.00 36.90 US009239     6
3     c 1941 2 14 -74.00 35.40 US009239     4
5     d 1941 2 15 -74.00 32.60 US009239     7
6     f 1941 2 15 -73.80 31.70 US009239v    1

and what I would like to do is to perform some operation on every subset of data characterised by the same V7. I tried splitting it with

split(data, list(data$V7), drop = TRUE)

and then calculating the min and max of V8 for every element of the list, but it takes too much memory and is really slow.

How can I do it?


Solution

The following scheme may be helpful

indices <- seq_len(nrow(newdata))
groups <- split(indices, newdata$V7)
lapply(groups, function(idx) {
   subdata <- newdata[idx, ]
   # some operations on subdata...
})

It prevents R from creating all the sub-data.frames at once, and thus may reduce memory usage. You may also try calling gc(TRUE) inside the function to force garbage collection on each iteration of lapply.

However, I'm conscious that this is not a highly elegant solution. :)
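
For the min/max case from the question, the index-splitting scheme above could look roughly like the sketch below. The toy data frame and the station ID US000001 are made up for illustration, and using vapply instead of lapply is just one possible choice, so treat this as a starting point rather than a tuned solution.

```r
# Toy data mirroring the question's layout (values are made up for illustration)
newdata <- data.frame(
  V7 = c("US009239", "US009239", "US009239", "US000001", "US000001"),
  V8 = c(4, 6, 4, 7, 1),
  stringsAsFactors = FALSE
)

# Split the row indices, not the data frame itself
groups <- split(seq_len(nrow(newdata)), newdata$V7)

# For each group, pull only the V8 values that are needed and compute min and max
res <- vapply(groups, function(idx) {
  v <- newdata$V8[idx]
  c(Min = min(v), Max = max(v))
}, numeric(2))

# vapply returns a 2 x n_groups matrix; transpose to get one row per V7 value
res <- t(res)
print(res)
```

Indexing the single column newdata$V8[idx] avoids building a full sub-data.frame for each group, which is where most of the memory goes when only one column is needed.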

OTHER TIPS

Using data.table:

library(data.table)
setDT(data)[, list(Max=max(V8), Min=min(V8)), by=V7]

With dplyr you can do:

library(dplyr)
data %>% group_by(V7) %>% summarise(Max=max(V8), Min=min(V8))

hth

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow