Question

I have a huge data frame like this:

 head(newdata)
      V1 V2 V3 V4    V5    V6      V7      V8
1     a 1941 2 14 -73.90 38.60 US009239     4
2     b 1941 2 14 -74.00 36.90 US009239     6
3     c 1941 2 14 -74.00 35.40 US009239     4
5     d 1941 2 15 -74.00 32.60 US009239     7
6     f 1941 2 15 -73.80 31.70 US009239v    1

What I would like to do is perform some operation on every subset of the data that shares the same V7 value. I tried splitting it with

split(data, list(data$V7), drop = TRUE)

and then calculating the min and max of V8 for every element of the list, but it takes too much memory and is really slow.

How can I do it?

Solution

The following scheme may be helpful:

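# Split the row indices by V7 instead of splitting the data frame itself
indices <- seq_len(nrow(newdata))
groups <- split(indices, newdata$V7)
lapply(groups, function(idx) {
   # Only one sub-data.frame exists at a time
   subdata <- newdata[idx, ]
   # some operations on subdata...
})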

This prevents R from creating all the sub-data.frames at once, and thus may reduce memory usage. You may also try calling gc(TRUE) inside the function to force garbage collection on each iteration of lapply.

However, I'm conscious that this is not a highly elegant solution. :)
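For the min/max case from the question, the index-splitting scheme could look roughly like the sketch below. The toy newdata and the name summary_by_v7 are made up for illustration and are not from the original post; only the V8 values of each group are extracted, so no sub-data.frame is built at all.

```r
# Toy data standing in for the real data frame (values are illustrative only).
newdata <- data.frame(
  V7 = c("US009239", "US009239", "US009239", "US009240", "US009240"),
  V8 = c(4, 6, 4, 7, 1),
  stringsAsFactors = FALSE
)

# Split the row indices by V7, not the data frame itself.
groups <- split(seq_len(nrow(newdata)), newdata$V7)

# For each group, pull only the V8 values that belong to it.
res <- lapply(groups, function(idx) {
  v8 <- newdata$V8[idx]
  c(Min = min(v8), Max = max(v8))
})

# Bind the per-group results into one matrix, one row per V7 value.
summary_by_v7 <- do.call(rbind, res)
print(summary_by_v7)
```

Whether this is fast enough on the real data depends on the number of distinct V7 values, so it is worth timing on a sample first.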

Other suggestions

Using data.table:

library(data.table)
# setDT() converts the data frame to a data.table by reference, without copying it
setDT(data)[, list(Max = max(V8), Min = min(V8)), by = V7]

With dplyr you can do:

library(dplyr)
data %>% group_by(V7) %>% summarise(Max = max(V8), Min = min(V8))

hth

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow