Apply function to subset of data frame [duplicate]

https://stackoverflow.com/questions/23247251

08-07-2023

Question

I have a huge data frame like this

 head(newdata)
      V1 V2 V3 V4    V5    V6      V7      V8
1     a 1941 2 14 -73.90 38.60 US009239     4
2     b 1941 2 14 -74.00 36.90 US009239     6
3     c 1941 2 14 -74.00 35.40 US009239     4
5     d 1941 2 15 -74.00 32.60 US009239     7
6     f 1941 2 15 -73.80 31.70 US009239v    1

and what I would like to do is to perform some operation on every subset of data characterised by the same V7. I tried splitting it with

split(data, list(data$V7), drop = TRUE)

and then calculating the min and max of V8 for every element of the list, but it takes too much memory and is really slow.

How can I do it?

Solution

The following scheme may be helpful

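# Split the row indices (not the data frame itself) by the grouping column
indices <- seq_len(nrow(newdata))
groups <- split(indices, newdata$V7)
lapply(groups, function(idx) {
   # Materialise only one group's rows at a time
   subdata <- newdata[idx, ]
   # some operations on subdata...
})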

It prevents R from creating all the sub-data.frames at once, and thus may reduce memory usage. You may also try calling gc(TRUE) to force garbage collection between iterations of lapply.

However, I'm conscious that this is not a highly elegant solution. :)
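
To make the scheme above concrete for the min/max case from the question, here is a minimal sketch. The small `newdata` frame is made-up toy data (including the second station code), and `ranges` is just an illustrative name. It reads only the V8 values for each group rather than subsetting whole rows, which is an additional shortcut on top of the index-splitting idea:

```r
# Toy stand-in for the asker's data frame (values invented for illustration).
newdata <- data.frame(
  V7 = c("US009239", "US009239", "US009239", "US009240", "US009240"),
  V8 = c(4, 6, 4, 7, 1)
)

# Split row indices, not the data frame, by the grouping column.
groups <- split(seq_len(nrow(newdata)), newdata$V7)

# For each group, pull just the V8 values it needs and reduce them.
# vapply returns a 2 x n_groups matrix; t() puts one group per row.
ranges <- t(vapply(groups, function(idx) {
  v <- newdata$V8[idx]
  c(Min = min(v), Max = max(v))
}, c(Min = 0, Max = 0)))

print(ranges)
```

Using vapply instead of lapply here fixes the shape of each result in advance, so the per-group output comes back as a numeric matrix rather than a list.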

Other tips

Using data.table:

library(data.table)
setDT(data)[, list(Max=max(V8), Min=min(V8)), by=V7]

With dplyr you can do:

 library(dplyr)
 data %>% group_by(V7) %>% summarise(Max=max(V8), Min=min(V8))
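
If neither package is available, the same per-group summary can probably be had in base R with tapply. This is not from the answers above, just a sketch on made-up toy data; `mins`, `maxs`, and `result` are illustrative names:

```r
# Toy data (invented values) with the two columns used above.
data <- data.frame(
  V7 = c("US009239", "US009239", "US009240", "US009240"),
  V8 = c(4, 6, 7, 1)
)

# tapply reduces V8 within each distinct V7 and returns a named 1-d array.
mins <- tapply(data$V8, data$V7, min)
maxs <- tapply(data$V8, data$V7, max)

# Assemble one row per group, mirroring the Max/Min columns above.
# Both tapply calls order the groups identically (sorted V7 values).
result <- data.frame(
  V7  = names(mins),
  Max = as.vector(maxs),
  Min = as.vector(mins)
)

print(result)
```

Like the index-splitting scheme, this never builds a list of sub-data.frames, since tapply works on the V8 vector directly.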

hth

License: CC-BY-SA with attribution

Not affiliated with StackOverflow