Question

I have a huge data frame like this:

head(newdata)
  V1   V2 V3 V4     V5    V6        V7 V8
1  a 1941  2 14 -73.90 38.60  US009239  4
2  b 1941  2 14 -74.00 36.90  US009239  6
3  c 1941  2 14 -74.00 35.40  US009239  4
5  d 1941  2 15 -74.00 32.60  US009239  7
6  f 1941  2 15 -73.80 31.70 US009239v  1

What I would like to do is perform some operation on every subset of rows that share the same value of V7. I tried splitting the data frame with

split(data, list(data$V7), drop = TRUE)

and then calculating the min and max of V8 for every element of the list, but it takes too much memory and is really slow.

How can I do it?


Solution

The following scheme may be helpful:

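indices <- seq_len(nrow(newdata))
# split the row indices by V7, not the data frame itself
groups <- split(indices, newdata$V7)
lapply(groups, function(idx) {
   # only one subset exists at a time
   subdata <- newdata[idx, ]
   # some operations on subdata...
})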

It prevents R from creating all the sub-data.frames at once, and thus may reduce memory usage. You may also try calling gc(TRUE) inside the function to force garbage collection on each iteration of lapply.

However, I'm conscious that this is not a highly elegant solution. :)
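As a minimal sketch, here is the index-splitting scheme above applied to the min and max of V8 from the question. The toy data frame only imitates the rows shown in the question, and the names `ranges` and `result` are assumptions made for illustration.

```r
# Toy data modelled on the rows shown in the question (illustrative values only).
newdata <- data.frame(
  V7 = c("US009239", "US009239", "US009239", "US009239", "US009239v"),
  V8 = c(4, 6, 4, 7, 1),
  stringsAsFactors = FALSE
)

# Split row indices, not the data frame, so no per-group copies are held at once.
groups <- split(seq_len(nrow(newdata)), newdata$V7)

# For each group, read only the V8 values it needs; range() returns c(min, max).
# vapply() checks that every group yields exactly two numbers.
ranges <- vapply(
  groups,
  function(idx) range(newdata$V8[idx]),
  FUN.VALUE = numeric(2)
)

# vapply() returns a 2 x n_groups matrix; transpose so each row is one V7 value.
result <- t(ranges)
colnames(result) <- c("Min", "Max")
print(result)
```

Indexing `newdata$V8[idx]` rather than `newdata[idx, ]` avoids building even a single sub-data.frame when only one column is needed.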

Other answers

Using data.table:

require(data.table)
setDT(data)[, list(Max=max(V8), Min=min(V8)), by=V7]
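A small self-contained sketch of the same call, assuming the data.table package is installed. The toy data frame `toy` and the result name `res` are made up for illustration and only imitate the rows in the question.

```r
library(data.table)

# Toy data modelled on the rows shown in the question (illustrative values only).
toy <- data.frame(
  V7 = c("US009239", "US009239", "US009239", "US009239", "US009239v"),
  V8 = c(4, 6, 4, 7, 1)
)

# setDT() converts `toy` to a data.table by reference (no copy is made),
# which matters when the data frame is huge.
setDT(toy)

# One grouped pass over V8; each distinct V7 value produces one output row.
res <- toy[, list(Max = max(V8), Min = min(V8)), by = V7]
print(res)
```

Because setDT() works by reference, `toy` itself is a data.table afterwards. If the original data.frame must stay untouched, as.data.table() can be used instead, at the cost of a copy.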

With dplyr you can do:

library(dplyr)
data %>% group_by(V7) %>% summarise(Max=max(V8), Min=min(V8))

hth

Licensed under: CC-BY-SA with attribution