Question

I have a list of data frames of the following form:

str(mylist)
List of 2
 $ df1:'data.frame':    50 obs. of  4 variables:
  ..$ var1: num [1:50] 0.114 0.622 0.609 0.623 0.861 ...
  ..$ var2: num [1:50] -1.221 1.819 0.195 1.232 0.786 ...
  ..$ var3: num [1:50] -0.14 -1.003 -0.352 0.647 0.424 ...
  ..$ Y   : num [1:50] -1.24 1.38 0.3 2.44 2.09 ...
 $ df2:'data.frame':    50 obs. of  4 variables:
  ..$ var1: num [1:50] 0.114 0.622 0.609 0.623 0.861 ...
  ..$ var2: num [1:50] -1.221 1.819 0.195 1.232 0.786 ...
  ..$ var3: num [1:50] -0.14 -1.003 -0.352 0.647 0.424 ...
  ..$ Y   : num [1:50] -1.24 1.38 0.3 2.44 2.09 ...
 - attr(*, "class")= chr [1:2] "mi" "list"

I am trying to return the means of the data frames in the list corresponding to the correct variable, also as a data frame, to look like:

> str(dfnew)
'data.frame':   50 obs. of  4 variables:
 $ var1: num  0.114 0.622 0.609 0.623 0.861 ...
 $ var2: num  -1.221 1.819 0.195 1.232 0.786 ...
 $ var3: num  -0.14 -1.003 -0.352 0.647 0.424 ...
 $ Y   : num  -1.24 1.38 0.3 2.44 2.09 ...

So, something that does...

dfnew[1,1] <- mean(c(mylist[[1]]$var1[1], mylist[[2]]$var1[1]), na.rm=TRUE)
dfnew[2,1] <- mean(c(mylist[[1]]$var1[2], mylist[[2]]$var1[2]), na.rm=TRUE)
...
dfnew[50,1] <- mean(c(mylist[[1]]$var1[50], mylist[[2]]$var1[50]), na.rm=TRUE)
...
dfnew[1,2] <- mean(c(mylist[[1]]$var2[1], mylist[[2]]$var2[1]), na.rm=TRUE)
...
dfnew[50,4] <- mean(c(mylist[[1]]$Y[50], mylist[[2]]$Y[50]), na.rm=TRUE)

I can see how I would do this with a for loop...

...or by creating data frames of each variable,

var1df <- cbind(df1$var1, df2$var1)
var2df <- cbind(df1$var2, df2$var2) # and if there are up to var1000?...
...
dfnew$var1 <- rowMeans(var1df)
dfnew$var2 <- rowMeans(var2df)
...

but that's more copying than I'd like and seems less than idiomatic R; so I'm trying to do it with one of the apply functions.

Since this is a list, lapply seemed right, except that it seems to go across the wrong margin---that is, it's mean-ing within the list, rather than the mean across the lists.

> lapply(mylist, FUN=mean)
$df1
[1] NA

$df2
[1] NA

Warning messages:
1: In mean.default(X[[1L]], ...) :
  argument is not numeric or logical: returning NA
2: In mean.default(X[[2L]], ...) :
  argument is not numeric or logical: returning NA

There's no setting in lapply for the other margin, cross-list rather than in-list.

And regular apply, which lets me set a margin, is upset that this is a list rather than a matrix or data frame.

> apply(mylist, MARGIN = 2, FUN=mean)
Error in apply(mylist, MARGIN = 2, FUN = mean) : 
  dim(X) must have a positive length

(My actual list has a lot more than 2 data frames, so a lot of the easier loopy or merge-y solutions get kind of hairy pretty quickly---or at least I'm too clumsy with the loop over getattribute stuff to know how to do it cleanly for length N.)

Is there something I'm missing in one of the rapply, tapply, eapply, *apply functions that would solve this, or something in general I'm being dumb about?

UPDATE

Thanks everyone for the helpful answers. I ran across this problem when I was testing out the Amelia libraries for multiple imputation and wanted to look at what the spread of the moments of the simulations were from the long-term means. (The object they return is shaped like this, and has the properties described above of corresponding to the original data frame, and with no missing data.)

Here's a gist I put together fiddling with it.

I like that user20650's answer did not require additional copying (imputer2 in the gist), so when I started expanding to a list of 1000, it became significantly faster than the ones that required merging new data frames.

What was kind of quirky, and which I haven't entirely resolved, is that running imputer1 versus imputer2 produced values that looked identical but for which a == b was FALSE. I assume it's a round-off issue.
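On the a == b mismatch: a minimal sketch, assuming the two functions differ only in the order of their floating-point operations (the gist is not reproduced here, so this is a guess at the cause). The same three numbers added in a different order can differ in the last bit, so == reports FALSE while all.equal, which compares within a small tolerance, reports TRUE.

```r
# The same three numbers, summed in two different orders
a <- (0.1 + 0.2) + 0.3
b <- 0.1 + (0.2 + 0.3)

exact  <- a == b                   # FALSE: the results differ in the last bit
approx <- isTRUE(all.equal(a, b))  # TRUE: equal within numeric tolerance
```

all.equal also accepts two data frames, so isTRUE(all.equal(df_a, df_b)) is the usual way to compare two averaged results (df_a and df_b are placeholder names).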

I'm also still looking for a way to apply general functions like mean or sd over this construct (without copying) rather than computing them itemwise, but anyway my problem is solved and I'll leave that to another question.

Solution

# data
l <- list(df1 = mtcars[1:5,1:5] , df2 = mtcars[1:5,1:5], df3 = mtcars[1:5,1:5])

# note you can just add dataframes eg
o1 <- (l[[1]] + l[[2]] + l[[3]])/3

# So if you have many df in list - to get the average by summing and dividing by list length
f <- function(x) Reduce("+", x)
o2 <- f(l)/length(l)

all.equal(o1,o2)
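The question's pseudo-code uses na.rm, whereas the Reduce("+", ...) approach above propagates any NA into the result. A possible NA-aware variant is sketched below, assuming all data frames are numeric with the same shape; the names l_na, mats, sums, counts and o3 are illustrative only. It sums the values with NA treated as 0 and divides by the number of non-missing values per cell.

```r
# Hypothetical example: two copies of the same slice, one value set to NA
l_na <- list(df1 = mtcars[1:5, 1:5], df2 = mtcars[1:5, 1:5])
l_na$df2[1, "mpg"] <- NA

# Work on numeric matrices so the arithmetic is plain element-wise
mats <- lapply(l_na, as.matrix)

# Per-cell sum, counting NA as 0
sums <- Reduce(`+`, lapply(mats, function(m) { m[is.na(m)] <- 0; m }))
# Per-cell number of non-missing values
counts <- Reduce(`+`, lapply(mats, function(m) !is.na(m)))

# Per-cell mean over the non-missing values (NaN if a cell is NA everywhere)
o3 <- as.data.frame(sums / counts)
```

This still builds intermediate copies, so on a long list it will be slower than the plain Reduce version.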

Other answers

Yet another option, which converts the list l to an array a (using an approach suggested here) and applies mean over the first two dimensions. This assumes all data frames in l have consistent structure. Here I again use @user20650's example list.

l <- list(df1=mtcars[1:5, 1:5], df2=mtcars[1:5, 1:5], df3=mtcars[1:5, 1:5])
a <- array(unlist(l), dim=c(nrow(l[[1]]), ncol(l[[1]]), length(l)), 
           dimnames=c(dimnames(l[[1]]), list(names(l))))
apply(a, 1:2, mean)

                   mpg cyl disp  hp drat
Mazda RX4         21.0   6  160 110 3.90
Mazda RX4 Wag     21.0   6  160 110 3.90
Datsun 710        22.8   4  108  93 3.85
Hornet 4 Drive    21.4   6  258 110 3.08
Hornet Sportabout 18.7   8  360 175 3.15
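Because apply takes any function, the same array can also answer the follow-up about sd. A minimal sketch, assuming a hypothetical list l_var whose data frames differ by a constant shift so that the spread is not zero (it does copy the data into an array, which the asker wanted to avoid):

```r
# Hypothetical list: the same 5x5 slice of mtcars shifted by 0, 1 and 2
l_var <- lapply(0:2, function(k) mtcars[1:5, 1:5] + k)
names(l_var) <- paste0("df", 1:3)

# Stack the data frames into a rows x columns x data-frames array
a_var <- array(unlist(l_var),
               dim = c(nrow(l_var[[1]]), ncol(l_var[[1]]), length(l_var)),
               dimnames = c(dimnames(l_var[[1]]), list(names(l_var))))

cell_mean <- apply(a_var, 1:2, mean)  # per-cell mean across the data frames
cell_sd   <- apply(a_var, 1:2, sd)    # per-cell standard deviation
```

With shifts of 0, 1 and 2, every cell of cell_sd is 1 and cell_mean is the original slice plus 1.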

Try to merge and then calculate your means:

df <- Reduce(rbind, lapply(mylist, function(df) {
  df$id <- seq_len(nrow(df))
  df
}))
df <- aggregate(. ~ id, df, mean)[, -1]

Example

mylist <- lapply(seq_len(3), function(x) iris[, 1:4] + runif(1, 0, 1))
sapply(seq_len(3), function(i) mylist[[i]][1,1])
# [1] 5.368424 6.097071 5.681132
# Apply above code
head(df)
#   Sepal.Length Sepal.Width Petal.Length Petal.Width
# 1     5.715542    4.115542     2.015542   0.8155424
# 2     5.515542    3.615542     2.015542   0.8155424
# 3     5.315542    3.815542     1.915542   0.8155424
# 4     5.215542    3.715542     2.115542   0.8155424
# 5     5.615542    4.215542     2.015542   0.8155424
# 6     6.015542    4.515542     2.315542   1.0155424

Note that mean(c(5.368424, 6.097071, 5.681132)) = 5.715542.

Here is an option with mapply:

as.data.frame(mapply(function(a, b) (a + b) / 2, df.lst[[1]], df.lst[[2]]))

This will work for any number of columns. mapply will cycle through each column from each data frame pairwise.

Here is the data we used:

df.lst <- replicate(2, data.frame(var1=runif(10), var2=sample(1:10)), simplify=FALSE)
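The mapply call above names exactly two data frames. One way to extend it to a list of any length is to pass the whole list through do.call; this is a sketch, with df.lst3 and avg_cols as made-up names, assuming every data frame has the same columns in the same order.

```r
# Hypothetical list of three data frames with the same columns
set.seed(1)
df.lst3 <- replicate(3, data.frame(var1 = runif(10), var2 = runif(10)),
                     simplify = FALSE)

# Average any number of equally long vectors element by element
avg_cols <- function(...) rowMeans(cbind(...))

# Same as mapply(avg_cols, df.lst3[[1]], df.lst3[[2]], df.lst3[[3]]);
# unname() keeps list names from being matched to mapply's own arguments
res <- as.data.frame(do.call(mapply, c(list(FUN = avg_cols), unname(df.lst3))))
```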

I think the previous answers will fail (mine certainly does) if some of the variables differ between the data frames or if they are in a different order. The function below is rather horrible, but it seems to work.

l <- list(df1 = mtcars[1:5,1:5] , df2 = mtcars[1:5,1:5], df3 = mtcars[1:5,1:5])

# Allow for different variables
l2 <- list(df1 = mtcars[1:5,1:5] , df2 = mtcars[1:5,2:6], df3 = mtcars[1:5,4:7])

new.f <- function(lst) {
  l <- lst
  un.nm <- unique(unlist(lapply(l, names)))
  o <- lapply(un.nm, function(x) {
    lapply(l, function(z) {
      if (x %in% names(z)) z[x] else NA
    })
  })
  # combine for each variable
  l <- lapply(o, function(x) do.call(cbind, x))
  mn <- lapply(l, rowMeans, na.rm = TRUE)
  names(mn) <- lapply(l, function(i) unique(names(i)[names(i) %in% un.nm]))
  data.frame(do.call(cbind, mn))
}


all.equal(f(l)/length(l) , new.f(l))

f(l2) # fails
# Error in Ops.data.frame(init, x[[i]]) : 
  #+ only defined for equally-sized data frames

new.f(l2)

EDIT

This example here Join matrices by both colnames and rownames in R offers a much more concise way to do this if there are different columns in each list element.

l <- lapply(l2 , function(i) as.data.frame(as.table(as.matrix(i))))
tmp <- do.call(rbind , l)
tmp <- aggregate(Freq ~ Var1 + Var2, tmp, mean)
xtabs(Freq ~ Var1 + Var2, tmp)

Tested with @user20650's example. The mean of two equal numbers should be the same number.

as.data.frame(setNames(
  lapply(names(mylist[[1]]), function(nm) {
    rowMeans(cbind(mylist[[1]][[nm]], mylist[[2]][[nm]]))
  }),
  names(mylist[[1]])
))
#--------------
   mpg cyl disp  hp drat
1 21.0   6  160 110 3.90
2 21.0   6  160 110 3.90
3 22.8   4  108  93 3.85
4 21.4   6  258 110 3.08
5 18.7   8  360 175 3.15

You read R code from the inside out: For each column name we are using numeric indices to get the dataframes and character indexing to get the columns, which are then 'c-bound' together and passed to rowMeans. This list of rowMean-ed values is then given names with setNames and finally converted to a dataframe.

Note that this does not get all of the dataframes in a list of more than two... only the first two are considered.

> str(mylist)
List of 3
 $ df1:'data.frame':    5 obs. of  5 variables:
  ..$ mpg : num [1:5] 21 21 22.8 21.4 18.7
  ..$ cyl : num [1:5] 6 6 4 6 8
  ..$ disp: num [1:5] 160 160 108 258 360
  ..$ hp  : num [1:5] 110 110 93 110 175
  ..$ drat: num [1:5] 3.9 3.9 3.85 3.08 3.15
 $ df2:'data.frame':    5 obs. of  5 variables:
  ..$ mpg : num [1:5] 21 21 22.8 21.4 18.7
  ..$ cyl : num [1:5] 6 6 4 6 8
  ..$ disp: num [1:5] 160 160 108 258 360
  ..$ hp  : num [1:5] 110 110 93 110 175
  ..$ drat: num [1:5] 3.9 3.9 3.85 3.08 3.15
 $ df3:'data.frame':    5 obs. of  5 variables:
  ..$ mpg : num [1:5] 21 21 22.8 21.4 18.7
  ..$ cyl : num [1:5] 6 6 4 6 8
  ..$ disp: num [1:5] 160 160 108 258 360
  ..$ hp  : num [1:5] 110 110 93 110 175
  ..$ drat: num [1:5] 3.9 3.9 3.85 3.08 3.15
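
To average over every data frame in the list rather than only the first two, the cbind of two hard-coded elements can be replaced by a cbind over the whole list. A sketch under the assumption that all data frames share the column names of the first one; mylist3, nms and res_all are illustrative names.

```r
# Same shape as the list shown above: three identical slices of mtcars
mylist3 <- list(df1 = mtcars[1:5, 1:5], df2 = mtcars[1:5, 1:5],
                df3 = mtcars[1:5, 1:5])

nms <- names(mylist3[[1]])
res_all <- as.data.frame(setNames(
  lapply(nms, function(nm) {
    # one column per data frame for this variable, then average row by row
    rowMeans(do.call(cbind, lapply(mylist3, `[[`, nm)))
  }),
  nms
))
```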
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow