Question

I usually combine colwise and tapply to calculate grouped values in a data frame. However, I unexpectedly found that passing tapply's FUN argument by name does not work with colwise from plyr. The example is as follows:

Data:

df <- data.frame(a = 1:10, b = rep(1:2, each = 5), c = 2:11)

Normal:

library(plyr)
colwise(tapply)(subset(df, select = c(a, c)), df$b, function(x){sum(x[x > 2])})

The code above works as expected. But if I pass the function by name as FUN, it fails:

colwise(tapply)(subset(df, select = c(a, c)), df$b, FUN = function(x){sum(x[x > 2])})

Error is:

Error in FUN(X[[1L]], ...) : 
  unused arguments (function (X, INDEX, FUN = NULL, ..., simplify = TRUE) 
{
    FUN <- if (!is.null(FUN)) match.fun(FUN)
    if (!is.list(INDEX)) INDEX <- list(INDEX)
    nI <- length(INDEX)
    if (!nI) stop("'INDEX' is of length zero")
    namelist <- vector("list", nI)
    names(namelist) <- names(INDEX)
    extent <- integer(nI)
    nx <- length(X)
    one <- 1
    group <- rep.int(one, nx)
    ngroup <- one
    for (i in seq_along(INDEX)) {
        index <- as.factor(INDEX[[i]])
        if (length(index) != nx) stop("arguments must have same length")
        namelist[[i]] <- levels(index)
        extent[i] <- nlevels(index)
        group <- group + ngroup * (as.integer(index) - one)
        ngroup <- ngroup * nlevels(index)
    }
    if (is.null(FUN)) return(group)
    ans <- lapply(X = split(X, group), FUN = FUN, ...)
    index <- as.integer(names(ans))
    if (simplify && all(unlist(lapply(ans, length)) == 1)) {
        ansmat <- array(dim = extent, dimnames = namelist)

Could anyone explain the reason? Thank you in advance.

Solution

Well, the issue is that both lapply and tapply have an argument named FUN. Note that colwise(tapply) is a function whose body contains the following line:

out <- do.call("lapply", c(list(filtered, .fun, ...), dots))

Let's go to this line with our debugger by writing

ct <- colwise(tapply); trace(ct, quote(browser()), at = 6)

and then running

ct(subset(df, select = c(a, c)), df$b, FUN = function(x){sum(x[x > 2])})

Now let's print c(list(filtered, .fun, ...), dots). Notice that the first three (unnamed) arguments are the data frame, tapply, and df$b, with the FUN argument above coming in last. However, this argument is named. Since this is a do.call on lapply, that argument is matched by name to lapply's own FUN parameter instead of being passed through to tapply, and tapply and df$b are pushed into lapply's ... instead. So what is happening is that you are turning this into:

lapply(subset(df, select = c(a, c)), function(x){sum(x[x > 2])}, tapply, df$b)

This, of course, makes no sense, and if you execute the above (still in your debugger) manually, you will get the exact same error you are getting. For a simple workaround, try:

tapply2 <- function(.FUN, ...) tapply(FUN = .FUN, ...)
colwise(tapply2)(subset(df, select = c(a, c)), df$b, .FUN = function(x){sum(x[x > 2])})
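
The matching behaviour described above can be reproduced without plyr by calling do.call on lapply directly. This is a minimal sketch that mirrors the do.call line quoted earlier; the names f, cols, ok, bad, and fixed are only for illustration and do not come from plyr:

```r
df <- data.frame(a = 1:10, b = rep(1:2, each = 5), c = 2:11)
f <- function(x) sum(x[x > 2])
cols <- df[c("a", "c")]

# Unnamed f: tapply is matched to lapply's FUN, and f travels through ... to tapply.
ok <- do.call("lapply", list(cols, tapply, df$b, f))

# Named FUN = f: lapply claims it, so tapply and df$b are passed on to f instead.
bad <- tryCatch(
  do.call("lapply", list(cols, tapply, df$b, FUN = f)),
  error = function(e) conditionMessage(e)
)

# With the tapply2 wrapper, .FUN does not match lapply's FUN, so it reaches tapply2.
tapply2 <- function(.FUN, ...) tapply(FUN = .FUN, ...)
fixed <- do.call("lapply", list(cols, tapply2, df$b, .FUN = f))
```

Here ok and fixed hold the per-group sums for columns a and c, while bad holds the message of the error raised by the named call.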

The plyr package should be checking for ... arguments named FUN (or anything else that can interfere with lapply's own argument matching), but the author does not seem to have included such a check. You can submit a pull request to the plyr package that implements any of the following workarounds:

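Defines a local wrapper that forwards the extra arguments through a closure, so that nothing in ... can be matched to lapply's own X or FUN:

.lapply <- function(`*X*`, `*FUN*`, ...) lapply(`*X*`, function(x) `*FUN*`(x, ...))

(the unusual parameter names minimize interference further).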

Checks names(list(...)) within the colwise(tapply) function for X and FUN (can introduce problems if the author intended to prevent evaluation of promises until the child call).

Calls do.call("lapply", ...) explicitly with named X and FUN, so that a conflicting FUN in ... produces the intended error:

formal argument "FUN" matched by multiple actual arguments
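
As a rough illustration of the name-check idea from the list above, here is a small sketch. The helper apply_columns is hypothetical and not part of plyr. It uses names(list(...)), which forces the extra arguments, so the caveat about promise evaluation mentioned above applies to it as well:

```r
# Hypothetical helper, not part of plyr: apply .fun to every column of df,
# rejecting extra arguments whose names lapply() would claim for itself.
apply_columns <- function(df, .fun, ...) {
  arg_names <- names(list(...))  # note: list(...) forces the arguments
  clash <- intersect(arg_names, c("X", "FUN"))
  if (length(clash) > 0) {
    stop("arguments named ", paste(clash, collapse = ", "),
         " would be matched by lapply(); pass them unnamed", call. = FALSE)
  }
  lapply(df, .fun, ...)
}

df <- data.frame(a = 1:10, b = rep(1:2, each = 5), c = 2:11)
f <- function(x) sum(x[x > 2])

# Unnamed function argument: passed through to tapply as usual.
res <- apply_columns(df[c("a", "c")], tapply, df$b, f)

# Named FUN: stopped early with a readable message instead of the long dump.
err <- tryCatch(
  apply_columns(df[c("a", "c")], tapply, df$b, FUN = f),
  error = function(e) conditionMessage(e)
)
```

The point of failing early is that the user sees which argument name caused the clash, rather than a deparsed copy of tapply in an "unused arguments" error.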
Licensed under: CC-BY-SA with attribution