Use aggregate with a function that uses data from two columns (e.g. cov or prod)

https://stackoverflow.com/questions/3823549

26-09-2019

Question

I have a long time series of daily data and 101 columns. Each month I would like to calculate the cov of each of the first 100 columns with the 101st column. This would generate a monthly covariance with the 101st column for each of the 100 columns based on daily data. It seems that aggregate does what I want with functions that take a single vector, such as mean, but I can't get it to work with cov (or prod).

Please let me know if a dput of a few months would help.

> library("zoo")
> data <- read.zoo("100Size-BM.csv", header=TRUE, sep=",", format="%Y%m%d")
> head(data[, c("R1", "R2", "R3", "R100", "Mkt.RF")])
                 R1       R2       R3     R100  Mkt.RF
1963-07-01 -0.00212  0.00398 -0.00472 -0.00362 -0.0066
1963-07-02 -0.00242  0.00678  0.00068 -0.00012  0.0078
1963-07-03  0.00528  0.01078  0.00598  0.00338  0.0063
1963-07-05  0.01738 -0.00932 -0.00072 -0.00012  0.0040
1963-07-08  0.01048 -0.01262 -0.01332 -0.01392 -0.0062
1963-07-09 -0.01052  0.01048  0.01738  0.01388  0.0045

mean works great, and gives me the monthly data I want.

> mean.temp <- aggregate(data[, 1:100], as.yearmon, mean)
> head(mean.temp[, 1:3])
                    R1            R2            R3
Jul 1963  0.0003845455  7.545455e-05  0.0004300000
Aug 1963 -0.0006418182  2.412727e-03  0.0022263636
Sep 1963  0.0016250000  1.025000e-03 -0.0002600000
Oct 1963 -0.0007952174  2.226522e-03  0.0004873913
Nov 1963  0.0006555556 -5.211111e-03 -0.0013888889
Dec 1963 -0.0027066667 -1.249524e-03 -0.0005828571

But I can't get a function that uses two different columns/vectors to work.

> cov.temp <- aggregate(data[, 1:100], as.yearmon, cov(x, data[, "Mkt.RF"]))
Error in inherits(x, "data.frame") : object 'x' not found

Nor can I get it to work by writing a cov wrapper.

> f <- function(x) cov(x, data[, "Mkt.RF"])
> cov.temp <- aggregate(data[, 1:100], as.yearmon, f)
Error in cov(x, data[, "Mkt.RF"]) : incompatible dimensions

Should I do this with a for loop? I am hoping there is a more R way. Thanks!

Solution

You forgot the function(x) declaration, and you need to pass cov the matching monthly subset of the Mkt.RF column rather than the whole column. (Naming your object data also masks the base data function, by the way.) Try this:

> aggregate(data, as.yearmon, function(x) cov(x,data[index(x),"Mkt.RF"]))
                  R1         R2        R3        R100     Mkt.RF
Jul 1963  1.3265e-05 2.0340e-05 3.464e-05  2.2575e-05  6.267e-05
Aug 1963 -7.1295e-05 2.8875e-05 1.000e-06 -9.9700e-06 -2.608e-05

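*Note that I changed the last three observations in your example data to the month of August, so there would be more than one month of output.

Check the Aug 1963 row before relying on this, though. The Mkt.RF entry is the covariance of Mkt.RF with itself, that is, a variance, yet it is negative (-2.608e-05), which is impossible. The August numbers are what you get if index(x) returns the positions 1, 2, 3 instead of the August dates, so that each August column is paired with the July Mkt.RF values. That is what happens if aggregate hands FUN a plain vector rather than a zoo series.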
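The printed August values can be reproduced in base R from the six rows shown in the question, which makes the pairing problem easier to see. This is only a sketch: the exact August dates are an assumption (the answer says only that the last three rows were moved to August), and d, aug, cov_right, cov_wrong and var_wrong are names made up for illustration.

```r
# Six rows from the question; the last three are moved to August.
# The exact August dates are assumed, only the month matters here.
d <- data.frame(
  date   = as.Date(c("1963-07-01", "1963-07-02", "1963-07-03",
                     "1963-08-05", "1963-08-08", "1963-08-09")),
  R1     = c(-0.00212, -0.00242, 0.00528, 0.01738, 0.01048, -0.01052),
  Mkt.RF = c(-0.0066, 0.0078, 0.0063, 0.0040, -0.0062, 0.0045)
)
aug <- which(format(d$date, "%Y-%m") == "1963-08")  # row positions 4, 5, 6

# Pairing August R1 with August Mkt.RF (rows 4, 5, 6 on both sides)
cov_right <- cov(d$R1[aug], d$Mkt.RF[aug])

# Pairing August R1 with rows 1, 2, 3 of Mkt.RF, i.e. the July values
cov_wrong <- cov(d$R1[aug], d$Mkt.RF[seq_along(aug)])

# Same mispairing for Mkt.RF with "itself"
var_wrong <- cov(d$Mkt.RF[aug], d$Mkt.RF[seq_along(aug)])

print(c(cov_right = cov_right, cov_wrong = cov_wrong, var_wrong = var_wrong))
# cov_wrong is -7.1295e-05 and var_wrong is -2.608e-05,
# the same numbers as the Aug 1963 row printed above
```

Selecting rows by their positions in the full data set, as in cov_right, does not depend on what aggregate passes to FUN.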

OTHER TIPS

You can use the approach I wrote here, namely to do something like:

tapply(1:nrow(data), data$group, function(s) cov(data$x[s], data$y[s]))
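As a self-contained sketch of that tapply idea, with a made-up data frame (df, group, x and y are placeholder names, not the asker's data):

```r
set.seed(42)
df <- data.frame(
  group = rep(c("a", "b", "c"), each = 10),
  x     = rnorm(30),
  y     = rnorm(30)
)

# tapply splits the row numbers by group; each group's row numbers `s`
# are then used to pull matching values from both columns
res <- tapply(seq_len(nrow(df)), df$group,
              function(s) cov(df$x[s], df$y[s]))
print(res)
```

Because the function receives row numbers rather than values, both columns are always subset with the same rows.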

In aggregate(), as in many R functions that apply another function to subsets of data, you name the function you want to apply, in this case by adding FUN = cov to your aggregate() call. You can then supply arguments to this function through the ... argument.

You can pass data[, "Mkt.RF"] as argument y of cov(), so something like this should work:

cov.temp <- aggregate(data[, 1:100], as.yearmon, FUN = cov, y = data[, "Mkt.RF"])

However, in this case it doesn't work, because you need to handle the zoo nature of the data and subset data[, "Mkt.RF"] in the same way that aggregate() breaks up the data[, 1:100] columns. So an alternative is to specify a function inline, like this:

cov.temp <- aggregate(data[, 1:100], as.yearmon, 
                      FUN = function(x) cov(x, y = data[index(x), "Mkt.RF"]))
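The reason the full-length y fails is the same "incompatible dimensions" error from the question: cov() needs x and y to have the same number of observations. A minimal base-R sketch with made-up vectors (x_all, y_all and x_month are hypothetical stand-ins for a column, Mkt.RF, and one month of the column):

```r
set.seed(1)
x_all   <- rnorm(61)      # stand-in for one of the 100 columns
y_all   <- rnorm(61)      # stand-in for the Mkt.RF column
x_month <- x_all[1:31]    # what FUN sees for a single month

# One month of x against the whole of y: cov() stops with an error
msg <- tryCatch(cov(x_month, y_all),
                error = function(e) conditionMessage(e))
print(msg)

# y restricted to the same rows as x: works
ok <- cov(x_month, y_all[1:31])
print(ok)
```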

Here is an example that should run out of the box:

library("zoo")
dat <- data.frame(matrix(rnorm(365*10*6), ncol = 6))
Dates <- seq.Date(from = as.Date("1963-07-01"), by = "days", length = 365*10)
dat2 <- zoo(dat, order.by = Dates)

Which gives us:

> head(dat2)
                    X1         X2         X3          X4         X5         X6
1963-07-01  0.30910867  0.5539864  0.6433690  0.20608146 -1.7706003 -0.4607610
1963-07-02 -0.02519616 -0.1856305  1.0419578  1.01319153  0.8671110  0.1196251
1963-07-03  1.56464024  0.4980238  0.2976338  0.05654036  0.4984225 -1.4626501
1963-07-04 -0.24028698 -1.4365257  0.5707873 -0.05851961 -0.7176343  0.1233137
1963-07-05 -0.87770815 -0.5217949 -2.4875626 -0.08200408 -0.6121038 -0.3881126
1963-07-06 -0.53660576 -1.1098966  2.7411511 -1.37106883 -0.5891641  1.6322411

Now, let's assume X6 is your "Mkt.RF" column and we'll aggregate over dat2[,1:5]:

cov.temp <- aggregate(dat2[, 1:5], as.yearmon, 
                      FUN = function(x) cov(x, y = dat2[index(x),"X6"]))
head(cov.temp)

Which yields:

> head(cov.temp)
                  X1          X2           X3          X4          X5
Jul 1963 -0.30185387  0.09802210  0.019282934 -0.03621272  0.05332324
Aug 1963  0.14739044  0.04276340  0.081847499 -0.35195736 -0.14680017
Sep 1963  0.56698393 -0.08371676  0.003870935 -0.05948173  0.07550769
Oct 1963  0.00711595 -0.07939798  0.118030943 -0.22065278 -0.12474052
Nov 1963  0.06551982  0.22848268  0.231967655  0.02356194 -0.24272566
Dec 1963  0.23866775  0.29464398 -0.034313793  0.09694199 -0.10481527
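With random data every output looks plausible, so it may be worth recomputing one cell from explicit row positions and comparing. The sketch below assumes the zoo package is installed and rebuilds a one-year version of the example with a fixed seed; m, aug, by_hand and from_aggregate are names introduced here for illustration.

```r
library("zoo")

set.seed(1)
m <- matrix(rnorm(365 * 6), ncol = 6,
            dimnames = list(NULL, paste0("X", 1:6)))
Dates <- seq.Date(from = as.Date("1963-07-01"), by = "days", length.out = 365)
dat2 <- zoo(m, order.by = Dates)

cov.temp <- aggregate(dat2[, 1:5], as.yearmon,
                      FUN = function(x) cov(x, y = dat2[index(x), "X6"]))

# Recompute August 1963, column X1, from the rows that fall in that month
aug <- which(format(index(dat2), "%Y-%m") == "1963-08")
by_hand <- cov(m[aug, "X1"], m[aug, "X6"])

# Second row of the aggregate result is August 1963
from_aggregate <- as.numeric(coredata(cov.temp)[2, 1])

print(c(by_hand = by_hand, from_aggregate = from_aggregate))
print(isTRUE(all.equal(by_hand, from_aggregate)))
```

If the last line prints FALSE, FUN is most likely being handed plain vectors, so index(x) is just 1, 2, 3, ... and the wrong rows of X6 are being used.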

HTH

I ended up using aggregate to format the data, but each cov calculation against a factor took about 50 minutes. On a whim I tried the plyr solution, which turned out to be dramatically faster.

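library("plyr")  # provides daply() and .()
# Per month: covariance of columns 1-100 with column 101, scaled by the number of daily rows
cov.fn <- function(x) nrow(x) * cov(x[, 1:100], x[, 101])
temp <- zoo(daply(data, .(as.yearmon(index(data))), cov.fn), unique(as.yearmon(index(data))))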

This takes about five seconds (roughly 600 times faster). I guess there are big speed gains to be had from more efficient subsetting operations.
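The same one-cov-call-per-month idea can be sketched in base R, without plyr. This is not the code timed above, and no speed claim is made for it. It uses made-up data with six columns instead of 101, and the names m, dates, rows_by_month and monthly_cov are hypothetical.

```r
set.seed(1)
n_days <- 365
dates  <- seq(as.Date("1963-07-01"), by = "day", length.out = n_days)
m <- matrix(rnorm(n_days * 6), ncol = 6,
            dimnames = list(NULL, paste0("X", 1:6)))

# Row positions of each month, keyed by "YYYY-MM"
rows_by_month <- split(seq_len(n_days), format(dates, "%Y-%m"))

# One cov() call per month: all five columns against column 6 at once
monthly_cov <- t(vapply(
  rows_by_month,
  function(i) drop(cov(m[i, 1:5], m[i, 6])),
  numeric(5)
))
colnames(monthly_cov) <- colnames(m)[1:5]

print(head(monthly_cov))
```

Each month's rows are selected once and shared by every column, which also keeps the two sides of cov() aligned.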

Thanks, all, for the help. I learned a lot on this one.

Licensed under: CC-BY-SA with attribution
