Question

This is a follow-up question to my earlier post (covariance matrix by group) regarding a large data set. I have 6 variables (HML, RML, FML, TML, HFD, and BIB) and I am trying to create group-specific covariance matrices for them (based on the variable Group). However, I have a lot of missing data in these 6 variables (not in Group) and I need to be able to use that data in the analysis - removing or omitting rows is not a good option for this research.

I narrowed the data set down into a matrix of the actual variables of interest with:

>MMatrix = MMatrix2[1:2187,4:10]

This worked fine for calculating an overall covariance matrix with:

>cov(MMatrix, use="pairwise.complete.obs",method="pearson")

To get the covariance matrices listed by group, I turned the original data matrix into a data frame (so I could use the $ operator) with:

>CovDataM <- as.data.frame(MMatrix)

I then used the following suggested code to get covariances by group, but it keeps returning NULL:

>cov.list <- lapply(unique(CovDataM$group),function(x)cov(CovDataM[CovDataM$group==x,-1]))

I figured this was because of my NAs, so I tried adding use = "pairwise.complete.obs" as well as use = "na.or.complete" (when desperate) to the end of the code, but it still only returned NULLs. I read somewhere that "pairwise.complete.obs" can only be used if method = "pearson", but adding that at the end didn't make a difference either. I need to get covariance matrices of these variables by group, with all the available data included if possible, and I am completely stuck.


Solution 2

Your problem is that lapply is treating your list oddly. If you run this code (which I hope is pretty much analogous to yours):

CovData <- matrix(1:75, 15) 
CovData[3,4] <- NA
CovData[1,3] <- NA
CovData[4,2] <- NA
CovDataM <- data.frame(CovData, "group" = c(rep("a",5),rep("b",5),rep("c",5)))

colnames(CovDataM) <- c("a","b","c","d","e", "group")
lapply(unique(as.character(CovDataM$group)), function(x) print(x))

You can see what lapply is actually iterating over, which is not what you intend it to evaluate. The NAs don't appear to be the problem. When I run:

by(CovDataM[ ,1:5], CovDataM$group, cov, use = "pairwise.complete.obs", method = "pearson")

It seems to work fine. Hopefully that generalizes to your problem.
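
If you would rather keep an lapply-style solution, the same idea can be sketched with split(). This is only a sketch on the toy data above: it assumes the grouping column is named group (all lower case) and that every other column is numeric, so adjust the names to your own data.

```r
# Toy data mirroring the example above (column names are illustrative)
CovData <- matrix(as.numeric(1:75), 15)
CovData[3, 4] <- NA
CovData[1, 3] <- NA
CovData[4, 2] <- NA
CovDataM <- data.frame(CovData, group = rep(c("a", "b", "c"), each = 5))
colnames(CovDataM) <- c("a", "b", "c", "d", "e", "group")

# Keep only the numeric variables by dropping the grouping column by name
# (assumption: the column is called "group")
vars <- setdiff(names(CovDataM), "group")

# split() returns one data frame per group; lapply() runs cov() on each,
# passing the NA handling through to cov()
cov.list <- lapply(split(CovDataM[vars], CovDataM$group),
                   cov, use = "pairwise.complete.obs", method = "pearson")

names(cov.list)       # one element per group
dim(cov.list[["a"]])  # dimensions of the matrix for group "a"
```

Selecting the variables by name rather than with a negative index avoids accidentally passing the grouping column to cov(), and a misspelled column name in split() shows up more visibly than a silent NULL from $.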

OTHER TIPS

Here is an example that should get you going:

# Create some fake data
m <- matrix(runif(6000), ncol=6, 
            dimnames=list(NULL, c('HML', 'RML', 'FML', 'TML', 'HFD', 'BIB')))

# Insert random NAs
m[sample(6000, 500)] <- NA

# Create a factor indicating group levels
grp <- gl(4, 250, labels=paste('group', 1:4))

# Covariance matrices by group
covmats <- by(m, grp, cov, use='pairwise.complete.obs')

The resulting object, covmats, is an object of class "by" that behaves like a list with four elements (in this case), one covariance matrix for each of the four groups.
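
As a rough usage sketch (not part of the original example), the individual matrices can be pulled out by position or by group label. The set.seed() call and the npairs helper below are my own additions for illustration: with pairwise deletion each covariance can be based on a different number of rows, so it may be worth checking those counts per group.

```r
set.seed(1)  # assumption: fixed seed only so the fake data is reproducible

# Same fake data as in the example above
m <- matrix(runif(6000), ncol = 6,
            dimnames = list(NULL, c('HML', 'RML', 'FML', 'TML', 'HFD', 'BIB')))
m[sample(6000, 500)] <- NA
grp <- gl(4, 250, labels = paste('group', 1:4))

covmats <- by(m, grp, cov, use = 'pairwise.complete.obs')

class(covmats)         # "by"
names(covmats)         # the group labels
covmats[["group 1"]]   # covariance matrix for the first group
covmats[[2]]           # access by position also works

# Hypothetical helper: number of rows with both variables observed,
# for every pair of variables within each group
npairs <- by(m, grp, function(d) crossprod(1 - is.na(as.matrix(d))))
npairs[["group 1"]]
```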

Licensed under: CC-BY-SA with attribution