Question

I have been able to calculate covariance for my large data set with:

cov(MyMatrix, use="pairwise.complete.obs",method="pearson")

This provided the covariance table I was looking for, as well as dealing with the NA issues that are throughout my data. For a deeper analysis, however, I want to create covariance matrices that deal separately with the 800+ groups I have in my data set (some have 40+ observations, others only 1). I tried (from http://www.mail-archive.com/r-help@r-project.org/msg86328.html):

lapply(list(cov), by, data = MyMatrix[8:13], INDICES = MyMatrix["Group"])

Which gave me the following error:

Error in tapply(seq_len(6L), list(MyMatrix["Group"] = NA_real_), function (x) : arguments must have same length

This made me think the issue with the code involved the missing NA data, so I tried incorporating the "use="pairwise.complete.obs",method="pearson"" phrase into the lapply code and can't get it to work. I'm not sure the best place for it, so I tried sticking it everywhere:

lapply(list(cov), use="pairwise.complete.obs",method="pearson"),by,data=MyMatrix[8:13], INDICES = MyMatrix["Group"])

lapply(list(cov),by,data=PhenoMtrix[8:13], INDICES = PhenoMtrix["Group"], use="pairwise.complete.obs",method="pearson")

This is obviously sloppy and doesn't work, so I'm a little stuck. Thanks in advance for your help!

My data is formatted as such:

Group HML RML FML TML FHD BIB

 1      323.50    248.75     434.50    355.75    46.84    NA

 2        NA      238.50     441.50    353.00    45.83    277.0

 2      309.50    227.75     419.00    332.25    46.39    284.0
Was it helpful?

Solution 2

This would be much better if you provided an example of your data (or all of it), but since you didn't,

# create sample data
set.seed(1)
MyMatrix <- data.frame(group=rep(1:5, each=100),matrix(rnorm(2500),ncol=5))
# generate list of covariance matrices by group
cov.list <- lapply(unique(MyMatrix$group),
                   function(x)cov(MyMatrix[MyMatrix$group==x,-1],
                                  use="na.or.complete"))
cov.list[1]
# [[1]]
#             X1          X2          X3          X4          X5
# X1  0.80676209 -0.09541458 -0.12704666 -0.04122976  0.08636307
# X2 -0.09541458  0.93350463 -0.05197573 -0.06457299 -0.02203141
# X3 -0.12704666 -0.05197573  1.06030090  0.07324986  0.01840894
# X4 -0.04122976 -0.06457299  0.07324986  1.12059428  0.02385031
# X5  0.08636307 -0.02203141  0.01840894  0.02385031  1.11101410

In this example we create a dataframe called MyMatrix with a six columns. The first is group and the other five are X1, X2, ... X5 and contain the data we wish to correlate. Hopefully, this is similar to the structure of your dataset.

The operative line of code is:

cov.list <- lapply(unique(MyMatrix$group),
                   function(x)cov(MyMatrix[MyMatrix$group==x,-1],
                                  use="na.or.complete"))

This takes a list of group id's (from unique(MyMatrix$group)) and calls the function with each of them. The function calculates the covariance matrix for all columns of MyMatrix except the first, for all rows in the relevant group, and stores the results in a 5-element list (there are 5 groups in this example).

Note: Regarding how to deal with NA. There are actually several options; you should review the documentation on ?cov to see what they are. The method chosen here, use="na.or.complete" includes in the calculation only rows which have no NA values in any of the columns. If, for a given group, there are no such rows, cov(...) returns NA. There are several other choices though.

OTHER TIPS

you can also try:

  by(MyMatrix[-1],MyMatrix$group,cov,use="na.or.complete")

You can also first turn your dataframe into a list, and then use lapply to run the cov function on that list structure, and return a list of covariance matrices for each group.

Your example dataframe was too small to answer your question, so I used similar sample data by @jlhoward and some of your column names:

#Create sample dataframe and rename the columns based on the initial question 
MyDataframe <- data.frame(group=rep(1:5, each=10),matrix(rnorm(250),ncol=5))
colnames(MyDataframe) <- c("Group", "HML", "RML", "FML", "TML", "FHD")

#Split the dataframe columns HML, RML, FML, TML, and FHD into lists based on group membership, and call the new list MyList
MyList <- split(MyDataframe[ ,c("HML", "RML", "FML", "TML", "FHD")], list(Group = MyDataframe$Group))
> MyList
$`1`
          HML        RML        FML         TML         FHD
1   1.6806547 -1.2357861 -0.1438550 -1.79852015 -0.18745361
2  -1.2750024 -1.0973354  0.8654817  0.51666643  1.23240278
3   0.2381941 -1.1605690 -1.1124618 -1.24223216 -0.47014275
4  -0.6592671  0.6749256  0.3744053  2.82355336  0.04349764
5  -1.2026018  2.2036865 -0.6543408  0.05235647 -0.88794230
6   0.7946254  1.9786356 -0.1276282  0.37147386  2.23512260
7  -0.3166249 -1.0072974 -2.0800837 -0.31275558  0.88379182
8  -0.1662388  1.3819116 -3.1629656 -0.86033274 -0.31272981
9  -0.4666707 -0.8104205  1.0934703 -0.02459932 -0.35725108
10 -0.8385697  1.7204379 -0.3447757  0.18629448  0.42084553

$`2`
           HML         RML        FML           TML        FHD
11  0.03604007  0.58921306 -0.2066693 -1.0887154121  0.2790660
12  0.81767599  0.08703872 -0.1476078  0.1261011136  0.3525258
13 -0.19341506  0.31941568 -1.2553003  0.2419955263 -0.3152117
14  0.36065670 -0.77353050 -0.2166640  0.0001615059  0.7663386
15 -1.62885990  0.18124576  0.8299511 -1.0140332552 -0.9668448
16 -0.44847189 -1.57839214 -0.6470409 -0.0612936448 -0.3844145
17 -0.11144444 -0.65229817  0.6505128 -0.0882344334 -0.3144284
18  0.74339324  1.78857053 -1.2333200 -0.9063703037 -0.0765000
19  1.51958249 -0.56289571  0.2964601 -0.0287684624  0.3151081
20 -1.56974385  0.28559655 -2.7583618  0.3632164248 -0.1410783

$`3`
          HML         RML        FML         TML         FHD
21  1.8902244 -1.32617251 -0.3473238 -0.14714488 -0.20950269
22  0.7233421  1.87021160  0.5498787 -0.21878322 -0.25967403
23 -0.4488791  0.40916110 -0.2716354  0.68897421 -0.87347369
24 -0.4013050  0.41924705  0.6404477  0.81811788  1.24055660
25  1.1542181  0.75534163  0.1067173  0.32427043 -0.85858957
26 -1.3252742 -0.09989574  1.4557291 -0.62678378  0.04029924
27  0.2694684 -0.16238724 -0.6138011  0.07998383 -0.78157860
28 -0.8149025  0.77406215 -0.6921972  0.21223283 -0.86679556
29 -0.4916411 -0.80898776 -0.9372076 -1.44085453  1.18841866
30  0.3670508  1.45821533 -1.2531432  0.23593131 -1.17231457

$`4`
           HML        RML        FML         TML         FHD
31 -0.91753704  1.5976080  1.9286179 -0.88697107  0.85215534
32  0.57087719  1.2202687  0.5791964  1.98994106  0.68640384
33  0.79562327  1.0253044  0.5356456  0.31906648 -1.06342199
34 -0.06380725 -0.5774832  0.7260138 -0.93905123  1.88579741
35  0.24285367 -1.3862499  0.2853635 -1.27603774  0.07991027
36  1.15532419  0.4545112  0.3121971 -0.80544639 -0.74762482
37  1.30120698  1.3480126 -0.1012468  0.03093374 -0.74170584
38 -0.04423831  0.9100061 -2.1983937 -0.88974443  0.50814835
39 -1.71264891 -0.1225082  0.5095046 -1.28680921 -0.37710894
40 -0.11079800 -0.6806858 -0.9002725 -0.70797874  0.49889563

$`5`
          HML         RML        FML        TML         FHD
41 -0.6549724  0.77703431 -0.7953904 -0.7044253  0.73765368
42 -2.3945883  1.16952896  1.8286481 -0.8116904 -0.59562563
43  1.3470786 -0.26396886  0.3858448 -0.1839417  0.66618305
44 -0.4450848  0.71092152 -0.7665068 -0.1213066 -1.33159041
45  0.2621206 -0.05290252 -0.2817160 -1.1119020  0.53377605
46 -1.8713943 -0.82580895  0.5590292  0.5474239  1.85929122
47  2.1826177 -1.88918691 -0.2495949 -0.7371631  1.33998290
48 -0.2294448  1.04252185 -1.3311849  0.0447891  0.48173560
49  0.2250941 -0.37240902  1.1648265 -0.4848731 -0.06271555
50 -0.4131518 -0.94258989 -0.3291930 -1.7198636 -0.80485465

#Compute the covariance matrix for each group in MyList, and here I specify the columns, which is always good practice
within.group.cov <- lapply(MyList, function(x) cov(x[c("HML", "RML", "FML", "TML", "FHD")], use="pairwise.complete.obs"))
> within.group.cov
$`1`
           HML        RML        FML        TML       FHD
HML  0.8421194 -0.3185286 -0.1532037 -0.6192220 0.1086547
RML -0.3185286  2.1271906 -0.3640053  0.5995860 0.1025134
FML -0.1532037 -0.3640053  1.7151420  0.6879402 0.2133205
TML -0.6192220  0.5995860  0.6879402  1.5581052 0.2910293
FHD  0.1086547  0.1025134  0.2133205  0.2910293 0.8965167

$`2`
           HML         RML         FML         TML         FHD
HML 1.00082711  0.02754603  0.28260061  0.03354966  0.33759313
RML 0.02754603  0.84357799 -0.32547765 -0.24018033 -0.02572972
FML 0.28260061 -0.32547765  1.13751383 -0.22229766 -0.03210768
TML 0.03354966 -0.24018033 -0.22229766  0.29451851  0.06511133
FHD 0.33759313 -0.02572972 -0.03210768  0.06511133  0.24129973

$`3`
            HML         RML         FML         TML        FHD
HML  0.95281484 -0.06072793 -0.18608614  0.08682029 -0.2241560
RML -0.06072793  0.94498926  0.05826306  0.26708137 -0.3414530
FML -0.18608614  0.05826306  0.68405108  0.02649167  0.2239994
TML  0.08682029  0.26708137  0.02649167  0.43268293 -0.2285798
FHD -0.22415605 -0.34145303  0.22399939 -0.22857978  0.7388127

$`4`
           HML        RML         FML         TML         FHD
HML  0.8545644  0.2010881 -0.18227287  0.43631519 -0.31002651
RML  0.2010881  1.0269461  0.16025720  0.53798506 -0.20676662
FML -0.1822729  0.1602572  1.18667903  0.11072818  0.07572155
TML  0.4363152  0.5379851  0.11072818  0.99720184 -0.07098033
FHD -0.3100265 -0.2067666  0.07572155 -0.07098033  0.82212198

$`5`
           HML         RML         FML         TML         FHD
HML  1.8208574 -0.73730457 -0.43570215 -0.13051837  0.30984815
RML -0.7373046  0.98584765 -0.06674645  0.10796767 -0.43046406
FML -0.4357021 -0.06674645  0.93344577 -0.00659714 -0.03835305
TML -0.1305184  0.10796767 -0.00659714  0.40967714  0.26304517
FHD  0.3098481 -0.43046406 -0.03835305  0.26304517  0.97107321
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top