How to calculate marginal probabilities for generating correlated binary variables

StackOverflow https://stackoverflow.com/questions/18872628

  •  29-06-2022
  •  | 
  •  

Let's say I want to create 50 binary variables of length 100 that are each correlated with each other.

For I create a correlation matrix with the specified rho's:

cor.mat <- matrix(0.2,nrow=50, ncol=50)
diag(cor.mat) <- 1

next I use rmvbin:

library(bindata)
rmvbin<-rmvbin(100, margprob=x, bincorr=cor.mat)

However, I'm not sure how to calculate the margprob argument. Can someone help?

SHould it be a vector of the sum of probabilities in each row and column?

有帮助吗?

解决方案

margprob should simply be a repeated vector of the probability that any single binary variable is 1, independent of the rest; call this value p. Assuming identically distributed variables (which given your correlation matrix seems to be the case), margprob=rep(p,50).

It should NOT be a vector of the sum of probabilities in each row and column, as the correlation matrix cannot be used to determine the marginal probabilities. If you're having trouble figuring out what the marginal probabilities for your random variables are, you will have to give more context for the problem and it would be a question more appropriate for math.stackexchange.com.

其他提示

I think the problem has been that people saw the solution as being either too simple or not properly specified. You don't actually calculate the marginal probabilities ... you specify them. Then the rmvbin function uses the specification of the marginal probabilities and the joint correlations to do the needed sampling to (on average) give joint distributions that match those specs.

library(bindata)
rmvbin<-rmvbin(100, margprob=rep(.5,50), bincorr=cor.mat)

> str(rmvbin)
 num [1:100, 1:50] 0 0 0 1 0 0 0 1 0 0 ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : NULL

So to look at the sampling characteristics of this result, you can see what correlation there is with the first column:

Hmisc::describe(apply(rmvbin[,-1], 2, function(col) cor(col, rmvbin[,1]) ) )
apply(rmvbin[, -1], 2, function(col) cor(col, rmvbin[, 1])) 
      n missing  unique    Mean     .05     .10     .25     .50     .75     .90 
     49       0      38  0.2009 0.05886 0.09874 0.13309 0.19372 0.25208 0.29723 
    .95 
0.33772 

lowest : 0.03508 0.04013 0.08696 0.09874 0.10889
highest: 0.29942 0.32450 0.34653 0.40902 0.46714 

So the average correlation under sampling was pretty close to the nominal value 0.2. but it did vary fairly widely.

许可以下: CC-BY-SA归因
不隶属于 StackOverflow
scroll top