Question

I have been reading through the questions on this site and others, and I want to make sure I am doing this correctly; then I would like some advice on analysing the results.

I am exporting an m-by-n binary matrix from Java to R (using JRI), and then I want to run lm() against an expected vector of 0s.

Here is the export code for getting the matrix into R:

REXP x = re.eval("selectionArray <- c()");

// push each row of the Java-side matrix into R and rbind it onto selectionArray
for (int j = 0; j < currentSelection.length; j++) {
    boolean result = re.assign("currentSNPs", currentSelection[j]);
    if (result) {
        x = re.eval("selectionArray <- rbind(selectionArray, currentSNPs)");
    }
}

Then I want to execute the lm() function to get the R-squared values:

            x = re.eval("fm = lm(selectionArray ~ 0)");

I know that I need to use summary(fm) at this point to get the R-squared values, but I am not sure how to pull them out or what they mean. I want to know the significance of the deviation from the expected 0 value in each column.

Thanks!


Solution

To extract the R^2 value from an lm object named m:

summary(m)$r.squared

You can always view the structure of an object in R by using the str() function; in this situation you want str(summary(m)).
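
For example, something like this toy fit (the data and the object name m here are made up purely for illustration) shows where r.squared lives:

#toy data, just for illustration
set.seed(1)
x <- rnorm(50)
y <- 2 * x + rnorm(50)
m <- lm(y ~ x)

str(summary(m))           #look at everything summary() returns
summary(m)$r.squared      #the R^2 value
summary(m)$adj.r.squared  #the adjusted R^2, if you prefer that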

However, it's not clear what you're trying to accomplish here. In the formula argument of lm() you specify selectionArray ~ 0, which doesn't make sense for two reasons:

1) As previously hinted at, a 0 on the right-hand side of the formula removes the intercept, and since nothing else is on that side there is no predictor left, so there is no beta coefficient to estimate.

2) Your outcome, selectionArray, is a matrix. As far as I know, lm() isn't set up for multiple outcomes like this.
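
You can see the first point for yourself with a throwaway vector (just a sketch; y here is an arbitrary example):

y <- rnorm(20)     #any numeric vector will do
fm <- lm(y ~ 0)    #removes the intercept and leaves no predictors
fm                 #prints "No coefficients"
coef(fm)           #numeric(0) -- there is nothing to estimate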

Are you trying to test whether each column of selectionArray differs significantly from 0? If so, ANY column with at least one success (1) in it is significantly different from an all-zero column. If you're interested in confidence intervals for the probability of success in each column, use the following code. Note that it does not adjust for multiple comparisons.

First, let's start with a toy example to demonstrate the concept:

v1 <- rbinom(100, size = 1, p = .25)
#create a vector of length 100,
#where each entry corresponds to the
#result of a Bernoulli trial with probability p

binom.test(sum(v1), n = length(v1), p = 0)
##let's pretend we didn't just generate v1 ourselves;
##we can use binom.test to determine the 95% CI for p

#now in terms of what you want to do...
#here's a dataset that might be something like yours:
selectionArray <- sapply(runif(10), FUN = function(.p) rbinom(100, size = 1, p = .p))
#this builds a 100 x 10 matrix of 0/1 outcomes, where each entry
#is a single Bernoulli trial and each column has a randomly
#generated success probability p between 0 and 1

#using a for loop
#run a binomial test on each column, store the results in binom.test.results
binom.test.results <- list()
for(i in 1:ncol(selectionArray)){
    binom.test.results[[i]] <- binom.test(sum(selectionArray[,i]), 
        n=nrow(selectionArray), p=0)
}

#for loops are often discouraged in R; here's a more idiomatic way to do the same thing with lapply:
binom.test.results1 <- lapply(as.data.frame(selectionArray), function(.v){
    binom.test(sum(.v), n=nrow(selectionArray), p = 0)
})

#using str() on a single element of binom.test.results will help you
#identify what results you'd like to extract from each test
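
For example, one way to collect the pieces you'll probably want from each test (a sketch, assuming the binom.test.results1 list built above) is:

#each element is an "htest" object with p.value, conf.int and estimate
p.values  <- sapply(binom.test.results1, function(.t) .t$p.value)
conf.ints <- t(sapply(binom.test.results1, function(.t) .t$conf.int))
colnames(conf.ints) <- c("lower", "upper")
estimates <- sapply(binom.test.results1, function(.t) .t$estimate)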

OTHER TIPS

I don't know much about Java, so I won't comment on that part.

So, you've got a matrix of 0 and 1 values, and nothing else?

And you want to know if the column means are significantly different from 0?

That means you should run hypothesis tests rather than a regression, although a regression can be equivalent to such a test.

lm(y ~ 0) does not make sense. If you only want an intercept, you should use lm(y ~ 1). However, that would be equivalent to a t-test, which is not statistically appropriate for binary data.

I suspect it would be better to use fit <- glm(y ~ 1, family = binomial) and then extract the p-value with p <- summary(fit)$coef[4], but I am not a statistician.
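
For what it's worth, here is a sketch of how that could be applied column by column (assuming the same 0/1 matrix selectionArray as in the accepted answer; note that the intercept test here is against a success probability of 0.5, not 0):

#fit an intercept-only logistic regression to each column and keep the
#intercept's p-value (row 1, column 4 of the coefficient table)
p.values <- apply(selectionArray, 2, function(y) {
    fit <- glm(y ~ 1, family = binomial)
    summary(fit)$coefficients[1, 4]
})
#columns that are all 0s or all 1s will trigger warnings and give
#essentially uninformative p-values here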

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow