Question

I've got two vectors which are TRUE or FALSE. Basically data on households and whether they own a car and whether they have a gold watch. (Note, "car" and "gold watch" are not the actual categories, but they're effective substitutes for this question).

I want to find out the relationships between car ownership and watch ownership and could use some advice for both the stats and the R in terms of which functions to use.

The idea is to be able to say: "If someone has a car, we can say with 95% confidence that there is a 25% chance they have a gold watch"

I've been messing with CrossTable and assocstats and have basically gotten myself totally confused over what I think is a standard stats question.

Any quick insights into which tests/functions should be used? I've got a correlation of .265, but want to quantify the confidence.

I've looked around a bunch, including at: "How do I get a contingency table?" and "Contingency table on logistic regression in R with missing fitted values".

Thanks!!


Solution

You'd be looking to do a logit / probit regression. Look up the usage of glm (short for generalized linear models). Within this class of models, you'd specify the family as binomial with a logit or probit link. Type ?glm and ?family to read the descriptions of these functions. Missing data is handled via the na.action parameter, which may be set to na.pass. A confidence interval on a coefficient is then: estimated coefficient ± (critical value × standard error of the coefficient).
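Concretely, that interval can be pulled straight out of the fitted model. A minimal sketch, assuming your data are two raw TRUE/FALSE vectors (the vectors below are simulated placeholders, not the asker's data):

# hypothetical TRUE/FALSE vectors; replace with your own data
set.seed(1)
car   <- sample(c(TRUE, FALSE), 1000, replace = TRUE)
watch <- sample(c(TRUE, FALSE), 1000, replace = TRUE)

fit <- glm(watch ~ car, family = binomial(link = "logit"))
summary(fit)

# 95% Wald interval on the log-odds scale: estimate +/- 1.96 * SE
est <- coef(summary(fit))
est[, "Estimate"] + outer(est[, "Std. Error"], c(-1.96, 1.96))

# or let R compute the Wald intervals directly:
confint.default(fit)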

OTHER TIPS

Here's a shot at the details; use at your own risk. I'm no glm expert, but there are a few around here, and maybe they'll be kind enough to point out any problems, etc.:

# reproducible data
set.seed(2)
car <- as.factor(sample(c("TRUE","FALSE"), 1000, replace=TRUE))
watch <- as.factor(sample(c("TRUE","FALSE"), 1000, replace=TRUE))

# inspect data
(mytable <- table(car,watch))
       watch
car     FALSE TRUE
  FALSE   247  250
  TRUE    254  249
summary(mytable)
Number of cases in table: 1000 
Number of factors: 2 
Test for independence of all factors:
    Chisq = 0.06381, df = 1, p-value = 0.8006
# high p-value: no evidence against independence (as expected, since the data are random)
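The asker's target statement ("if someone has a car, X% chance they have a gold watch") is a conditional probability, and you can get a confidence interval for it straight from the table with an exact binomial test. A sketch, assuming the row/column layout printed above (rows = car, columns = watch):

# P(watch == TRUE | car == TRUE): 249 watch owners out of 249 + 254 car owners
binom.test(mytable["TRUE", "TRUE"], sum(mytable["TRUE", ]))

# prop.test() on the same counts gives a chi-square-based interval instead
The conf.int component of the result is the 95% interval on the conditional probability itself, which is exactly the kind of "with 95% confidence, a Y% chance" statement the question asks for.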

# reshape for glm
(mydf <- as.data.frame(mytable))
    car watch Freq
1 FALSE FALSE  247
2  TRUE FALSE  254
3 FALSE  TRUE  250
4  TRUE  TRUE  249

Model as suggested by Aditya Sihag (note that the aggregated 4-row data frame is fit without weights here, so the Freq counts are ignored):

summary(glmlp <- glm(watch ~ car, data = mydf, family=binomial(link=logit)))
Call:
glm(formula = watch ~ car, family = binomial(link = logit), data = mydf)

Deviance Residuals: 
     1       2       3       4  
-1.177  -1.177   1.177   1.177  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)  1.110e-16  1.414e+00       0        1
carTRUE     -2.220e-16  2.000e+00       0        1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 5.5452  on 3  degrees of freedom
Residual deviance: 5.5452  on 2  degrees of freedom
AIC: 9.5452

Number of Fisher Scoring iterations: 2
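Passing weights = Freq tells glm to treat each row of the aggregated data frame as Freq observations rather than one (the fit above ignores the counts, which is why its standard errors are so large). With the weighted fit, predict() can turn the coefficients into the asker's "X% chance with 95% confidence" on the probability scale. A sketch, use at your own risk:

fit <- glm(watch ~ car, data = mydf, weights = Freq,
           family = binomial(link = "logit"))

# predicted log-odds of watch == TRUE for a car owner, with standard error
pr <- predict(fit, newdata = data.frame(car = "TRUE"), se.fit = TRUE)

# back-transform a 95% Wald interval to the probability scale
plogis(pr$fit + c(-1.96, 0, 1.96) * pr$se.fit)

The interval is built on the log-odds scale and then mapped through plogis (the inverse logit), which keeps the endpoints inside (0, 1).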

Useful pages for more details on glm:

http://www.ats.ucla.edu/stat/r/dae/probit.htm

http://data.princeton.edu/R/glms.html

https://stat.ethz.ch/pipermail/r-help/2007-March/126891.html

Another approach could use resampling to get confidence intervals on the elements in the 2x2 contingency table:

set.seed(2)
car <- as.factor(sample(c("TRUE","FALSE"), 1000, replace=TRUE))
watch <- as.factor(sample(c("TRUE","FALSE"), 1000, replace=TRUE))

library(boot)
b <- boot(data.frame(car,watch), function(d,i) { table(d[i,]) }, 1000)
boot.ci(b, index=4, type="basic")


BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS

Based on 1000 bootstrap replicates

CALL : 
boot.ci(boot.out = b, type = "basic", index = 4)

Intervals : 
Level      Basic         
95%   (222, 276 )  
Calculations and Intervals on Original Scale

So the 95% CI on the joint probability of having both a watch and a car is roughly (0.22, 0.28), i.e. 222/1000 to 276/1000.
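Note that index=4 bootstraps the joint probability P(car and watch). For the conditional probability the asker actually described, P(watch | car), the statistic can be computed per resample directly. A sketch along the same lines:

library(boot)
b2 <- boot(data.frame(car, watch),
           function(d, i) {
             s <- d[i, ]
             # proportion of car owners in the resample who also own a watch
             mean(s$watch[s$car == "TRUE"] == "TRUE")
           }, 1000)
boot.ci(b2, type = "basic")

The resulting interval is on the probability scale already, so it reads directly as "with 95% confidence, between A% and B% of car owners have a watch".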

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow