you'd be looking to do a logit / probit regression. Look up the usage of glm (short for generalized linear models). Within this class of models, you'd specify the family as binomial with a probit or logit link. Type ?glm and ?family to read the descriptions of these functions. They handle missing data via the na.action parameter, which may be set to na.pass. A confidence interval for a coefficient is the estimated coefficient +- its standard error times the critical value.
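A minimal sketch of the above (my addition; the data are made up purely for illustration):

```r
# Sketch: probit regression of one TRUE/FALSE vector on another,
# with Wald 95% confidence intervals for the coefficients.
set.seed(1)
car   <- sample(c(TRUE, FALSE), 200, replace = TRUE)
watch <- sample(c(TRUE, FALSE), 200, replace = TRUE)

fit <- glm(watch ~ car, family = binomial(link = "probit"),
           na.action = na.omit)

# Wald CI: estimate +/- 1.96 * standard error
est <- coef(summary(fit))
ci <- est[, "Estimate"] +
      outer(est[, "Std. Error"], c(lower = -1.96, upper = 1.96))
ci
```

confint(fit) would give profile-likelihood intervals instead, which are generally preferable for small samples.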
statistics contingency R [closed]
31-03-2022
Question
I've got two vectors which are TRUE or FALSE: basically, data on households recording whether they own a car and whether they have a gold watch. (Note, "car" and "gold watch" are not the actual categories, but they're effective substitutes for this question).
I want to find out the relationships between car ownership and watch ownership and could use some advice for both the stats and the R in terms of which functions to use.
The idea is to be able to say: "If someone has a car, we can say with 95% confidence that there is a 25% chance they have a gold watch"
I've been messing with CrossTable (gmodels) and assocstats (vcd) and basically got myself totally confused over what I think is a standard stats question.
Any quick insights into which tests/functions should be used? I've got a correlation of .265, but want to quantify the confidence.
I've looked around a bunch including at: How do I get a contingency table? Contingency table on logistic regression in R with missing fitted values
Thanks!!
Solution
OTHER TIPS
Here's a shot at the details; use at your own risk. I'm no glm expert, but there are a few around here and maybe they'll be kind enough to point out any problems, etc.:
# reproducible data
set.seed(2)
car <- as.factor(sample(c("TRUE","FALSE"), 1000, replace=TRUE))
watch <- as.factor(sample(c("TRUE","FALSE"), 1000, replace=TRUE))
# inspect data
(mytable <- table(car,watch))
       watch
car     FALSE TRUE
  FALSE   247  250
  TRUE    254  249
summary(mytable)
Number of cases in table: 1000
Number of factors: 2
Test for independence of all factors:
Chisq = 0.06381, df = 1, p-value = 0.8006
# p = 0.80: no evidence against independence, so the variables are probably independent
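To get a statement of the form asked for in the question ("given a car, an X% chance of a watch, with 95% confidence"), one option (my addition, not part of the original answer) is a confidence interval on the conditional proportion taken straight from the table:

```r
# Sketch: 95% CI for P(watch | car) from the contingency table counts.
set.seed(2)
car   <- as.factor(sample(c("TRUE","FALSE"), 1000, replace = TRUE))
watch <- as.factor(sample(c("TRUE","FALSE"), 1000, replace = TRUE))
mytable <- table(car, watch)

# among car owners: 249 watch owners out of 254 + 249 = 503
owners <- mytable["TRUE", ]
ci <- prop.test(owners["TRUE"], sum(owners))$conf.int
ci
```

prop.test uses the Wilson score interval with continuity correction; binom.test would give an exact binomial interval instead.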
# reshape for glm
(mydf <- as.data.frame(mytable))
    car watch Freq
1 FALSE FALSE  247
2  TRUE FALSE  254
3 FALSE  TRUE  250
4  TRUE  TRUE  249
Model as suggested by Aditya Sihag:
summary(glmlp <- glm(watch ~ car, data = mydf, family=binomial(link=logit)))
Call:
glm(formula = watch ~ car, family = binomial(link = logit), data = mydf)
Deviance Residuals:
1 2 3 4
-1.177 -1.177 1.177 1.177
Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)  1.110e-16  1.414e+00       0        1
carTRUE     -2.220e-16  2.000e+00       0        1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 5.5452 on 3 degrees of freedom
Residual deviance: 5.5452 on 2 degrees of freedom
AIC: 9.5452
Number of Fisher Scoring iterations: 2
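A caveat (my addition, not from the original answer): the fit above treats the four aggregated rows of mydf as only four observations, which is why the standard errors are so large. With aggregated data, the cell counts should enter as weights so the model sees all 1000 households:

```r
# Sketch: refit using the Freq column as case weights.
mydf <- data.frame(
  car   = factor(c("FALSE","TRUE","FALSE","TRUE")),
  watch = factor(c("FALSE","FALSE","TRUE","TRUE")),
  Freq  = c(247, 254, 250, 249)
)
fit <- glm(watch ~ car, weights = Freq, data = mydf,
           family = binomial(link = "logit"))

# predicted P(watch | car = TRUE) with a Wald 95% CI,
# built on the link scale and back-transformed with plogis
nd <- data.frame(car = factor("TRUE", levels = c("FALSE","TRUE")))
p  <- predict(fit, newdata = nd, type = "link", se.fit = TRUE)
pred_ci <- plogis(p$fit + c(-1.96, 1.96) * p$se.fit)
pred_ci
```

This directly yields the kind of statement the question asks for: a point estimate of P(watch | car) with a 95% interval around it.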
Useful pages for more details on glm:
http://www.ats.ucla.edu/stat/r/dae/probit.htm
http://data.princeton.edu/R/glms.html
https://stat.ethz.ch/pipermail/r-help/2007-March/126891.html
Another approach could use resampling to get confidence intervals on the elements in the 2x2 contingency table:
set.seed(2)
car <- as.factor(sample(c("TRUE","FALSE"), 1000, replace=TRUE))
watch <- as.factor(sample(c("TRUE","FALSE"), 1000, replace=TRUE))
library(boot)
b <- boot(data.frame(car,watch), function(d,i) { table(d[i,]) }, 1000)
boot.ci(b, index=4, type="basic")
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates
CALL :
boot.ci(boot.out = b, type = "basic", index = 4)
Intervals :
Level Basic
95% (222, 276 )
Calculations and Intervals on Original Scale
So, the 95% CI on the probability of having both a watch and a car is about (0.22, 0.28). Note this is the joint probability for that cell, not the conditional probability P(watch | car).
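The same resampling idea can target the conditional probability the question actually asks about, P(watch | car), by bootstrapping the proportion directly; a sketch (my addition):

```r
# Sketch: bootstrap CI for the share of car owners who own a watch.
library(boot)
set.seed(2)
car   <- as.factor(sample(c("TRUE","FALSE"), 1000, replace = TRUE))
watch <- as.factor(sample(c("TRUE","FALSE"), 1000, replace = TRUE))

# statistic: among resampled households with a car, the watch-owning fraction
cond_prop <- function(d, i) {
  s <- d[i, ]
  mean(s$watch[s$car == "TRUE"] == "TRUE")
}
b  <- boot(data.frame(car, watch), cond_prop, 1000)
ci <- boot.ci(b, type = "basic")
ci
```

The interval from boot.ci is then directly interpretable as "with 95% confidence, a car owner has between X% and Y% chance of owning a watch."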