rates$counts <- rates$N*rates$Rate
rates$counts <- round(rates$counts,0)
rates
#----------
  Black White   N   Rate counts
1  TRUE FALSE 512 0.2344    120
2 FALSE  TRUE 529 0.2098    111
3  TRUE  TRUE 495 0.1919     95
4 FALSE FALSE 510 0.1882     96
> rates$failures <- rates$N - rates$counts
> glm(cbind(counts,failures)~Black*White, data=rates, family="binomial")
Call:  glm(formula = cbind(counts, failures) ~ Black * White, family = "binomial",
    data = rates)

Coefficients:
        (Intercept)            BlackTRUE            WhiteTRUE
            -1.4615               0.2777               0.1356
BlackTRUE:WhiteTRUE
            -0.3894

Degrees of Freedom: 3 Total (i.e. Null);  0 Residual
Null Deviance:       4.104
Residual Deviance: -7.461e-14   AIC: 33.05
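The fit above has four cells and four parameters (0 residual degrees of freedom), so it is saturated and the coefficients should simply reproduce contrasts of the observed log-odds in each cell. A minimal check, assuming the rounded counts printed above; the names `cells`, `logit`, and `by_hand` are illustrative and not part of the original answer:

```r
# Rebuild the aggregated data with the rounded counts shown above.
cells <- data.frame(
  Black  = c(TRUE, FALSE, TRUE, FALSE),
  White  = c(FALSE, TRUE, TRUE, FALSE),
  N      = c(512, 529, 495, 510),
  counts = c(120, 111, 95, 96)
)
cells$failures <- cells$N - cells$counts

fit <- glm(cbind(counts, failures) ~ Black * White,
           data = cells, family = binomial)

# Observed log-odds per cell: log(successes / failures).
logit <- log(cells$counts / cells$failures)
names(logit) <- c("B", "W", "BW", "none")

# In a saturated 2x2 model the coefficients are contrasts of these log-odds.
by_hand <- c(
  logit[["none"]],                                               # (Intercept)
  logit[["B"]] - logit[["none"]],                                # BlackTRUE
  logit[["W"]] - logit[["none"]],                                # WhiteTRUE
  logit[["BW"]] - logit[["B"]] - logit[["W"]] + logit[["none"]]  # interaction
)

# Side-by-side comparison of the fitted and hand-computed values.
print(round(cbind(glm = coef(fit), by_hand = by_hand), 4))
```

If the two columns agree, the model is doing nothing more than re-expressing the four observed conversion rates on the logit scale.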
Elegantly convert rate summary rows into long binary-response rows?
28-06-2022
Question
Background: I am running a small A/B test with 2x2 factors (the foreground's black and the background's white, each either off-color or normal color), and Analytics reports the number of hits for each of the 4 conditions and the rate at which they 'converted' (a binary variable, which I define as spending at least 40 seconds on the page). It's easy enough to do a little editing and get this into a nice R dataframe:
rates <- read.csv(stdin(),header=TRUE)
Black,White,N,Rate
TRUE,FALSE,512,0.2344
FALSE,TRUE,529,0.2098
TRUE,TRUE,495,0.1919
FALSE,FALSE,510,0.1882
Naturally, I'd like to look at a logistic regression on something like Rate ~ Black * White, but R's glm wants a dataframe of 2046 rows, each reporting a TRUE or FALSE conversion value and the values of Black and White. This... is a little more tricky. I googled around and checked SO, but while I found some clunky code on how to convert a table of contingency counts to a dataframe, I didn't find anything about percentages/rates.
After a lot of trouble, I came up with a loop over the 4 conditions in which I repeat a dataframe rate * n times with the relevant condition values and the result TRUE, then do the same thing for (1 - rate) * n and the result FALSE, and then stitch together all 8 dataframes into one giant dataframe:
ground <- NULL
for (i in 1:nrow(rates)) {
    x <- rates[i,]
    # rate * N rows with Conversion=TRUE for this condition
    y <- do.call("rbind", replicate((x$N * x$Rate), data.frame(Black=c(x$Black),White=c(x$White),Conversion=c(TRUE)), simplify = FALSE))
    # (1 - rate) * N rows with Conversion=FALSE for this condition
    z <- do.call("rbind", replicate((x$N * (1-x$Rate)), data.frame(Black=c(x$Black),White=c(x$White),Conversion=c(FALSE)), simplify = FALSE))
    ground <- rbind(ground,y,z)
}
The resulting dataframe ground looks right:
sum(rates$N)
[1] 2046
nrow(ground)
[1] 2042
# the missing 4 are probably from the rounding-off of the reported conversion rate
summary(ground); head(ground, n=20)
   Black           White         Conversion
 Mode :logical   Mode :logical   Mode :logical
 FALSE:1037      FALSE:1020      FALSE:1623
 TRUE :1005      TRUE :1022      TRUE :419
 NA's :0         NA's :0         NA's :0
   Black White Conversion
1   TRUE FALSE       TRUE
2   TRUE FALSE       TRUE
3   TRUE FALSE       TRUE
4   TRUE FALSE       TRUE
5   TRUE FALSE       TRUE
6   TRUE FALSE       TRUE
7   TRUE FALSE       TRUE
8   TRUE FALSE       TRUE
9   TRUE FALSE       TRUE
10  TRUE FALSE       TRUE
11  TRUE FALSE       TRUE
12  TRUE FALSE       TRUE
13  TRUE FALSE       TRUE
14  TRUE FALSE       TRUE
15  TRUE FALSE       TRUE
16  TRUE FALSE       TRUE
17  TRUE FALSE       TRUE
18  TRUE FALSE       TRUE
19  TRUE FALSE       TRUE
20  TRUE FALSE       TRUE
And likewise, the logistic regression spits out a sane-looking answer:
g <- glm(Conversion ~ Black*White, family=binomial, data=ground); summary(g)
...
Deviance Residuals:
   Min      1Q  Median      3Q     Max
-0.732  -0.683  -0.650  -0.643   1.832

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)
(Intercept)           -1.472      0.114  -12.94   <2e-16
BlackTRUE              0.291      0.154    1.88    0.060
WhiteTRUE              0.137      0.156    0.88    0.381
BlackTRUE:WhiteTRUE   -0.404      0.220   -1.84    0.066

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2072.7  on 2041  degrees of freedom
Residual deviance: 2068.2  on 2038  degrees of freedom
AIC: 2076

Number of Fisher Scoring iterations: 4
So my question is: is there any more elegant way of turning my Analytics rate data into glm input than that awful loop?
Solution
Other tips
One thing is how to convert your data. Another is why. From ?glm: "[f]or binomial [...] famil[y] the response can [...] be specified as a factor (when the first level denotes failure and all others success) or as a two-column matrix with the columns giving the numbers of successes and failures." The first way corresponds to your "R's glm wants a dataframe of 2046 rows each reporting a TRUE or FALSE conversion". The second way basically corresponds to your original data set, where the successes can easily be calculated from Rate and N. A third way is to use the proportion of successes per treatment combination as the response variable, in which case the number of trials must be supplied as the weights argument.
set.seed(1)
# one row per observation
df1 <- data.frame(x = sample(c("yes", "no"), 40, replace = TRUE),
                  y = sample(c("yes", "no"), 40, replace = TRUE),
                  z = rbinom(n = 40, size = 1, prob = 0.5))
df1
library(plyr)
# aggregated data with one row per treatment combination
df2 <- ddply(.data = df1, .variables = .(x, y), summarize,
             n = length(z),
             rate = sum(z)/n,
             success = n*rate,
             failure = n - success)
df2
# three different ways to specify the models,
# which all give the same parameter estimates for x, y and x*y
mod1 <- glm(z ~ x * y, data = df1, family = binomial)
mod2 <- glm(cbind(success, failure) ~ x * y, data = df2, family = binomial)
mod3 <- glm(rate ~ x * y, data = df2, weights = n, family = binomial)
summary(mod1)
summary(mod2)
summary(mod3)
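Applied to the data in the question, the two aggregated specifications might look like the sketch below. It uses base R only and assumes that rounding N * Rate to the nearest integer recovers the true conversion counts; the names `fit_counts` and `fit_prop` are illustrative. The proportion is recomputed as counts / N rather than taken from the reported Rate, because a rounded rate times N is generally not a whole number and R then warns about non-integer successes.

```r
# The aggregated data from the question.
rates <- data.frame(
  Black = c(TRUE, FALSE, TRUE, FALSE),
  White = c(FALSE, TRUE, TRUE, FALSE),
  N     = c(512, 529, 495, 510),
  Rate  = c(0.2344, 0.2098, 0.1919, 0.1882)
)

# Assumption: the reported rates are rounded, so round back to whole counts.
rates$counts   <- round(rates$N * rates$Rate)
rates$failures <- rates$N - rates$counts

# Second way: two-column (successes, failures) response.
fit_counts <- glm(cbind(counts, failures) ~ Black * White,
                  data = rates, family = binomial)

# Third way: proportion response, number of trials supplied as weights.
fit_prop <- glm(counts / N ~ Black * White,
                data = rates, weights = N, family = binomial)

# Both specifications should yield the same coefficient estimates.
print(cbind(counts = coef(fit_counts), proportion = coef(fit_prop)))
```

Either form avoids building the 2046-row data frame at all, which is usually the simpler route when only the aggregated numbers are available.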
It's not quite clear what you're converting, but if all you need is n rows for each value n in column N, then:
EDIT: I was very sloppy. First, convert all factors in your original file to numeric or character as appropriate. Then:
# just put in placeholder values
newdf <- data.frame(Black="n", White="n", Rate=0, stringsAsFactors=FALSE)
# rows 1..N[1] take the values of the first condition
newdf[1:rates[1,3],] <- rates[1,c(1,2,4)]
# the next N[2] rows take the values of the second condition
newdf[(rates[1,3]+1):(rates[1,3]+rates[2,3]),] <- rates[2,c(1,2,4)]
and so on for each row in your original rates dataframe.
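The row-by-row assignment can be generalised by indexing the data frame with repeated row numbers. The following is a sketch rather than a definitive recipe: it assumes the true conversion counts are round(N * Rate), and the names `long` and `fit_long` are illustrative. Rounding also sidesteps the fractional repeat counts in the question's loop (for example 512 * 0.2344 = 120.0128 and 512 * 0.7656 = 391.9872), which appear to be truncated and would account for the 4 missing rows.

```r
# The aggregated data from the question.
rates <- data.frame(
  Black = c(TRUE, FALSE, TRUE, FALSE),
  White = c(FALSE, TRUE, TRUE, FALSE),
  N     = c(512, 529, 495, 510),
  Rate  = c(0.2344, 0.2098, 0.1919, 0.1882)
)

# Assumption: rounding recovers the whole number of conversions per condition.
counts <- round(rates$N * rates$Rate)

# Repeat each condition row N times by indexing with repeated row numbers.
long <- rates[rep(seq_len(nrow(rates)), times = rates$N), c("Black", "White")]
rownames(long) <- NULL

# Within each condition: `counts` TRUEs followed by `N - counts` FALSEs,
# built in the same condition order as the repeated rows above.
long$Conversion <- unlist(Map(
  function(k, n) rep(c(TRUE, FALSE), times = c(k, n - k)),
  counts, rates$N
))

nrow(long)  # should equal sum(rates$N)

# One row per observation, as in the question's model.
fit_long <- glm(Conversion ~ Black * White, data = long, family = binomial)
coef(fit_long)
```

The coefficient estimates from this long format should match those from the two-column cbind(counts, failures) fit, since both encode the same successes and failures per condition.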