سؤال

I would like a second order(? is it) regression line plotted through zero and crucially I need the equation for this relationship.

Here's my data:

ecoli_ug_ml  A420 rpt
1            0 0.000   1
2           10 0.129   1
3           20 0.257   1
4           30 0.379   1
5           40 0.479   1
6           50 0.579   1
7           60 0.673   1
8           70 0.758   1
9           80 0.838   1
10          90 0.912   1
11         100 0.976   1
12           0 0.000   2
13          10 0.126   2
14          20 0.257   2
15          30 0.382   2
16          40 0.490   2
17          50 0.592   2
18          60 0.684   2
19          70 0.772   2
20          80 0.847   2
21          90 0.917   2
22         100 0.977   2
23           0 0.000   3
24          10 0.125   3
25          20 0.258   3
26          30 0.376   3
27          40 0.488   3
28          50 0.582   3
29          60 0.681   3
30          70 0.768   3
31          80 0.846   3
32          90 0.915   3
33         100 0.977   3

My plot looks like this: (sci2 is just some axis and text formatting, can include if necessary)

ggplot(calib, aes(ecoli_ug_ml, A420)) +
geom_point(shape=calib$rpt) +
stat_smooth(method="lm", formula=y~poly(x - 1,2)) +
scale_x_continuous(expression(paste(italic("E. coli"),~"concentration, " ,mu,g~mL^-1,))) +
scale_y_continuous(expression(paste(Absorbance["420nm"], ~ ", a.u."))) +
sci2

When I view this, the fit of this line to the points is spectacularly good.

When I check out coef, I think there is non-zero y-intercept (which is unacceptable for my purposes) but to be honest I don't really understand these lines:

coef(lm(A420 ~ poly(ecoli_ug_ml, 2, raw=TRUE), data = calib))                 
                  (Intercept) poly(ecoli_ug_ml, 2, raw = TRUE)1 
                -1.979021e-03                      1.374789e-02 
poly(ecoli_ug_ml, 2, raw = TRUE)2 
                -3.964258e-05 

Therefore, I assume the plot is actually not quite right either.

So, what I need is to generate a regression line forced through zero and get the equation for it, and, understand what it's saying when it gives me said equation. If I could annotate the plot area with the equation directly I would be incredibly stoked.

I have spent approximately 8 hours trying to work this out now, I checked excel and got a formula in 8 seconds but I would really like to get into using R for this. Thanks!

To clarify: the primary purpose of this plot is not to demonstrate the distribution of these data but rather to provide a visual confirmation that the equation I generate from these points fits the readings well

هل كانت مفيدة؟

المحلول

summary(lm(A420~poly(ecoli_ug_ml,2,raw=T),data=calib))

# Call:
# lm(formula = A420 ~ poly(ecoli_ug_ml, 2, raw = T), data = calib)
# ...
# Coefficients:
#                                  Estimate Std. Error t value Pr(>|t|)    
# (Intercept)                    -1.979e-03  1.926e-03  -1.028    0.312    
# poly(ecoli_ug_ml, 2, raw = T)1  1.375e-02  8.961e-05 153.419   <2e-16 ***
# poly(ecoli_ug_ml, 2, raw = T)2 -3.964e-05  8.631e-07 -45.932   <2e-16 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

# Residual standard error: 0.004379 on 30 degrees of freedom
# Multiple R-squared:  0.9998,  Adjusted R-squared:  0.9998 
# F-statistic: 8.343e+04 on 2 and 30 DF,  p-value: < 2.2e-16

So the intercept is not exactly 0 but it is small compared to the Std. Error. In other words, the intercept is not significantly different from 0.

You can force a fit without the intercept this way (note the -1 in the formula):

summary(lm(A420~poly(ecoli_ug_ml,2,raw=T)-1,data=calib))

# Call:
# lm(formula = A420 ~ poly(ecoli_ug_ml, 2, raw = T) - 1, data = calib)
# ...
# Coefficients:
#                                  Estimate Std. Error t value Pr(>|t|)    
# poly(ecoli_ug_ml, 2, raw = T)1  1.367e-02  5.188e-05  263.54   <2e-16 ***
# poly(ecoli_ug_ml, 2, raw = T)2 -3.905e-05  6.396e-07  -61.05   <2e-16 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

# Residual standard error: 0.004383 on 31 degrees of freedom
# Multiple R-squared:      1,   Adjusted R-squared:      1 
# F-statistic: 3.4e+05 on 2 and 31 DF,  p-value: < 2.2e-16

Note that the coefficients do not change appreciably.

EDIT (Response to OP's comment)

The formula specified in stat_smooth(...) is just passed directly to the lm(...) function, so you can specify in stat_smooth(...) any formula that works in lm(...). The point of the results above is that, even without forcing the intercept to 0, it is extremely small (-2e-3) compared to the range in y (0-1), so plotting curves with and without will give nearly indistinguishable results. You can see this for yourself by running this code:

ggplot(calib, aes(ecoli_ug_ml, A420)) +
  geom_point(shape=calib$rpt) +
  stat_smooth(method="lm", formula=y~poly(x,2,raw=T),colour="red") +
  stat_smooth(method="lm", formula=y~-1+poly(x,2,raw=T),colour="blue") +
  scale_x_continuous(expression(paste(italic("E. coli"),~"concentration, " ,mu,g~mL^-1,))) +
  scale_y_continuous(expression(paste(Absorbance["420nm"], ~ ", a.u.")))

The blue and red curves are nearly, but not quite on top of each other (you may have to open up your plot window to see it). And no, you do not have to do this "outside of ggplot."

The problem you reported relates to using the default raw=F. This causes poly(...) to use orthogonal polynomials, which by definition have constant terms. So using y~-1+poly(x,2) doesn't really make sense, whereas using y~-1+poly(x,2,raw=T) does make sense.

Finally, if all this business of using poly(...) with or without raw=T is causing confusion, you can achieve the exact same result using formula = y~ -1 + x + I(x^2). This fits a second order polynomial (a*x +b*x^2) and suppresses the constant term.

نصائح أخرى

I think you are misinterpreting that Intercept and also how stat_smooth works. Polynomial fits done by statisticians typically do not use the raw=TRUE parameter. The default is FALSE and the polynomials are constructed to be orthogonal to allow proper statistical assessment of the fit improvement when looking at the standard errors. It is instructive to look at what happens if you attempt to eliminate the Intercept by using -1 or 0+ in the formula. Try with your data and code to get rid of the intercept:

   ....+
   stat_smooth(method="lm", formula=y~0+poly(x - 1,2)) + ...

You will see the fitted line intercepting the y axis at -0.5 and change. Now look at the non-raw value of the intercept.

  coef(lm(A420~poly(ecoli_ug_ml,2),data=ecoli))
      (Intercept) poly(ecoli_ug_ml, 2)1 poly(ecoli_ug_ml, 2)2 
        0.5466667             1.7772858            -0.2011251 

So the intercept is shifting the whole curve upward to let the polynomial fit have the best fitting curvature. If you want to draw a line with ggplot2 that meets some different specification you should calculate it outside of ggplot2 and then plot it without the error bands because it really won't have the proper statistical properties.

Nonetheless, this is the way to apply what in this case is a trivial amount of adjustment and I am offering it only as an illustration of how to add an externally derived set of values. I think _ad_hoc_ adjustments like this are dangerous in practice:

 mod <-   lm(A420~poly(ecoli_ug_ml,2), data=ecoli)
 ecoli$shifted_pred <- predict(mod) - predict( mod, newdata=list(ecoli_ug_ml=0))
 ggplot(ecoli, aes(ecoli_ug_ml, A420)) +
    geom_point(shape=ecoli$rpt)  +
    scale_x_continuous(expression(paste(italic("E. coli"),~"concentration, " ,mu,g~mL^-1,))) +
    scale_y_continuous(expression(paste(Absorbance["420nm"], ~ ", a.u.")))+
    geom_line(data=ecoli, aes(x= ecoli_ug_ml, y=shifted_pred ) )
مرخصة بموجب: CC-BY-SA مع الإسناد
لا تنتمي إلى StackOverflow
scroll top