solutions for factor levels when using linear models in R

https://stackoverflow.com//questions/9657839

11-12-2019
|

Question

I am running linear models to look at the significance of independent factors involved. The example model is: `

mymod1 <- lm(temp ~ bgrp+psex+tb,data=mydat)
summary(mymod1)`

I look at the summary to check out the significance of each factor:

lm(formula = temp ~ bgrp + psex + tb, data = mydat)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.6877 -0.2454  0.0768  0.3916  1.6561 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 37.324459   0.186081 200.581  < 2e-16 ***
bgrp         0.256794   0.066167   3.881 0.000115 ***
psex         0.144669   0.055140   2.624 0.008913 ** 
tb           0.019818   0.009342   2.121 0.034287 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.6888 on 621 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared: 0.03675,    Adjusted R-squared: 0.03209 
F-statistic: 7.897 on 3 and 621 DF,  p-value: 3.551e-05

Now, I would like to look at the solutions of the two levels of bgrp (1 and 2) and psex (1 and 2).

I would appreciate if you could help me with this.

Thanking you in advance,

Baz

EDIT:

I ran the first model you suggested and got the following:

mydat$bgrp <- as.factor(mydat$bgrp)

> summary(lm(temp ~ bgrp+psex+tb-1,data=mydat))

Call:
lm(formula = temp ~ bgrp + psex + tb - 1, data = apirt)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.6877 -0.2454  0.0768  0.3916  1.6561 

Coefficients:
       Estimate Std. Error t value Pr(>|t|)    
bgrp1 37.725922   0.135486 278.449  < 2e-16 ***
bgrp2 37.982716   0.129558 293.171  < 2e-16 ***
psex2  0.144669   0.055140   2.624  0.00891 ** 
tb     0.019818   0.009342   2.121  0.03429 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.6888 on 621 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared: 0.9997,     Adjusted R-squared: 0.9997 
F-statistic: 4.788e+05 on 4 and 621 DF,  p-value: < 2.2e-16

From the above coefficient table, bgrp1 and bgrp2 seem to make sense: bgrp1 represents the maternal lines with larger litter sizes, lighter offsprings, which results in lower rectal temperature (37.70 degrees C) of the offspring. On the other hand, bgrp2 represents the terminal lines with smaller litter size, heavier offsprings, which results in higher a rectal temperature (37.98 degrees C). I am just wondering, if the same could be done for psex1 and psex2, but what is presented in the table of coefficients could be due to what you said earlier.

EDIT: Hi Mark,

I tried the two options you suggested and I could see that bgrp1 and psex1 are taking on the same values:

> mybgrp <- lm(formula = temp ~ bgrp+psex+tb-1, data = mydat)
> mybgrp

Call:
lm(formula = temp ~ bgrp + psex + tb - 1, data = mydat)

Coefficients:
   bgrp1     bgrp2     psex2        tb  
37.72592  37.98272   0.14467   0.01982  

> summary(mybgrp)

Call:
lm(formula = temp ~ bgrp + psex + tb - 1, data = mydat)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.6877 -0.2454  0.0768  0.3916  1.6561 

Coefficients:
       Estimate Std. Error t value Pr(>|t|)    
bgrp1 37.725922   0.135486 278.449  < 2e-16 ***
bgrp2 37.982716   0.129558 293.171  < 2e-16 ***
psex2  0.144669   0.055140   2.624  0.00891 ** 
tb     0.019818   0.009342   2.121  0.03429 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.6888 on 621 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared: 0.9997,     Adjusted R-squared: 0.9997 
F-statistic: 4.788e+05 on 4 and 621 DF,  p-value: < 2.2e-16 

> mypsex <- lm(formula = temp ~ psex+bgrp+tb-1, data = mydat)
> mypsex

Call:
lm(formula = temp ~ psex + bgrp + tb - 1, data = mydat)

Coefficients:
   psex1     psex2     bgrp2        tb  
37.72592  37.87059   0.25679   0.01982  

> summary(mypsex)

Call:
lm(formula = temp ~ psex + bgrp + tb - 1, data = mydat)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.6877 -0.2454  0.0768  0.3916  1.6561 

Coefficients:
       Estimate Std. Error t value Pr(>|t|)    
psex1 37.725922   0.135486 278.449  < 2e-16 ***
psex2 37.870591   0.135908 278.649  < 2e-16 ***
bgrp2  0.256794   0.066167   3.881 0.000115 ***
tb     0.019818   0.009342   2.121 0.034287 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.6888 on 621 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared: 0.9997,     Adjusted R-squared: 0.9997 
F-statistic: 4.788e+05 on 4 and 621 DF,  p-value: < 2.2e-16

Thanks!

Solution

If there are only two levels to a variable (1 vs 2), that is the same as (0 vs 1) and the slope is for one of those 2 levels. The other level of the variable is included in the intercept term.

Maybe try

lm(formula = temp ~ bgrp + psex + tb - 1 , data = mydat)

to remove the intercept and see if that gives you what you want.

Then again, maybe I do not understand your question correctly.

Edit:

When I use fake data and set

bgrp <- as.factor(bgrp)
psex <- as.factor(psex)

without an intercept I get 2 slopes for one of the 2 factors. I believe R is holding the second slope for the second factor = 0.

Edit2:

This model will provide separate slopes for each combination of bgrp and psex. The model includes an interaction between bgrp and psex and then removes the intercept and the bgrp and psex main effects:

mymod2 <- lm(temp ~ bgrp + psex + bgrp * psex + tb - 1 - bgrp - psex)

Edit3:

If you are used to using SAS and try running the same analysis in SAS and R you might find that the two programs initially do not return the same estimates. That may be because SAS and R select different factor levels for the intercept by default. You can alter the default factor level for the intercept in R to match that used by SAS and then you might find that both programs give the same answer.

Compare the following R code with the SAS output from here:

http://support.sas.com/kb/38/384.html

where the SAS code makes use of the option 'solution':

my.data <- matrix(c(
'A', 'F',   9, 25,  
'A', 'F',   3, 19,  
'A', 'F',   4, 18,  
'A', 'F',  11, 28,  
'A', 'F',   7, 23,
'A', 'M',  11, 27,  
'A', 'M',   9, 24,  
'A', 'M',   9, 25,  
'A', 'M',  10, 28,  
'A', 'M',  10, 26,
'D', 'F',   4, 37,  
'D', 'F',  12, 54,  
'D', 'F',   3, 33,  
'D', 'F',   6, 41,  
'D', 'F',   9, 47,
'D', 'M',   5, 36,  
'D', 'M',   4, 36,  
'D', 'M',   7, 40,  
'D', 'M',  10, 46,  
'D', 'M',   8, 42,
'G', 'F',  10, 70,  
'G', 'F',  11, 75,  
'G', 'F',   7, 60,  
'G', 'F',   9, 69,  
'G', 'F',  10, 71,
'G', 'M',   3, 47,  
'G', 'M',   8, 60,  
'G', 'M',  11, 70,  
'G', 'M',   4, 49,  
'G', 'M',   4, 50
), nrow = 30, byrow=T, 
dimnames = list(NULL, c("drug","gender","x","y")));


my.data <- as.data.frame(my.data, stringsAsFactors=F)
my.data

my.data$y      <- as.numeric(my.data$y)
my.data$x      <- as.numeric(my.data$x)
my.data$drug   <- as.factor(my.data$drug)
my.data$gender <- as.factor(my.data$gender)

str(my.data)

my.data$drug   <- relevel(my.data$drug, ref="G")
my.data$gender <- relevel(my.data$gender, ref="M")



my.mod1 <- lm(my.data$y ~ my.data$drug)
my.mod1
summary(my.mod1)

my.mod2 <- lm(my.data$y ~ my.data$drug-1)
my.mod2
summary(my.mod2)

my.mod3 <- lm(my.data$y ~ my.data$drug + my.data$gender + 
                          my.data$drug * my.data$gender )
my.mod3
summary(my.mod3)

my.mod4 <- lm(my.data$y ~ my.data$drug + my.data$gender + 
                          my.data$drug * my.data$gender - 1 )
my.mod4
summary(my.mod4)

my.mod5 <- lm(my.data$y ~ my.data$drug + my.data$x + 
                          my.data$drug * my.data$x )
my.mod5
summary(my.mod5)

my.mod6 <- lm(my.data$y ~ my.data$drug + my.data$x + 
                          my.data$drug * my.data$x - 1 )
my.mod6
summary(my.mod6)

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow