Multiple Linear Regression with Dichotomous Predictor Variables in R: to dummy-code or let R handle it?

StackOverflow https://stackoverflow.com/questions/22567410

I am running a multiple linear regression for a course using R. One of the predictors I want to include is the sex of the individual, coded "m" and "f". I ran the model in R two different ways:

Model 1: "Sex" as the original categorical variable R

lm(formula = P_iP_Choice ~ Sex + Carapace + Competitor_Presence_BI + 
    PSI_Day1_Choice + AGG_AVERAGE, data = pano2014)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.55241 -0.12879 -0.04414  0.13769  0.67394 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)            -0.43031    0.23872  -1.803 0.074353 .  
Sexm                   -0.28566    0.04685  -6.098 1.86e-08 ***
Carapace                0.15558    0.04534   3.431 0.000863 ***
Competitor_Presence_BI -0.03339    0.04532  -0.737 0.462870    
PSI_Day1_Choice         0.15825    0.13029   1.215 0.227273    
AGG_AVERAGE             0.15406    0.07790   1.978 0.050604 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.2375 on 104 degrees of freedom
Multiple R-squared: 0.3146, Adjusted R-squared: 0.2817 
F-statistic: 9.549 on 5 and 104 DF,  p-value: 1.611e-07 

Model 2: Sex recoded by hand as a numeric variable "Female", coded 0 = male, 1 = female.

lm(formula = P_iP_Choice ~ Female + Carapace + Competitor_Presence_BI + 
    PSI_Day1_Choice + AGG_AVERAGE, data = pano2014)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.55241 -0.12879 -0.04414  0.13769  0.67394 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)            -0.71597    0.24498  -2.923 0.004260 ** 
Female                  0.28566    0.04685   6.098 1.86e-08 ***
Carapace                0.15558    0.04534   3.431 0.000863 ***
Competitor_Presence_BI -0.03339    0.04532  -0.737 0.462870    
PSI_Day1_Choice         0.15825    0.13029   1.215 0.227273    
AGG_AVERAGE             0.15406    0.07790   1.978 0.050604 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.2375 on 104 degrees of freedom
Multiple R-squared: 0.3146, Adjusted R-squared: 0.2817 
F-statistic: 9.549 on 5 and 104 DF,  p-value: 1.611e-07

My understanding is that the difference in the intercept arises because in Model 1 R dummy-coded my categorical variable into a dichotomous one, so the variation in my response variable associated with one sex gets absorbed into the intercept, depending on which level R picked as the baseline for "Sex". In Model 2, however, the change does not affect the coefficient estimates for the other terms in my model.

What I would like to know is: what is the "correct", or widely accepted, way to use dichotomous categorical variables in linear models? Dummy-coding them yourself, or letting R dummy-code them?


Solution

Either way is correct (assuming you do the manual coding properly), but there is a caveat. R supports several coding schemes ("contrasts") for categorical variables: dummy coding, deviation coding, Helmert coding, and so on. What changes between these schemes is the meaning of the intercept and the interpretation of the parameters. With dummy coding, for instance, you compare every category against a single base category, and the intercept is the mean of that base category (all other predictors being zero). With deviation coding, your intercept is the grand (!) mean, and your parameters are deviations from that grand mean. For example, if you are conducting a country-level analysis, it is not always useful to compare every country against, say, France. Instead, you might want to compare each country to some overall mean, say, for the European Union.

This also goes for dichotomous variables. Do you want to compare men to women, or would you rather compare men and women each to the grand mean? Both are feasible, depending on your research context.
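A minimal sketch with made-up data (the variable names here are invented for illustration) showing both interpretations on a dichotomous factor:

## Made-up data: one dichotomous factor under two contrast schemes.
set.seed(1)
d <- data.frame(y   = rnorm(100),
                sex = factor(sample(c("f", "m"), 100, replace = TRUE)))

## Default treatment ("dummy") coding: the intercept is the mean of the
## base level ("f"); "sexm" is the m-minus-f difference.
coef(lm(y ~ sex, data = d))

## Deviation ("sum") coding: the intercept is the grand mean of the group
## means; the single coefficient is each group's deviation from it.
contrasts(d$sex) <- contr.sum(2)
coef(lm(y ~ sex, data = d))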

Now, manual coding done properly introduces no error. But you cannot quickly switch from one coding system to another; you would have to recode everything by hand. With more complex coding systems, doing it manually also leaves more room for mistakes. This may not matter much for dichotomous variables, but if you have more categories, creating dummies manually will clutter up your dataset and may cause confusion when you return to your analysis a few months later. Those are a few arguments for letting R do the coding.
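If you want to see what R builds behind the scenes, inspect the design matrix; a small sketch with an invented four-level factor:

## Invented example: model.matrix() shows the dummy columns R creates on
## the fly, without adding extra columns to your data frame.
country <- factor(c("DE", "FR", "FR", "IT", "NL", "DE"))
model.matrix(~ country)

## Switching the base category is one line, not a recoding job:
country <- relevel(country, ref = "FR")
model.matrix(~ country)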

You can find additional information on coding systems in R here. It is a useful read and gives you more flexibility within the context of regression. Good luck!

Other tips

Just to expand a bit on @BenBolker's comment.

In your first model, R takes Sex=f as the baseline and reports an intercept of -0.43031. If Sex=m, the whole model is shifted by -0.28566 (the coefficient of Sexm). So Sexm is not the "impact of males"; it is the difference between the fits for Sex=f and Sex=m. None of the other parameters are affected because you have a linear model with no interactions. So when Sex=m you get an identical model, but with intercept -0.43031 + (-0.28566) = -0.71597.

In your second model, Female is a numeric predictor. The intercept applies when Female=0 (i.e., Sex=m) and, at -0.71597, is equivalent to the first model. Again, none of the other parameters differ because this is a linear model with no interactions.
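A quick way to convince yourself of the equivalence, assuming the pano2014 data frame and column names from the question (the manual Female column below is my guess at how it was built):

## Sketch: build the 0/1 dummy by hand and confirm both models agree.
pano2014$Female <- as.numeric(pano2014$Sex == "f")

m1 <- lm(P_iP_Choice ~ Sex + Carapace + Competitor_Presence_BI +
           PSI_Day1_Choice + AGG_AVERAGE, data = pano2014)
m2 <- lm(P_iP_Choice ~ Female + Carapace + Competitor_Presence_BI +
           PSI_Day1_Choice + AGG_AVERAGE, data = pano2014)

all.equal(fitted(m1), fitted(m2))  # TRUE: identical fits and residuals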

IMO the "correct" way depends on your audience. The idiomatic way to deal with categorical variables is the first - make it a factor. However I have found that with non-technical, or "less-technical" audiences the second way is much easier to explain and understand. Note of course that this applies to dichotomous variables only - if your categorical variable can take on more than two values you must use factors.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow