Question

I've been using the aov() function in R for ages. I always input my data via .csv files, and have never bothered converting any of the variables to 'factor'.

Recently I did just that: I converted the variables to factors and re-ran aov(), and the results are now different.

My data are ordered categories: 0, 1, 2. Whether the factor levels are unordered or ordered makes no difference; both give results that differ from using the variable without converting it to a factor.

Are factors always appropriate? Why does this conversion make such a large difference?

Please let me know if more information is necessary to make my question clearer.

No correct solution

OTHER TIPS

This is really a statistical question, but yes, it can make a difference. If R treats the variable as numeric, it accounts for only a single degree of freedom in the model. If the values of the numeric variable are 0, 1, 2, then as a factor it uses two degrees of freedom. This alters the statistical output from the model. The difference in model complexity between the numeric and factor representations increases markedly if you have multiple factors coded numerically or if the variables have more than a few levels.

Whether the increase in the explained sum of squares from including a variable is statistically significant depends on the magnitude of the increase and on the change in model complexity. A numeric representation of a class variable increases the model complexity by a single degree of freedom, whereas the same variable as a factor with k levels uses k-1 degrees of freedom. Hence, for the same improvement in model fit, coding a variable as numeric or as a factor can change whether it appears to have a significant effect on the response.
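To see the degrees-of-freedom difference directly, here is a minimal sketch on simulated data. The data frame `dat`, the columns `x` and `y`, and the effect size are all made up for illustration; they are not the asker's data.

```r
# Simulated data (hypothetical): three groups coded 0, 1, 2, ten observations each
set.seed(1)
dat <- data.frame(x = rep(c(0, 1, 2), each = 10))
dat$y <- 2 + 0.5 * dat$x + rnorm(nrow(dat))

# x left as numeric: one slope is estimated, so the term uses 1 degree of freedom
fit_num <- aov(y ~ x, data = dat)

# x as an unordered factor: k - 1 = 2 degrees of freedom for k = 3 levels
fit_fac <- aov(y ~ factor(x), data = dat)

# x as an ordered factor: the contrasts change, but the term still uses 2 df
fit_ord <- aov(y ~ factor(x, ordered = TRUE), data = dat)

# Compare the Df column across the three ANOVA tables
summary(fit_num)
summary(fit_fac)
summary(fit_ord)
```

In the ANOVA tables, the two factor fits should report the same F statistic for the term, which matches the observation in the question that ordered and unordered factors agree with each other but not with the numeric coding.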

Conceptually, the models based on numerics and on factors differ. With factors, you have a small set of groups or classes that have been sampled, and the aim is to see whether the response differs between these groups. The model is fixed on the set of sampled groups; you can only predict for the groups that were observed. With numerics, you are saying that the response varies linearly with the numeric variable(s). From the fitted model you can predict for new values of the numeric variable that were not observed.
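The prediction difference can be sketched as follows, again on simulated data with made-up names (`dat`, `x`, `xf`, `y`). The sketch uses lm() rather than aov() only to keep the predict() calls plain; the comparison between the two codings is the same.

```r
# Simulated data (hypothetical): three groups coded 0, 1, 2
set.seed(2)
dat <- data.frame(x = rep(c(0, 1, 2), each = 10))
dat$y <- 2 + 0.5 * dat$x + rnorm(nrow(dat))
dat$xf <- factor(dat$x)

fit_num <- lm(y ~ x, data = dat)   # response assumed linear in x
fit_fac <- lm(y ~ xf, data = dat)  # one mean per observed group

# Numeric coding: a prediction at an unobserved value such as 1.5 is possible
pred_num <- predict(fit_num, newdata = data.frame(x = 1.5))

# Factor coding: predictions at the observed levels are just the group means
pred_fac <- predict(fit_fac, newdata = data.frame(xf = factor(c(0, 1, 2))))
group_means <- tapply(dat$y, dat$xf, mean)

# Factor coding: a level that was never observed cannot be predicted;
# try() captures the resulting error instead of stopping the script
pred_new <- try(
  predict(fit_fac, newdata = data.frame(xf = factor(1.5))),
  silent = TRUE
)
inherits(pred_new, "try-error")
```

Whether interpolating between the codes 0, 1 and 2 is meaningful depends on what the categories represent; the numeric model permits it regardless.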

(Note that the inference for factors described here assumes you are fitting a fixed-effects model. Treating a factor variable as a random effect moves the focus from the exact set of groups sampled to the set of all groups in the population from which the sample was taken.)

Licensed under: CC-BY-SA with attribution