very simple anova model - compare variables with factors vs different vectors [closed]

https://stackoverflow.com/questions/13329783

r
anova

28-11-2021
|

Question

I think my question is best understood with a little code:

#Load data
b <- structure(list(s1 = c(18.5, 24, 17.2, 19.9, 18), s2 = c(26.3, 
     25.3, 24, 21.2, 24.5), s3 = c(20.6, 25.2, 20.8, 24.7, 22.9), 
     s4 = c(25.5, 19.9, 22.6, 17.5, 20.4)), .Names = c("s1", "s2", 
     "s3", "s4"), row.names = c(NA, -5L), class = "data.frame")

# Model A
# One way (the wrong way) to test wether s1,s2,s3,s4 differs:
summary(aov(s1~s2+s3+s4, data=b))
# R does not complain here - but I don't know what I am doing. I guess I am trying
# to explain the variance in s1, with the variable s2,s3 and s4.
# I am not sure how this actually is different from a proper anova (see below).
# Also I dont understand why the Sum of Squares for s3 is much larger than the sum of    
# squares for s2 and s4.

# Model B
# The correct way to do it (requires reshape)
# install.packages('reshape')
# library(reshape)

summary(aov(value ~variable, data=melt(b)))
# This is correct - I am here testing variation within the factors of 'variable',
# to explain variation in 'value'.
# Doing 
TukeyHSD(aov(value ~variable, data=melt(b)))
# shows me that s1 is significantly different from s2.
# My way of thinking is that this result should be evident from "model A"
# What does Sum of Squares in model A mean? - why is it so big for s3?

So as from the comments in the code above: I am asking for an explanation of how and why Model A is wrong.

Solution

Model A is not ANOVA. You are modelling one response variable (s1) using s2, s3 and s4 as predictors; this is analysis of covariance here. The reason why it is so big for s3 will become apparent if you plot a correlation matrix; cor( b ) will show you

            s1          s2         s3         s4
s1  1.00000000  0.08009315  0.7929146 -0.4200527
s2  0.08009315  1.00000000 -0.4433499  0.7846037
s3  0.79291464 -0.44334985  1.0000000 -0.8725241
s4 -0.42005268  0.78460371 -0.8725241  1.0000000

You cannot compare it with the model B, where you treat s1 as predictor rather than response variable, and your response variable is the class (s1, s2, s3 or s4).

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow