I'm anticipating that I'm missing something glaringly obvious here.
I'm trying to build a demonstration of overfitting. I've got a quadratic generating function from which I've drawn 20 samples, and I now want to fit polynomial linear models of increasing degree to the sampled data.
For some reason, regardless which model I use, every time I run predict I get N predictions back, where N is the number of records used to train my model.
set.seed(123)
N=20
xv = seq(1,5,length.out=1e4)
x=sample(xv,N)
gen=function(v){v^2 + 2*rnorm(length(v))}
y=gen(x)
df = data.frame(x,y)
# convenience function for building formulas for polynomial regression
build_formula = function(N){
fpart = paste(lapply(2:N, function(i) {paste('+ poly(x,',i,',raw=T)')} ), collapse="")
paste('y~x',fpart)
}
## Example:
## build_formula(4)="y~x + poly(x, 2 ,raw=T)+ poly(x, 3 ,raw=T)+ poly(x, 4 ,raw=T)"
model = lm(build_formula(10), data=df)
predict(model, data=xv) # returns 20 values instead of 1000
predict(model, data=1) # even *this* spits out 20 results. WTF?
This behavior is present regardless of the degree of polynomial in the formula, including the trivial case 'y~x':