Question

I'm running a multivariate regression on some trees data.

trees
   Index  DBH Height Merch.Vol.
1      1  8.3     70       10.3
2      2  8.6     65       10.3
3      3  8.8     63       10.2
4      4 10.5     72       16.4
5      5 10.7     81       18.8
6      6 10.8     83       19.7
7      7 11.0     66       15.6
8      8 11.0     75       18.2
9      9 11.1     80       22.6
10    10 11.2     75       19.9
11    11 11.3     79       24.2
12    12 11.4     76       21.0
13    13 11.4     76       21.4
14    14 11.7     69       21.3
15    15 12.0     75       19.1
16    16 12.9     74       22.2
17    17 12.9     85       33.8
18    18 13.3     86       27.4
19    19 13.7     71       25.7
20    20 13.8     64       24.9
21    21 14.0     78       34.5
22    22 14.2     80       31.7
23    23 14.5     74       36.3
24    24 16.0     72       38.3
25    25 16.3     77       42.6
26    26 17.3     81       55.4
27    27 17.5     82       55.7
28    28 17.9     80       58.3
29    29 18.0     80       51.5
30    30 18.0     80       51.0
31    31 20.6     87       77.0
attach(trees)

I can run the regression easily, but I'm having trouble with prediction. I am removing 3 observations randomly and rerunning the regression, then predicting for those three observations in order to calculate MAPE.

g = sample(2:31,3);g
mbreg = lm(trees$Merch.Vol[-g]~DBH[-g]+Height[-g])
p2 = predict(mbreg,trees[g,2:3])
MAPE[2] = MAPE[2] + sum(abs((trees$Merch.Vol[g]-p2)/trees$Merch.Vol[g]))/3

j = sample(2:31,3);j
mLR = lm(log(trees$Merch.Vol[-j])~log(DBH[-j])+log(Height[-j]))
p4 = exp(predict(mLR,trees[j,2:3]))
MAPE[4] = MAPE[4] + sum(abs((trees$Merch.Vol[j]-p4)/trees$Merch.Vol[j]))/3

This works as I would expect about 80% of the time, returning three predicted values for the three removed observations. But occasionally I get the warning:

Warning message:
'newdata' had 3 rows but variable(s) found have 2 rows 

I don't know where this comes from, as the code works most of the time and I don't have any object that has 2 rows. I have 3 separate calculations like this that each use the trees data. I tried to keep them separate with no common variables, but could they be interfering with each other anyway? Does the warning result from the sampling of g? Is there a better way to remove observations or do multivariate prediction? Thank you.

P.S. - Also, when I attach trees, I still can't directly call Merch.Vol without trees$Merch.Vol though I can call DBH and Height by themselves. Not a big deal, but if there is an obvious explanation (I'm sure) I'd like to hear it.


Solution

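The warning stems from subsetting the data inside the formula in the lm() call, although it is the predict() call that actually raises it. The fitted model stores the terms DBH[-g] and Height[-g], so predict() applies the same [-g] subsetting to newdata. Let's have an example: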

# Data
trees<-structure(list(Index = 1:31, DBH = c(8.3, 8.6, 8.8, 10.5, 10.7, 
10.8, 11, 11, 11.1, 11.2, 11.3, 11.4, 11.4, 11.7, 12, 12.9, 12.9, 
13.3, 13.7, 13.8, 14, 14.2, 14.5, 16, 16.3, 17.3, 17.5, 17.9, 
18, 18, 20.6), Height = c(70L, 65L, 63L, 72L, 81L, 83L, 66L, 
75L, 80L, 75L, 79L, 76L, 76L, 69L, 75L, 74L, 85L, 86L, 71L, 64L, 
78L, 80L, 74L, 72L, 77L, 81L, 82L, 80L, 80L, 80L, 87L), Merch.Vol. = c(10.3, 
10.3, 10.2, 16.4, 18.8, 19.7, 15.6, 18.2, 22.6, 19.9, 24.2, 21, 
21.4, 21.3, 19.1, 22.2, 33.8, 27.4, 25.7, 24.9, 34.5, 31.7, 36.3, 
38.3, 42.6, 55.4, 55.7, 58.3, 51.5, 51, 77)), .Names = c("Index", 
"DBH", "Height", "Merch.Vol"), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", 
"14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", 
"25", "26", "27", "28", "29", "30", "31"))

# This gives the warning (and only 2 predictions instead of 3)
g = c(3, 19, 5)
mbreg = lm(Merch.Vol[-g]~DBH[-g]+Height[-g], data=trees)
p2 = predict(mbreg,trees[g,2:3])

# This will work
# Notice that the object trees2 will contain the new, sampled dataset
# The model is then fitted on the dataset trees2
g = c(3, 19, 5)
trees2<-trees[-g,]
mbreg = lm(Merch.Vol~DBH+Height, data=trees2)
p2 = predict(mbreg,trees[g,2:3])
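
As for why the warning only appears some of the time: with [-g] inside the formula, predict() evaluates DBH[-g] on the 3-row newdata. Negative indices beyond the length of a vector are silently ignored, so a row is dropped only when g happens to contain 1, 2 or 3. A minimal sketch of that mechanism, where x is a hypothetical stand-in for the 3-row newdata column:

```r
x <- c(8.8, 13.7, 10.7)        # stand-in for the 3-row newdata$DBH

kept_a <- x[-c(3, 19, 5)]      # drops element 3; 19 and 5 are out of range and ignored
kept_b <- x[-c(10, 19, 25)]    # nothing in range to drop, all 3 values survive

length(kept_a)                 # 2
length(kept_b)                 # 3

# Chance that a sample of 3 values from 2:31 contains 2 or 3
p_warn <- 1 - choose(28, 3) / choose(30, 3)
round(p_warn, 3)               # 0.193
```

That probability of roughly 19% is consistent with the code appearing to work about 80% of the time.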

Subsetting (or sampling) the data into a new object before fitting the model removes the warning, because the formula then contains only plain column names that predict() can look up in newdata. You might want to change your code example to:

g = sample(2:31,3);g
trees2<-trees[-g,]
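# The response must come from trees2 as well; use the exact column name
# (Merch.Vol. with a trailing dot if that is what names(trees) shows)
mbreg = lm(Merch.Vol~DBH+Height, data=trees2)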
p2 = predict(mbreg,trees[g,2:3])
MAPE[2] = MAPE[2] + sum(abs((trees$Merch.Vol[g]-p2)/trees$Merch.Vol[g]))/3
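
If the hold-out step is repeated many times, it may be easier to wrap it in a function. The following is only a sketch under some assumptions: holdout_mape is a hypothetical helper name, the data are taken from the built-in datasets::trees (which appears to hold the same measurements under the names Girth, Height and Volume, renamed here), and rows are sampled from all 31 observations rather than from 2:31.

```r
# Assumption: datasets::trees matches the data in the question
tr <- setNames(datasets::trees, c("DBH", "Height", "Merch.Vol"))

# Hypothetical helper: average MAPE over repeated random hold-outs
holdout_mape <- function(data, n_out = 3, reps = 100) {
  errs <- numeric(reps)
  for (i in seq_len(reps)) {
    g   <- sample(nrow(data), n_out)                 # rows to hold out
    fit <- lm(Merch.Vol ~ DBH + Height, data = data[-g, ])
    p   <- predict(fit, newdata = data[g, ])         # one prediction per held-out row
    errs[i] <- mean(abs((data$Merch.Vol[g] - p) / data$Merch.Vol[g]))
  }
  mean(errs)
}

set.seed(1)                                          # only for reproducibility
mape_linear <- holdout_mape(tr)
mape_linear
```

Because the subsetting happens in the data argument and not in the formula, predict() always returns exactly n_out values.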

In addition, I'd suggest not using the attach command here at all. An alternative is the data argument in the call to lm(). This argument tells lm() to look up the variables mentioned in the formula in the named object (see the example above, and ?lm in R).
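
The same approach should carry over to the log-log model from the question. Transformations such as log() are fine inside a formula, since predict() simply reapplies them to newdata; only the row subsetting needs to move out. A sketch, assuming the built-in datasets::trees holds the same measurements (its columns Girth, Height and Volume are renamed here to match):

```r
# Assumption: datasets::trees matches the data in the question
tr <- setNames(datasets::trees, c("DBH", "Height", "Merch.Vol"))

j <- c(3, 19, 5)                                   # held-out rows, as in the example above
fit_log <- lm(log(Merch.Vol) ~ log(DBH) + log(Height), data = tr[-j, ])

# Predictions are on the log scale, so transform back with exp()
p4 <- exp(predict(fit_log, newdata = tr[j, ]))
length(p4)                                         # 3, one per held-out row
```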

You mention that after attaching the data you still can't call Merch.Vol directly. If you look closely at the column names, you'll notice that the column is actually called Merch.Vol., with an extra dot at the end. After attach(), a column can only be found by its exact name, so Merch.Vol. works but Merch.Vol does not. The dollar ($) operator, by contrast, uses partial matching: even if you don't have a column called D in your data, trees$D will return the values from the DBH column, because DBH is the only column name that starts with D. That's why trees$Merch.Vol also works, even though the column name is not typed exactly.
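
A small illustration of the difference between partial and exact matching, using a hypothetical two-row data frame df with the same column names as in the question:

```r
df <- data.frame(DBH = c(8.3, 8.6), Height = c(70, 65), Merch.Vol. = c(10.3, 10.3))

names(df)            # the last name ends with a dot
df$Merch.Vol         # $ matches partially: returns the Merch.Vol. column
df$D                 # partial matching again: returns the DBH column
df[["Merch.Vol"]]    # [[ ]] matches exactly by default: NULL

# Looking the name up as a variable (as happens after attach) needs the exact name
exact_lookup <- tryCatch(with(df, Merch.Vol), error = function(e) "not found")
exact_lookup
```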

Licensed under: CC-BY-SA with attribution