Question

I have a CSV file (298 rows and 24 columns) and I want to create a decision tree to predict the column "salary". I have installed the tree package and loaded it with library(tree).

But when I try to create the decision tree:

model<-tree(salary~.,data)

I get the error below:

Error in tree(salary ~ ., data) :
  factor predictors must have at most 32 levels

What is wrong with that? The data is as follows:

      Name bat hit homeruns runs
1   Alan Ashby 315  81        7   24
2  Alvin Davis 479 130       18   66
3 Andre Dawson 496 141       20   65
...
team position putout assists errors
1 Hou.        C    632      43     10
2 Sea.       1B    880      82     14
3 Mon.       RF    200      11      3
salary league87 team87
1    475        N   Hou.
2    480        A   Sea.
3    500        N   Chi.

And here is the output of str(data):

'data.frame':   263 obs. of  24 variables:
 $ Name                     : Factor w/ 263 levels "Al Newman","Alan Ashby",..: 2 7 8 10 6 1 13 11 9 3 ...
 $ bat                      : int  315 479 496 321 594 185 298 323 401 574 ...
 $ hit                      : int  81 130 141 87 169 37 73 81 92 159 ...
 $ homeruns                 : int  7 18 20 10 4 1 0 6 17 21 ...
 $ runs                     : int  24 66 65 39 74 23 24 26 49 107 ...
 $ runs.batted              : int  38 72 78 42 51 8 24 32 66 75 ...
 $ walks                    : int  39 76 37 30 35 21 7 8 65 59 ...
 $ years.in.major.leagues   : int  14 3 11 2 11 2 3 2 13 10 ...
 $ bats.during.career       : int  3449 1624 5628 396 4408 214 509 341 5206 4631 ...
 $ hits.during.career       : int  835 457 1575 101 1133 42 108 86 1332 1300 ...
 $ homeruns.during.career   : int  69 63 225 12 19 1 0 6 253 90 ...
 $ runs.during.career       : int  321 224 828 48 501 30 41 32 784 702 ...
 $ runs.batted.during.career: int  414 266 838 46 336 9 37 34 890 504 ...
 $ walks.during.career      : int  375 263 354 33 194 24 12 8 866 488 ...
 $ league                   : Factor w/ 2 levels "A","N": 2 1 2 2 1 2 1 2 1 1 ...
 $ division                 : Factor w/ 2 levels "E","W": 2 2 1 1 2 1 2 2 1 1 ...
 $ team                     : Factor w/ 24 levels "Atl.","Bal.",..: 9 21 14 14 16 14 10 1 7 8 ...
 $ position                 : Factor w/ 23 levels "1B","1O","23",..: 10 1 20 1 22 4 22 22 13 22 ...
 $ putout                   : int  632 880 200 805 282 76 121 143 0 238 ...
 $ assists                  : int  43 82 11 40 421 127 283 290 0 445 ...
 $ errors                   : int  10 14 3 4 25 7 9 19 0 22 ...
 $ salary                   : num  475 480 500 91.5 750 ...
 $ league87                 : Factor w/ 2 levels "A","N": 2 1 2 2 1 1 1 2 1 1 ...
 $ team87                   : Factor w/ 24 levels "Atl.","Bal.",..: 9 21 5 14 16 13 10 1 7 8 ...


Solution

The issue is almost certainly that you're including the Name variable in your model, as it has too many factor levels (263, one per player). I would also remove it from a methodological standpoint, but this probably isn't the place for that discussion. Try:

train <- data
train$Name <- NULL
model <- tree(salary ~ ., train)
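
If you are not sure which columns are the culprits, counting the levels of every factor column makes it obvious. A minimal sketch, assuming data is your data frame as loaded from the CSV:

# count the levels of every factor column -- anything above 32 will break tree()
sapply(Filter(is.factor, data), nlevels)

# keep only columns that are not factors, or factors with at most 32 levels
ok <- sapply(data, function(x) !is.factor(x) || nlevels(x) <= 32)
train <- data[, ok]
model <- tree(salary ~ ., data = train)

Here only Name (263 levels) exceeds the limit, so this is equivalent to dropping it as above.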

OTHER TIPS

It seems that your salary is a factor vector, while you are trying to perform a regression, so it should be a numeric vector. Simply convert your salary to numeric, and it should work just fine. For more details read the library's help:

http://cran.r-project.org/web/packages/tree/tree.pdf

Usage

tree(formula, data, weights, subset,
     na.action = na.pass, control = tree.control(nobs, ...),
     method = "recursive.partition",
     split = c("deviance", "gini"),
     model = FALSE, x = FALSE, y = TRUE, wts = TRUE, ...)

Arguments

formula A formula expression. The left-hand-side (response) should be either a numerical vector when a regression tree will be fitted or a factor, when a classification tree is produced. The right-hand-side should be a series of numeric or factor variables separated by +; there should be no interaction terms. Both . and - are allowed: regression trees can have offset terms. (...)
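
For illustration, and assuming train is the data frame from the solution above with Name removed, a numeric response produces a regression tree and a factor response produces a classification tree:

library(tree)
# regression tree: salary is numeric
reg.tree <- tree(salary ~ bat + hit + homeruns + runs, data = train)
# classification tree: league is a factor with two levels
cls.tree <- tree(league ~ bat + hit + homeruns + runs, data = train)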

Depending on what exactly is stored in your salary variable, the conversion can be more or less tricky, but this should generally work:

data$salary <- as.numeric(levels(data$salary))[data$salary]
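
Note that as.numeric(salary) alone would return the internal level codes rather than the stored values; indexing into levels() avoids that. A small illustration with made-up values:

s <- factor(c("500", "91.5", "475"))
as.numeric(s)              # 2 3 1   -- internal level codes
as.numeric(levels(s))[s]   # 500.0  91.5 475.0   -- the actual numbers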

EDIT

As pointed out in the comments, the actual error refers to a predictor column in data, not to salary. If that column really holds numerical data stored as a factor, it can likewise be converted to numeric to solve the issue; if it has to stay a factor, you will need another model or will have to reduce the number of levels. You can also convert such factors to a numerical format by hand (for example by defining one binary feature per level), but this enlarges your input space by one column per level.
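
If you do want to build such binary (dummy) features by hand, model.matrix does it for you. A sketch, assuming team is the factor you want to expand:

# one indicator column per non-reference level of team
team.dummies <- model.matrix(~ team, data = data)[, -1]
dim(team.dummies)   # 263 rows, 23 columns (24 levels minus the reference)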

EDIT2

It seems that you first have to decide what you are trying to model. You want to predict salary, but based on what? Your data consists of players' records, so their names are certainly the wrong kind of data to use for this prediction (and in particular, the Name column is what triggers the 32-levels error). You should remove from data every column that should not be used for building the prediction. There is no information in the question about the exact aim, so I can only guess that you are trying to predict a player's salary from his/her stats; in that case you should remove the players' names and teams from the input, and obviously not use salary itself as a predictor (predicting X using X is not a good idea ;)).
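
Putting that together, a minimal sketch (column names taken from the str() output above; adjust the list of dropped columns to whatever you consider legitimate predictors):

library(tree)

train <- data
train$Name   <- NULL   # 263 levels -- this is what triggers the error
train$team   <- NULL   # team identifiers, arguably not player stats
train$team87 <- NULL
# salary stays in train as the response; salary ~ . excludes it from the predictors
model <- tree(salary ~ ., data = train)
summary(model)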

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow