سؤال

Following is what I learned about the process followed during building and pruning a decision tree, mathematically (from Introduction to Machine Learning by Gareth James et al.):

  1. Use recursive binary splitting to grow a large tree on the training data, stopping only when each terminal node has fewer than some minimum number of observations.
  2. Apply cost complexity pruning to the large tree in order to obtain a sequence of best subtrees, as a function of α.
  3. Use K-fold cross-validation to choose α. That is, divide the training observations into K folds. For each k = 1, . . .,K: (a) Repeat Steps 1 and 2 on all but the kth fold of the training data. (b) Evaluate the mean squared prediction error on the data in the left-out kth fold, as a function of α. Average the results for each value of α, and pick α to minimize the average error.
  4. Return the subtree from Step 2 that corresponds to the chosen value of α. Error rate vs Leafs

My question: Going through the above algorithm means we automatically choose the most optimum tree based on minimum average error. Why then, do we have to select maximum depth as the pruning parameter. Shouldn't optimum depth be decided by the algorithm itself and not us?

If anything, shouldn't we be choosing 'K' here, as in how many-fold validation would we like to do?

هل كانت مفيدة؟

المحلول

Can you clarify pls, where is the suggestion, that we have to select max depth for pruning? As you said it is supposed to be done automatically due to some criterion. Here is some example of post-pruning:

https://stackoverflow.com/questions/49428469/pruning-decision-trees

Yes, we should select 'K' fold to leave some data for the test of pruning efficiency. How many? It depends on your data. You can check out this Stanford lecture slides, for example, where K = 10 is suggested:

https://web.stanford.edu/class/stats202/content/lec19.pdf

as an example. Usually, the K will depend on the size of your data, based on the size of the test fold you want to leave out.

مرخصة بموجب: CC-BY-SA مع الإسناد
لا تنتمي إلى datascience.stackexchange
scroll top