Question

Say I have an instance space with 4 features and I know that a decision tree with 8 nodes can represent the target function I want to learn. I want to give an upper bound on the size of the sample set needed in order to achieve a true error of at most x%.

I found this theorem in a text book.

Let $\mathcal{H}$ be a hypothesis class and let $\epsilon, \delta > 0$. If a training set $S$ of size

$$n\geq\frac{1}{\epsilon}\operatorname{ln}\left(\frac{|\mathcal{H}|}{\delta}\right)$$

is drawn from distribution $\mathcal{D}$, then with probability greater than or equal to $1-\delta$ every $h \in \mathcal{H}$ with true error $\operatorname{err}_{\mathcal{D}}(h) \geq \epsilon$ has training error $\operatorname{err}_S(h) > 0$. Equivalently, with probability greater than or equal to $1-\delta$, every $h \in \mathcal{H}$ with training error zero has true error less than $\epsilon$.

My question is, can I use this theorem to give such an upper bound? If yes, what would the size of my hypothesis space $|\mathcal{H}|$ be in this case? If not, how can I give this upper bound?
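For concreteness, this is how I would plug numbers into the bound once $|\mathcal{H}|$ is known; the value of `H_size` below is only a placeholder, since that is exactly the part I am unsure about:

```python
import math

def sample_bound(h_size, epsilon, delta):
    """Training set size from the theorem: n >= (1/epsilon) * ln(|H| / delta)."""
    return math.ceil((1.0 / epsilon) * math.log(h_size / delta))

# Placeholder values: H_size stands in for whatever |H| turns out to be
# for 8-node decision trees over 4 features; epsilon = 5% true error,
# delta = 5% failure probability.
H_size = 10**6
epsilon, delta = 0.05, 0.05
print(sample_bound(H_size, epsilon, delta))  # prints 337 for these placeholder numbers
```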


OTHER TIPS

As far as I understand this theorem (not much), it doesn't directly give a relation between $n$ (the size of the training set) and the error rate of the learned model. The main relation it states is between the training error and the true error of a hypothesis, under the assumptions that the training set has at least the stated minimum size and that it is drawn from the true distribution $\mathcal{D}$.

I don't think it's possible to give an absolute upper bound on the number of samples needed in the training set for at least the following reasons:

  • The success of the learning depends on which instances end up in the training set: the sample has to be representative enough for the model to capture the distribution correctly. For instance, it could happen by chance that all $n$ instances have the same value for feature $x$, in which case the model would never know that feature $x$ can take another value (see the sketch after this list). In general there is no way to be certain that the training set contains all the diversity the model needs.
  • There's also the question of what the learning algorithm does exactly: if one wants to prove that a particular DT learning algorithm (say C4.5) can learn a particular model with fewer than $n$ instances (assuming these $n$ instances are sufficient), the proof must involve the particular properties of that algorithm. Otherwise the claim is obviously false: an algorithm that builds a tree at random is clearly not guaranteed to find the right model, no matter how many instances it sees.
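To illustrate the first point: if some value of feature $x$ appears with probability $p$ under $\mathcal{D}$, the probability that an i.i.d. training set of size $n$ misses it entirely is $(1-p)^n$, which is positive for every finite $n$. A quick sketch (the values of `p` and `n` are arbitrary, just to show the effect):

```python
# Probability that a feature value occurring with probability p under D
# never appears in an i.i.d. sample of size n: (1 - p)**n.
for p in (0.10, 0.01, 0.001):
    for n in (100, 1000, 10000):
        miss = (1 - p) ** n
        print(f"p={p:<6} n={n:<6} P(value never seen) = {miss:.3g}")
```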

I think it's more feasible and more common to provide an experimental upper bound: run many cross-validation experiments, varying the training set size and possibly other parameters. Based on these results one can then estimate the probability of reaching the correct model for a given training set size and parameter setting.
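A minimal sketch of such an experiment, assuming scikit-learn, a synthetic dataset, and an 8-leaf decision tree as a rough stand-in for the tree in the question (the estimator, the data and the grid of training sizes are all arbitrary choices):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 4 features, standing in for the real instance space.
X, y = make_classification(n_samples=5000, n_features=4, n_informative=4,
                           n_redundant=0, random_state=0)

# Cross-validated accuracy as a function of the training set size.
sizes, train_scores, test_scores = learning_curve(
    DecisionTreeClassifier(max_leaf_nodes=8, random_state=0),
    X, y,
    train_sizes=np.linspace(0.05, 1.0, 10),
    cv=10,
    scoring="accuracy",
)

for n, scores in zip(sizes, test_scores):
    print(f"n={n:>5}  mean test accuracy = {scores.mean():.3f} "
          f"(+/- {scores.std():.3f})")
```

Reading off the smallest training size at which the test accuracy reaches the desired level gives an empirical estimate of the required sample size, rather than a theoretical guarantee.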

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange