Question

I am working through the Russell and Norvig AI book and came across the following on the top of page 706.

The section concerns decision tree pruning and testing a given attribute against the null hypothesis. The example at hand deals with a binary classification problem: splitting on an attribute with $d$ distinct values partitions the training examples into subsets, and $p_k$ and $n_k$ denote the numbers of positive ("yes") and negative ("no") examples in the subset of training examples for which the attribute takes its $k$th value $v_k$. Furthermore, $p$ and $n$ simply denote the total numbers of positive and negative examples in the training data.

The authors state the following:

We can measure the deviation by comparing the actual numbers of positive and negative examples in each subset $p_k$ and $n_k$ with the expected numbers $\hat{p}_k$ and $\hat{n}_k$, assuming true irrelevance:

\begin{align*} \hat{p}_k&=p\times \frac{p_k+n_k}{p+n}\\ \hat{n}_k&=n\times \frac{p_k+n_k}{p+n}\,. \end{align*}

A convenient measure of the total deviation is given by

$$\Delta = \sum_{k=1}^d \frac{(p_k-\hat{p}_k)^2}{\hat{p}_k} + \frac{(n_k-\hat{n}_k)^2}{\hat{n}_k}\,.$$
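To make the formula concrete, here is a minimal Python sketch that computes $\hat{p}_k$, $\hat{n}_k$, and $\Delta$ for an attribute with $d = 2$ values (the counts are made up for illustration, not taken from the book):

```python
# Hypothetical counts: p_counts[k] and n_counts[k] are the positive and
# negative examples in the subset where the attribute takes value v_k.
p_counts = [6, 4]
n_counts = [1, 4]

p = sum(p_counts)  # total positives: 10
n = sum(n_counts)  # total negatives: 5

delta = 0.0
for pk, nk in zip(p_counts, n_counts):
    # Fraction of all examples that landed in this subset.
    subset_fraction = (pk + nk) / (p + n)
    p_hat = p * subset_fraction  # expected positives under irrelevance
    n_hat = n * subset_fraction  # expected negatives under irrelevance
    delta += (pk - p_hat) ** 2 / p_hat + (nk - n_hat) ** 2 / n_hat

print(delta)  # total deviation of observed counts from expected counts
```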

My questions:

I think I understand the use of the estimators. We are essentially assuming that, under the null hypothesis of irrelevance, the overall proportions of positives and negatives hold in every subset. So within the subset of training examples whose attribute value equals $v_k$, we scale the overall $p$ and $n$ by the proportion of all examples that fall in that subset. For instance, if $p=10$ and $n=5$, we assume that 2:1 ratio holds in each subset, and so $\hat{p}_k$ is simply that overall proportion scaled by the size of the subset.
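As a quick sanity check of that reading (again with made-up numbers), the expected counts preserve the overall positive-to-negative ratio regardless of the subset size:

```python
p, n = 10, 5          # overall positives and negatives (hypothetical)
subset_size = 6       # p_k + n_k for some hypothetical subset

p_hat = p * subset_size / (p + n)   # expected positives: 10 * 6/15
n_hat = n * subset_size / (p + n)   # expected negatives:  5 * 6/15

# The overall 2:1 ratio is preserved in the expected counts.
assert p_hat / n_hat == p / n
print(p_hat, n_hat)
```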

This also leads me to an understanding of the numerator in the measure of deviation $\Delta$: we use the squared deviation of the observed value from its expectation. My question, then, is why the estimators appear in the denominator. This looks very close to an equation for variance, and I assume that the estimators are somehow normalizing $\Delta$, but how can we justify that?

No correct solution

Licensed under: CC-BY-SA with attribution
Not affiliated with cs.stackexchange