Feature Normalization/Scaling: Prediction Step

https://datascience.stackexchange.com/questions/19854

22-10-2019

Question

I'm just doing a simple linear regression with gradient descent in the multivariate case. Feature normalization/scaling is a standard pre-processing step in this situation, so I take my original feature matrix $X$, organized with features in columns and samples in rows, and transform it to $\tilde{X}$, where, on a column-by-column basis, $$\tilde{X}=\frac{X-\bar{X}}{s_{X}}.$$ Here, $\bar{X}$ is the mean of a column, and $s_{X}$ is the sample standard deviation of that column. Once I've done this, I prepend a column of $1$'s to allow for a constant offset in the $\theta$ vector. So far, so good.
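For concreteness, the column-by-column transformation and the prepended column of $1$'s might be sketched as follows. This is only an illustration under stated assumptions: the data and the helper names `fit_scaler` and `transform` are hypothetical, and the sketch uses only the Python standard library to stay dependency-free (in practice an array library would be the more usual choice).

```python
from statistics import mean, stdev

def fit_scaler(X):
    """Return per-column means and sample standard deviations (rows = samples)."""
    columns = list(zip(*X))
    means = [mean(col) for col in columns]
    # stdev uses the n - 1 denominator, i.e. the sample standard deviation.
    # A constant column would give 0 here and break the division below.
    stds = [stdev(col) for col in columns]
    return means, stds

def transform(X, means, stds):
    """Standardize each column, then prepend a 1 to every row for the offset."""
    return [
        [1.0] + [(x - m) / s for x, m, s in zip(row, means, stds)]
        for row in X
    ]

# Hypothetical data: column 0 is an area, column 1 is a room count.
X = [[50.0, 1.0], [80.0, 2.0], [120.0, 3.0], [150.0, 4.0]]
means, stds = fit_scaler(X)
X_tilde = transform(X, means, stds)
```

Keeping `means` and `stds` around as separate values matters, because the same numbers are needed again at prediction time.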

If I did not do feature normalization, then my prediction, once I found my $\theta$ vector, would simply be $x\cdot\theta$, where $x$ is the location at which I want to predict the outcome. But now, if I am doing feature normalization, what does the prediction look like? I suppose I could take my location $x$ and transform it according to the above equation on an element-by-element basis. But then what? The outcome of $\tilde{x}\cdot\theta$ would not be in my desired engineering units. Moreover, how do I know that the $\theta$ vector I've generated via gradient descent is correct for the un-transformed locations? I realize all of this is a moot point if I'm using the normal equation, since feature scaling is unnecessary in that case. However, as gradient descent typically works better for very large feature sets ($> 10k$ features), this would seem to be an important step. Thank you for your time!

Solution

I have learned what the correct answer is: you have to transform your prediction location in precisely the same way as the columns of the matrix $X$, using the column means and standard deviations computed from the training data (not recomputed from the new point). First, subtract the means element-wise from each component of the prediction location $x$. Second, divide element-wise by the standard deviations. Third, prepend a $1$ to allow for the bias. Finally, take the dot product with your $\theta$ vector. Since the $\theta$ vector was calculated on transformed data, it is meant to operate on transformed data. Because only the features were scaled and the target values were left untouched, $\tilde{x}\cdot\theta$ is already in the engineering units of the outcome; the scaling factors are absorbed into the components of $\theta$. Reference: see Week 2, FAQ, Question 8 in Andrew Ng's Machine Learning course on Coursera. (Login is probably necessary.)
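The steps above can be sketched end to end as follows. This is a minimal illustration, not a definitive implementation: it assumes plain batch gradient descent, an unscaled target, and hypothetical training data generated exactly from $y = 10 + 2x_1 + 30x_2$; the learning rate, iteration count, and all function names are choices made for the example. The helper `to_raw_units` also addresses the worry about un-transformed locations. Substituting $\tilde{x}_j = (x_j - \bar{X}_j)/s_j$ into $\tilde{x}\cdot\theta$ shows that the fitted model is equivalent to one on raw features with slopes $\theta_j / s_j$ and intercept $\theta_0 - \sum_j \theta_j \bar{X}_j / s_j$.

```python
from statistics import mean, stdev

def standardize_row(x, means, stds):
    """Scale one sample with the TRAINING means/stds and prepend the bias 1."""
    return [1.0] + [(v - m) / s for v, m, s in zip(x, means, stds)]

def gradient_descent(X, y, alpha=0.1, iterations=5000):
    """Batch gradient descent for least-squares linear regression."""
    n, d = len(X), len(X[0])
    theta = [0.0] * d
    for _ in range(iterations):
        errors = [sum(t * v for t, v in zip(theta, row)) - yi
                  for row, yi in zip(X, y)]
        grad = [sum(e * row[j] for e, row in zip(errors, X)) / n
                for j in range(d)]
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta

def predict(x, theta, means, stds):
    """Transform x exactly like the training columns, then dot with theta."""
    return sum(t * v for t, v in zip(theta, standardize_row(x, means, stds)))

def to_raw_units(theta, means, stds):
    """Equivalent [intercept, slopes...] that act on un-transformed features."""
    slopes = [t / s for t, s in zip(theta[1:], stds)]
    intercept = theta[0] - sum(b * m for b, m in zip(slopes, means))
    return [intercept] + slopes

# Hypothetical training data, generated from y = 10 + 2*x1 + 30*x2.
X_raw = [[50.0, 1.0], [80.0, 3.0], [120.0, 2.0], [150.0, 4.0], [100.0, 5.0]]
y = [10 + 2 * a + 30 * b for a, b in X_raw]

# Statistics come from the training set only and are reused for prediction.
columns = list(zip(*X_raw))
means = [mean(c) for c in columns]
stds = [stdev(c) for c in columns]

X_tilde = [standardize_row(row, means, stds) for row in X_raw]
theta = gradient_descent(X_tilde, y)

x_new = [90.0, 2.0]
y_hat = predict(x_new, theta, means, stds)
print(y_hat)
print(to_raw_units(theta, means, stds))
```

Since the data were generated without noise, the prediction at `x_new` should come out close to 250, and the raw-unit coefficients close to 10, 2, and 30, which is a quick check that $\theta$ fitted on scaled data gives the same model as one expressed in the original units.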

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange