Question

I have a practical question about feature engineering. Say I want to predict house prices using linear regression, with a bunch of features including zip code. Checking the feature importance, I see that zip code is a pretty good feature, so I decide to add more features based on it. For example, I go to the Census Bureau and get the average income, population, number of schools, and number of hospitals for each zip code. With these four new features, the model performs better. So I add even more zip-related features, and the cycle goes on and on. Eventually the model will be dominated by these zip-related features, right?

My questions:

  1. Does it make sense to do this in the first place?
  2. If yes, how do I know when to stop this cycle?
  3. If not, why not?
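
To make the cycle concrete, here is a minimal sketch of one possible stopping rule: keep a new zip-level feature only if it lowers cross-validated error by more than a small margin. Everything in it is an assumption for illustration: the data is synthetic, the feature names (`zip_income`, `zip_schools`, `zip_noise`), the helper `cv_rmse`, and the 1% threshold `min_gain` are all made up, and plain least squares stands in for whatever model is actually used.

```python
import numpy as np

def cv_rmse(X, y, k=5, seed=0):
    """K-fold cross-validated RMSE of an ordinary least-squares fit."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    X1 = np.column_stack([np.ones(len(y)), X])  # add an intercept column
    errs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coef, *_ = np.linalg.lstsq(X1[train], y[train], rcond=None)
        resid = y[test] - X1[test] @ coef
        errs.append(np.sqrt(np.mean(resid ** 2)))
    return float(np.mean(errs))

# Synthetic data (invented for this sketch, not real census figures).
rng = np.random.default_rng(42)
n_zips, n_houses = 50, 2000
zip_id = rng.integers(0, n_zips, n_houses)   # which zip each house is in
sqft = rng.normal(1500, 400, n_houses)       # house-level feature

# Zip-level tables, as if looked up from an external source.
zip_tables = {
    "zip_income": rng.normal(60, 15, n_zips),  # actually drives price below
    "zip_schools": rng.normal(5, 2, n_zips),   # unrelated to price here
    "zip_noise": rng.normal(0, 1, n_zips),     # unrelated to price here
}
price = (0.1 * sqft
         + 2.0 * zip_tables["zip_income"][zip_id]
         + rng.normal(0, 20, n_houses))

# Forward pass: try each candidate, keep it only if held-out RMSE
# improves by more than min_gain (relative).
min_gain = 0.01
X = sqft[:, None]
baseline = cv_rmse(X, price)
best, kept = baseline, []
for name, table in zip_tables.items():
    candidate = np.column_stack([X, table[zip_id]])  # join on zip code
    score = cv_rmse(candidate, price)
    if (best - score) / best > min_gain:
        X, best = candidate, score
        kept.append(name)

print("baseline RMSE:", round(baseline, 2))
print("final RMSE:", round(best, 2), "kept:", kept)
```

Under these assumptions the cycle ends once new zip-level columns stop improving held-out error, which is one possible reading of "a good time to stop". The threshold and the order in which candidates are tried are design choices, not fixed rules.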

No correct solution

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange