Question

I have a dataset in which one of the (important) features is the geographic distance from NYC. Of course, some of the values are missing. The goal is to predict whether people with certain attributes (proximity being one of them, and the typical age, sex, education, etc. being the others) will engage in an activity in NYC (e.g., visiting MoMA, seeing a Broadway show, moving to the city altogether, enrolling in an area school, things like that).

My basic question, missing values aside, is this: is it correct to just take distances into account "as is", or should they somehow be divided between "driving/train distances" and "flying distances" (essentially, converting them into "the number of hours it takes someone to get to NYC by the most efficient means")?

Take Los Angeles and Richmond, VA as examples: the distance to NYC from LA is about 10 times that from Richmond, yet the flight from LA takes only about 4 times as long as the flight from Richmond, and the flight time from LA is approximately the same as the drive time from Richmond. So what's the right way to think about that?
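To make the "hours by the most efficient means" idea concrete, here is a minimal sketch of what such a conversion might look like. The function name `travel_hours`, the speeds, the fixed flight overhead, and the rounded city distances are all illustrative assumptions, not measured values or an established method.

```python
# Hypothetical constants: rough assumptions for illustration only.
DRIVE_SPEED_MPH = 55.0     # assumed average driving speed
FLIGHT_SPEED_MPH = 500.0   # assumed average speed in the air
FLIGHT_OVERHEAD_H = 3.0    # assumed airport transfer, security, boarding


def travel_hours(distance_miles: float) -> float:
    """Estimated hours to reach NYC by the faster of driving or flying."""
    drive = distance_miles / DRIVE_SPEED_MPH
    fly = FLIGHT_OVERHEAD_H + distance_miles / FLIGHT_SPEED_MPH
    return min(drive, fly)


# Rounded, approximate straight-line distances to NYC in miles (assumed).
examples = {"Richmond, VA": 290.0, "Los Angeles, CA": 2450.0}

for city, miles in examples.items():
    print(f"{city}: {miles:.0f} mi, about {travel_hours(miles):.1f} h")
```

Under these assumptions the transformed feature grows much more slowly than raw distance once flying becomes the faster option, which is the compression the LA versus Richmond comparison points at. Whether this helps the model more than raw or log-scaled distance is an empirical question.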

And once the right approach to distances is determined, how does one go about imputing distances for missing values?

No correct solution

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange