Question

I have a dataset in which one of the important features is geographic distance from NYC. Of course, some of the values are missing. The goal is to predict whether people with certain attributes (proximity being one of them; the usual age, sex, education, etc. being the others) will engage in an activity in NYC (e.g., visiting MoMA, seeing a Broadway show, moving to the city altogether, enrolling in an area school, things like that).

My basic question, missing values aside, is this: is it correct to use the distances "as is", or should they somehow be split into "driving/train distances" and "flying distances" (essentially, converting them into "the number of hours it takes someone to get to NYC by the most efficient means")?

Take Los Angeles and Richmond, VA as examples: the distance to NYC from LA is about 10 times that from Richmond, and the flight time is only about 4 times longer, yet the flight time from LA and the drive time from Richmond are approximately the same. So what's the right way to think about that?
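To make the "hours by the most efficient means" idea concrete, here is a minimal sketch of what such a conversion could look like. Everything in it is an assumption made for illustration: the function name `travel_hours`, the average speeds, and the fixed airport overhead are hypothetical placeholders, not measured values, and a real feature would more likely come from a routing or schedule data source.

```python
# Hypothetical constants, chosen only for illustration (not measured values).
DRIVE_MPH = 55.0          # assumed average door-to-door driving speed
FLIGHT_MPH = 500.0        # assumed average speed while in the air
FLIGHT_OVERHEAD_H = 3.0   # assumed fixed airport overhead (check-in, security, transfers)


def travel_hours(distance_miles: float) -> float:
    """Estimate hours to reach NYC by the faster of driving or flying."""
    drive = distance_miles / DRIVE_MPH
    fly = FLIGHT_OVERHEAD_H + distance_miles / FLIGHT_MPH
    return min(drive, fly)


# Two illustrative distances, one 10 times the other.
near, far = 250.0, 2500.0
print(travel_hours(near), travel_hours(far))
```

Under these assumed numbers, a tenfold difference in raw distance shrinks to a much smaller difference in travel time, which is the kind of compression the LA versus Richmond comparison suggests. Whether raw distance, estimated travel time, or both should go into the model is exactly the open question.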

And once the right approach to distances is determined, how does one go about imputing distances for missing values?
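For reference, one simple baseline for the missing values is median imputation combined with a "was missing" indicator column, so that a model can still pick up any signal in the missingness itself. The sketch below is only that baseline, not a claim that it is the right approach here; the helper name `impute_with_indicator` is hypothetical, and missing values are assumed to be represented as `None`.

```python
from statistics import median


def impute_with_indicator(values):
    """Fill missing entries (None) with the median of the observed ones.

    Returns the imputed list and a parallel 0/1 list marking which
    entries were originally missing. Raises statistics.StatisticsError
    if no value is observed at all.
    """
    observed = [v for v in values if v is not None]
    fill = median(observed)
    imputed = [fill if v is None else v for v in values]
    was_missing = [1 if v is None else 0 for v in values]
    return imputed, was_missing


# Illustrative distance-like values with one gap.
filled, flags = impute_with_indicator([1.0, None, 3.0, 8.0])
print(filled, flags)
```

The same helper works whether the column holds raw distances or estimated travel hours; richer options, such as predicting the missing value from the other attributes, would need to be validated against the actual data.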

No accepted answer

License: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange