Question

My data looks like this:

[image: table showing a sample of the data]

birth_date has 634,990 missing values
gender has 328,849 missing values

Both of these are substantial amounts, since I have about 900k entries (roughly 70% and 37% of rows, respectively), so I can't simply discard rows with missing values. For birth_date, someone recommended using Multivariate Imputation by Chained Equations (MICE). I don't know what predictive model I should use for gender. Of the non-missing data, there are 5x more males than females.
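To make the MICE suggestion concrete, here is a rough sketch of how I understand it could be applied to birth_date, using scikit-learn's IterativeImputer (a MICE-style imputer). The data here is synthetic and the column names (trip_duration, start_hour, birth_year) are placeholders I made up for illustration, not my real columns. It also assumes birth_date has been reduced to a numeric birth year.

```python
import numpy as np
import pandas as pd
# IterativeImputer is still experimental in scikit-learn and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
n = 1000

# Synthetic stand-in for the real table; all column names are hypothetical.
df = pd.DataFrame({
    "trip_duration": rng.normal(900, 300, n),
    "start_hour": rng.integers(0, 24, n).astype(float),
    "birth_year": rng.integers(1950, 2005, n).astype(float),
})

# Knock out about 70% of birth_year to mimic the missing rate in my data.
missing = rng.random(n) < 0.7
df.loc[missing, "birth_year"] = np.nan

# Each feature with missing values is modelled from the other features,
# cycling for up to max_iter rounds. sample_posterior=True draws imputations
# from the predictive distribution instead of using the point prediction.
imputer = IterativeImputer(max_iter=10, sample_posterior=True, random_state=0)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(imputed["birth_year"].describe())
```

As far as I can tell, this only handles numeric columns directly, so it does not answer the gender part: gender is categorical and would need a classifier or some encoding step, which is what I'm unsure about.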

Can someone tell me what would be best practice here? What would be the best way to fill in the missing values for gender?

I'm using the data to predict bike-ride duration and final destination (I know they're shown in the table above).

No correct solution

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange