Question

I have the following dataset: [image not included]

This is test case 1. My goal is to fill in the data for the missing years. Since age, sex and smoking do not change, I need to predict the condition and percent values for years 0 through 54. I found a high correlation between the condition and percent variables, so this seems easy, but I am a bit confused. Should I use multivariable regression? What would be the best approach?


Solution

The best approach is to prepare the data first:

  • Remove features (columns) that have no variance (for example with VarianceThreshold from sklearn.feature_selection)
  • One-hot encode the categorical features
  • Add a lag column, i.e. the target shifted by t time steps
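
The three preparation steps above can be sketched roughly as follows. Since the original dataset is not shown, the column names (year, age, sex, condition, percent), the toy values and the one-step lag are all assumptions made for illustration only.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data: column names and values are assumptions,
# because the original dataset is not available.
df = pd.DataFrame({
    "year": [55, 56, 57, 58, 59, 60],
    "age": [40, 40, 40, 40, 40, 40],        # constant within the test case
    "sex": ["M", "M", "M", "M", "M", "M"],  # constant categorical feature
    "condition": [1.0, 1.2, 1.5, 1.9, 2.4, 3.0],
    "percent": [10.0, 11.5, 13.0, 15.5, 18.0, 21.0],
})

# One-hot encode the categorical feature so every column is numeric.
encoded = pd.get_dummies(df, columns=["sex"], dtype=float)

# Drop columns whose variance is zero (here: age and sex_M).
selector = VarianceThreshold(threshold=0.0)
selector.fit(encoded)
kept_columns = encoded.columns[selector.get_support()]
prepared = encoded[kept_columns].copy()

# Add a lag column: the previous year's percent value (lag of 1 step,
# chosen arbitrarily for this sketch).
prepared = prepared.sort_values("year")
prepared["percent_lag1"] = prepared["percent"].shift(1)

# The first row has no previous value, so it is dropped.
prepared = prepared.dropna().reset_index(drop=True)
print(prepared)
```

Note that the one-hot step comes before the variance filter here, so a categorical column that never changes is removed along with the constant numeric ones.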

If you have more than one explanatory variable, the method is called multiple linear regression. Instead of a regression model, you could also use other learners such as XGBoost or an LSTM.
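
As a minimal sketch of multiple linear regression, the example below fits scikit-learn's LinearRegression with two explanatory variables (year and condition) to predict percent. The data is synthetic and generated from a known linear rule, and the condition value used for the earlier year is a made-up placeholder. In the real problem, condition for the missing years would itself have to be estimated first.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption): percent follows an exact linear rule
# so that the fitted coefficients are easy to check.
train = pd.DataFrame({
    "year": [55, 56, 57, 58, 59, 60],
    "condition": [1.0, 1.2, 1.5, 1.9, 2.4, 3.0],
})
train["percent"] = 0.5 * train["year"] + 2.0 * train["condition"] + 1.0

features = ["year", "condition"]  # two explanatory variables

model = LinearRegression()
model.fit(train[features], train["percent"])

print("coefficients:", model.coef_)
print("intercept:", model.intercept_)

# Predict percent for an earlier year. The condition value 0.9 is a
# hypothetical placeholder, not something derived from the source data.
earlier = pd.DataFrame({"year": [54], "condition": [0.9]})
predicted_percent = model.predict(earlier[features])
print("predicted percent for year 54:", predicted_percent[0])
```

A tree-based learner such as XGBoost could be swapped in through the same fit/predict pattern, though tree models generally do not extrapolate beyond the range of years seen in training, which matters when the missing years lie outside the observed ones.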

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange