Assigning values to missing target vector values in scikit-learn

https://datascience.stackexchange.com/questions/9612

16-10-2019

Question

I have a dataset containing data on temperature, precipitation, and soybean yields for a farm for 10 years (2005 - 2014). I would like to predict yields for 2015 based on this data.

Please note that the dataset has DAILY values for temperature and precipitation, but only one value per year for the yield (since the crop is harvested at the end of its growing season).

I would like to build a regression or some other machine learning model that predicts 2015 yields from the relation between yields, temperature, and precipitation observed in previous years.

Following the question "Building a machine learning model to predict crop yields based on environmental data", I am using sklearn.cross_validation.LabelKFold, giving all days of a given year the same label.
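For readers on newer scikit-learn releases: the sklearn.cross_validation module has since been removed, and LabelKFold is now called GroupKFold in sklearn.model_selection. A minimal sketch of the grouping described here, assuming one row per day and using a placeholder feature matrix instead of real data (the names `X`, `years`, and `folds` are illustrative only):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold

# One row per day for 2005-2014; the feature values are placeholders.
dates = pd.date_range("2005-01-01", "2014-12-31", freq="D")
years = dates.year.to_numpy()       # group label: the year each day belongs to
X = np.zeros((len(dates), 2))       # stand-in for daily temperature, precipitation

# GroupKFold keeps all rows that share a group label in the same fold,
# so no year is split between the training and test sides.
cv = GroupKFold(n_splits=5)
folds = list(cv.split(X, groups=years))

for train_idx, test_idx in folds:
    print(sorted(set(years[test_idx].tolist())))
```

Each printed list shows the years held out in one fold; the remaining years form the training set for that fold.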

My question: since I have only a single target value per year, do I need to interpolate to fill in target values for all the other days of the year? Or should I just use the same target value for every day of the year?

Solution

The model likely won't have much predictive power if the input is a single day. No weather patterns longer than one day can be captured that way.

Instead, you should aggregate the days together. You can come up with different features that describe a larger, aggregated unit of time (a month, a year). For example, mean precipitation is a very simple one. Binning the data and using the counts within those bins would also work.
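As a rough sketch of this kind of aggregation with pandas, assuming the daily data sits in a DataFrame with `date`, `temperature`, and `precipitation` columns: the data below is synthetic, and the column names, feature names, and bin thresholds are all made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic daily weather for 2005-2014 (illustrative values only).
rng = np.random.default_rng(0)
dates = pd.date_range("2005-01-01", "2014-12-31", freq="D")
daily = pd.DataFrame({
    "date": dates,
    "temperature": rng.normal(15.0, 10.0, len(dates)),
    "precipitation": rng.gamma(0.5, 4.0, len(dates)),
})
daily["year"] = daily["date"].dt.year

# Simple aggregates: one row per year.
features = daily.groupby("year").agg(
    mean_temp=("temperature", "mean"),
    mean_precip=("precipitation", "mean"),
    total_precip=("precipitation", "sum"),
)

# Binned counts: number of days per year in each precipitation bin.
bins = [-np.inf, 1.0, 10.0, np.inf]   # hypothetical thresholds
labels = ["dry_days", "wet_days", "very_wet_days"]
daily["precip_bin"] = pd.cut(
    daily["precipitation"], bins=bins, labels=labels
).astype(str)
bin_counts = pd.crosstab(daily["year"], daily["precip_bin"])

# Final table: one row of features per year.
features = features.join(bin_counts)
print(features)
```

The resulting table has one row per year, so each row lines up with exactly one yield value. The same pattern works for monthly aggregates by grouping on year and month and then pivoting the months into columns.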

More advanced options would roll the time all the way up to a full year and learn a feature set at that level.

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange