Different strategies for dealing with features with multiple values per sample in python machine learning models

datascience.stackexchange https://datascience.stackexchange.com/questions/86244

Question

I have a dataset which contains pregnancy, maternal, foetal and children data and I am developing a predictive machine learning model to predict adverse pregnancy outcomes.

The dataset contains mostly features with a single value per pregnancy, e.g. maternalObesity = ["Yes", "No"]. However, I have some features with multiple values per pregnancy, such as the foetal abdominal circumference and the estimated foetal weight, which are recorded several times at different points during gestation (so each pregnancy has between 1 and 26 observations for each of these features), like so:

PregnancyID     gestationWeek    abdomcirc     maternalObesity
1               13               200           Yes
1               18               240           Yes
1               30               294           Yes
2               11               156           No
2               20               248           No

So in pregnancy 1, we can see that the abdominal circumference was recorded 3 times at weeks 13, 18 and 30.

All the questions I have seen here that address the issue of multiple values per sample have been about categorical features, like this and this. There, the suggested solution was to one-hot encode the features. However, as I said, this does not apply in my case, as I have continuous (float) variables.

I have spent the last few months attempting different methods to handle these features in a way that does not lose any valuable information. Simply adding these features to my dataset would result in almost-duplicate rows, as the vast majority of my samples have single values (like in the table above).

Here are some of the different approaches I have considered to handle these features:

  1. Derive statistical values from the features, like here. So I compute the mean, maximum, minimum, variance, range, etc. of all the observations per pregnancy (see the pandas sketch after this list). However, the drawback of this approach is that the time at which the values were recorded is neglected. The time of the measurement may be significant, as a higher abdominal circumference earlier in pregnancy may be more correlated with the adverse outcome I am trying to predict.

  2. Summarise the measurements into a fixed number of features by grouping them into the 3 trimesters, like here (also shown in the sketch after this list). So I can group all measurements by trimester, and each feature would hold the maximum measurement recorded during that trimester.

So my dataset will look like this:

PregnancyID     MotherID    abdomCirc1st  abdomCirc2nd   abdomCirc3rd
1               1           200           315            350
2               2           156           248            NaN

This approach takes into account the time range of the measurements, but will result in a lot of NaNs in the new features, as many pregnancies do not have a measurement for every trimester. Also, keeping only the maximum discards some statistical information, unlike approach 1.

  3. I initially thought about using a python list for these features. However, I do not know whether a machine learning model can handle this data type, and again, the time each measurement was taken is neglected in this approach.

So my data will look something like this:

PregnancyID     maternalObesity    abdomcirc
1               Yes                [200, 240, 294]
2               No                 [156, 248]
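
To make approaches 1 and 2 concrete, here is a minimal pandas sketch applied to the long-format example table above. The trimester boundaries (weeks 13 and 27) and the names of the derived columns are assumptions of mine, not something fixed by the data:

    import pandas as pd

    # Long-format measurements, as in the example table above
    df = pd.DataFrame({
        "PregnancyID":     [1, 1, 1, 2, 2],
        "gestationWeek":   [13, 18, 30, 11, 20],
        "abdomcirc":       [200, 240, 294, 156, 248],
        "maternalObesity": ["Yes", "Yes", "Yes", "No", "No"],
    })

    # Approach 1: summary statistics per pregnancy (the timing of each
    # measurement is lost)
    stats = (df.groupby("PregnancyID")["abdomcirc"]
               .agg(["mean", "max", "min", "var"])
               .add_prefix("abdomcirc_"))

    # Approach 2: bin measurements into trimesters, keep the maximum per
    # trimester, and pivot to one column per trimester (NaN where no
    # measurement exists)
    df["trimester"] = pd.cut(df["gestationWeek"], bins=[0, 13, 27, 45],
                             labels=["abdomCirc1st", "abdomCirc2nd", "abdomCirc3rd"])
    trimesters = df.pivot_table(index="PregnancyID", columns="trimester",
                                values="abdomcirc", aggfunc="max", observed=False)

    # One row per pregnancy, ready to join onto the single-value features
    wide = stats.join(trimesters)
    print(wide)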

In conclusion, I need some guidance, as I have found few examples and resources about this issue. Please advise on the best approach in this case; if there are any detailed examples out there that address this issue, I would appreciate them.


Solution

You should split "abdomCirc" into multiple separate features (e.g. one for each month).

Then handle the NaNs as usual, i.e.:

  • Remove columns that have more NaNs than a threshold.
  • Fill in NaNs for columns where only a few values are missing.

If the above is not workable because of the NaN counts, then you should accept that you don't have enough data and drop the feature set (i.e. abdomCirc) altogether.
Otherwise you risk finding a pattern that doesn't exist, because of the very small number of available data points for abdomCirc.
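
A minimal sketch of that NaN handling, assuming the wide per-trimester layout from the question (a third, entirely missing pregnancy is added here just so the threshold has an effect; the 0.5 threshold and the median imputation are arbitrary illustrative choices):

    import pandas as pd
    from sklearn.impute import SimpleImputer

    # Wide layout: one abdomCirc column per trimester (or month),
    # NaN where no measurement was taken
    wide = pd.DataFrame({
        "PregnancyID":  [1, 2, 3],
        "abdomCirc1st": [200.0, 156.0, None],
        "abdomCirc2nd": [315.0, 248.0, None],
        "abdomCirc3rd": [350.0, None,  None],
    }).set_index("PregnancyID")

    # 1. Remove columns whose NaN fraction exceeds a threshold (0.5 here)
    nan_fraction = wide.isna().mean()
    kept = wide.loc[:, nan_fraction <= 0.5]

    # 2. Fill the remaining NaNs, e.g. with the column median
    imputer = SimpleImputer(strategy="median")
    imputed = pd.DataFrame(imputer.fit_transform(kept),
                           index=kept.index, columns=kept.columns)
    print(imputed)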

I initially thought about using a python list for these features. However, I do not know if a machine learning model can handle this data type

No model will accept this data type, i.e. a list.
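
So a list column would have to be expanded into fixed-width numeric columns before being fed to a model. A small sketch, assuming the list layout shown in the question (note that the resulting columns are purely positional and still ignore the gestation week):

    import pandas as pd

    df = pd.DataFrame({"PregnancyID": [1, 2],
                       "abdomcirc": [[200, 240, 294], [156, 248]]})

    # Expand the ragged lists into fixed-width columns;
    # shorter lists are padded with NaN
    expanded = (pd.DataFrame(df["abdomcirc"].tolist(), index=df["PregnancyID"])
                  .add_prefix("abdomcirc_"))
    print(expanded)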

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange