How to compute score and predict for outcome after N days
Question
Let's say I have a medical/EHR dataset that is retrospective and longitudinal in nature, meaning one person has multiple measurements across multiple time points in the past. I did post this elsewhere but couldn't get any response, so I'm posting it here.
This dataset contains information about patients' diagnoses, mortality flags, labs, admissions, drugs consumed, etc.
Now, if I would like to find predictors that influence mortality (whether the patient will die or not), I can use logistic regression.
But my objective is to find the predictors that can help me predict whether a person will die in the next 30 days or the next 240 days. How can I do this using ML/data analysis techniques?
In addition, I would also like to compute a score that indicates the likelihood that a person will die in the next 30 days. How can I compute such scores? Any tutorial links on how this kind of score is derived, please?
Can you please let me know which analytic techniques I can use to address this problem, and the different approaches to calculating such a score?
I would like to read up on and try solving problems like this.
Solution
This could be seen as a "simple" binary classification problem. I mean the type of problem is "simple"; the task itself certainly isn't. And I'm not even going to mention the serious ethical issues around its potential applications!
First, you obviously need an entry in your data for a patient's death; it's not totally clear to me whether you have this information. It's important that whenever a patient has died this is recorded in the data, otherwise you cannot distinguish the two classes.
So the design could be like this:
- An instance represents a single patient history at time $t$, and it is labelled as either alive or dead at $t+N$ days.
- This requires refactoring the data. Assuming the data spans a period from 0 to $T$, you can take multiple points in time $t$ with $t<T-N$ (for instance, every month from 0 to $T-N$). Note that in theory different times $t$ for the same patient can be used in the data, as long as all the instances consistently represent the same duration and their features and labels are calculated accordingly.
- Designing the features is certainly the tricky part: of course the features must have values for all the instances, so you cannot rely on specific tests that were done only on some of the patients (well, you can, but it introduces a bias for those features).
- To be honest, I doubt this part can be done reliably: either the features are standard, homogeneous indicators, which are probably poor predictors of death in general; or they include specialized diagnostic tests for some patients, in which case they are not homogeneous across patients, and the model will be biased and likely to overfit.
Ideally I would recommend splitting between training and test data before even preparing the data this way, typically by picking one period of time for the training data and another for the test data.
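As a minimal sketch of that idea (assuming pandas and a hypothetical long-format table with one row per patient per year; the column names are made up for illustration):

```python
import pandas as pd

# Hypothetical long-format EHR table: one row per patient per year.
df = pd.DataFrame({
    "patientId": [1, 1, 1, 2, 2, 2],
    "year":      [2000, 2001, 2002, 2000, 2001, 2002],
    "indicator": [26, 34, 18, 47, 67, 56],
})

# Split by time period rather than by random rows: train on early years,
# test on later ones, so the test period never leaks into training.
train = df[df["year"] <= 2001]
test  = df[df["year"] >  2001]
```

The cutoff year here is arbitrary; the point is that the split happens on the raw data, before any instances are built.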
Once the data is prepared, in theory any binary classification method can be applied. Of course a probabilistic classifier can be used to predict a probability, but this can be misleading, so be very careful: the probability itself is a prediction; it cannot be interpreted as the patient's true chance of dying or not. For example, Naive Bayes is known empirically to give extreme probabilities, i.e. close to 0 or close to 1, and quite often its prediction is completely wrong. This means that in general the predicted probability is only a guess; it cannot be used to represent confidence.
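To make the "score" part concrete, here is a hedged sketch using scikit-learn on synthetic stand-in features (since no real instance table is given); any probabilistic classifier exposing `predict_proba` would work the same way:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Toy stand-in features (e.g. recent indicator values) and 0/1 mortality
# labels; replace with the real instance table once the data is prepared.
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

clf = LogisticRegression().fit(X, y)
# predict_proba returns a score in [0, 1] per class; column 1 is the
# "dies within N days" score. Treat it as a ranking score, not as the
# true probability of death, unless the model is explicitly calibrated.
scores = clf.predict_proba(X)[:, 1]
```

If calibrated probabilities matter, scikit-learn's `CalibratedClassifierCV` can be wrapped around the classifier, but the caveat above still applies.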
[edit: example]
Let's say we have:
- data for years 2000 to 2005
- N=1, i.e. we look at whether a patient dies within the next year.
- a single indicator, for instance say cholesterol level. Of course in reality you would have many other features.
- for every time $t$, the features represent the "test value" for the past two years up to the current year $t$. This means we can iterate $t$ from 2002 (2000+2) to 2004 (2005-N).
Let's imagine the following data (to simplify I assume the time unit is year):
patientId  birthYear  year  indicator
1          1987       2000  26
1          1987       2001  34
1          1987       2002  18
1          1987       2003  43
1          1987       2004  31
1          1987       2005  36
2          1953       2000  47
2          1953       2001  67
2          1953       2002  56
2          1953       2003  69
2          1953       2004  -  DEATH
3          1969       2000  37
3          1969       2001  31
3          1969       2002  25
3          1969       2003  27
3          1969       2004  15
3          1969       2005  -  DEATH
4          1936       2000  41
4          1936       2001  39
4          1936       2002  43
4          1936       2003  43
4          1936       2004  40
4          1936       2005  38
That would be transformed into this:
patientId  yearT  age  indicatorT-2  indicatorT-1  indicatorT-0  label
1          2002   15   26            34            18            0
1          2003   16   34            18            43            0
1          2004   17   18            43            31            0
2          2002   49   47            67            56            0
2          2003   50   67            56            69            1
3          2002   33   37            31            25            0
3          2003   34   31            25            27            0
3          2004   35   25            27            15            1
4          2002   66   41            39            43            0
4          2003   67   39            43            43            0
4          2004   68   43            43            40            0
Note that I included the first two columns only to show how the data is calculated; they are not part of the features.
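The transformation above can be sketched in pandas as follows (the column names and the `death_year` lookup are assumptions made to mirror the toy example; in real EHR data the death event would come from the mortality flag):

```python
import pandas as pd

# Toy long-format data from the example (no indicator row in the death year).
raw = pd.DataFrame({
    "patientId": [1]*6 + [2]*4 + [3]*5 + [4]*6,
    "birthYear": [1987]*6 + [1953]*4 + [1969]*5 + [1936]*6,
    "year": (list(range(2000, 2006)) + list(range(2000, 2004))
             + list(range(2000, 2005)) + list(range(2000, 2006))),
    "indicator": [26, 34, 18, 43, 31, 36,
                  47, 67, 56, 69,
                  37, 31, 25, 27, 15,
                  41, 39, 43, 43, 40, 38],
})
death_year = {2: 2004, 3: 2005}  # patients 1 and 4 survive the study period

N = 1  # predict death within the next N years
rows = []
for pid, g in raw.groupby("patientId"):
    g = g.set_index("year")
    for t in range(2002, 2005):  # t-2 data must exist, t+N must be observable
        if not {t - 2, t - 1, t} <= set(g.index):
            continue  # patient already dead at time t
        rows.append({
            "patientId": pid, "yearT": t,
            "age": t - g["birthYear"].iloc[0],
            "indicatorT-2": g.loc[t - 2, "indicator"],
            "indicatorT-1": g.loc[t - 1, "indicator"],
            "indicatorT-0": g.loc[t, "indicator"],
            "label": int(death_year.get(pid, 10**9) <= t + N),
        })
instances = pd.DataFrame(rows)
```

Running this reproduces the eleven instances of the table above, with exactly two positive labels (patient 2 at $t=2003$ and patient 3 at $t=2004$).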
OTHER TIPS
To clarify the questions raised by the user in response to the correct solution given by Erwan: the solution proposes going back in time to prepare the data across a series of timestamps.
There will be multiple points in time $t$ where the input is all the various features on the patient's health, medication, reports, etc.; you need to see how best these can be converted to representational vectors. The label is binary and indicates whether the patient was alive after $t+N$ days, where N can be 30, 60, 240, etc. $t$ itself can be taken week by week or month by month.
Once the data is prepared this way, it becomes a binary classification exercise.
The only additional consideration is that there could be an element of RNNs here: the training instances are not independent of one another and may contain recurring data for the same patient over multiple timestamps, so there may be scope for capturing this structure to model the situation better.
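As a small illustration of what feeding this to a recurrent model would involve (a hedged sketch with NumPy only; the values are patient 1's instances from the example above):

```python
import numpy as np

# Hypothetical per-instance features (3 indicator values) for one patient
# across consecutive times t, as produced by the refactoring step.
patient_instances = np.array([
    [26, 34, 18],   # t = 2002
    [34, 18, 43],   # t = 2003
    [18, 43, 31],   # t = 2004
])

# Recurrent models typically consume arrays of shape
# (batch, timesteps, features), so each patient's instances are stacked
# into one sequence rather than treated as independent rows.
sequence = patient_instances[np.newaxis, :, :]
```

Each patient then contributes one variable-length sequence, which is where the shared structure between a patient's instances can be exploited.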