How to compute score and predict for outcome after N days
Question
Let's say I have a medical/EHR dataset that is retrospective and longitudinal in nature, meaning one person has multiple measurements across multiple time points in the past. I did post this elsewhere but couldn't get any response, so I'm posting it here.
This dataset contains information about patients' diagnoses, mortality flags, labs, admissions, drugs consumed, etc.
Now, if I would like to find predictors that influence mortality (whether the patient will die or not), I can use logistic regression.
But my objective is to find the predictors that can help me predict whether a person will die in the next 30 days or the next 240 days. How can I do this using ML/data analysis techniques?
In addition, I would also like to compute a score that indicates the likelihood that a person will die in the next 30 days. How can I compute such scores? Any tutorial links on how this kind of score is derived, please?
Can you please let me know which analytic techniques I can use to address this problem, and the different approaches to calculating such a score?
I would like to read up on and try solving problems like this.
Solution
This could be seen as a "simple" binary classification problem. I mean the type of problem is "simple"; the task itself certainly isn't. And I'm not even going to mention the serious ethical issues around its potential applications!
First, you obviously need an entry in your data for a patient's death; it's not totally clear to me whether you have this information. It's important that whenever a patient has died this is recorded in the data, otherwise you cannot distinguish the two classes.
So the design could be like this:
- An instance represents a single patient history at time $t$, and it is labelled as either alive or dead at $t+N$ days.
- This requires refactoring the data. Assuming the data spans a period from 0 to $T$, you can take multiple points in time $t$ with $t<T-N$ (for instance, every month from 0 to $T-N$). Note that in theory different times $t$ for the same patient can be used in the data, as long as all the instances consistently represent the same duration and their features and labels are calculated accordingly.
- Designing the features is certainly the tricky part: of course the features must have values for all the instances, so you cannot rely on specific tests that were done only on some of the patients (well, you can, but it introduces a bias for those features).
- To be honest, I doubt this part can be done reliably: either the features are standard, homogeneous indicators, which are probably poor predictors of death in general; or they include specialized diagnostic tests for some patients, in which case they are not homogeneous across patients, and the model will be biased and likely to overfit.
Ideally I would recommend splitting between training and test data before even preparing the data this way, typically by picking one period of time for the training data and another for the test data.
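As a minimal sketch of that idea (assuming pandas and a hypothetical long-format table with one row per patient per year; the column names are made up for illustration):

```python
import pandas as pd

# Hypothetical long-format EHR table: one row per patient per year.
df = pd.DataFrame({
    "patientId": [1, 1, 1, 2, 2, 2],
    "year":      [2000, 2001, 2002, 2000, 2001, 2002],
    "indicator": [26, 34, 18, 47, 67, 56],
})

# Split by time period rather than by random rows: train on early years,
# test on later ones, so the test period never leaks into training.
train = df[df["year"] <= 2001]
test  = df[df["year"] >  2001]
```

The cutoff year here is arbitrary; the point is that the split happens on the raw data, before any instances are built.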
Once the data is prepared, in theory any binary classification method can be applied. Of course a probabilistic classifier can be used to predict a probability, but this can be misleading, so be very careful: the probability itself is a prediction; it cannot be interpreted as the patient's true chance of dying or not. For example, Naive Bayes is known empirically to give extreme probabilities, i.e. close to 0 or close to 1, and quite often its prediction is completely wrong. This means that in general the predicted probability is only a guess; it cannot be used to represent confidence.
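To make the "score" part concrete, here is a hedged sketch using scikit-learn on synthetic stand-in features (since no real instance table is given); any probabilistic classifier exposing `predict_proba` would work the same way:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Toy stand-in features (e.g. recent indicator values) and 0/1 mortality
# labels; replace with the real instance table once the data is prepared.
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

clf = LogisticRegression().fit(X, y)
# predict_proba returns a score in [0, 1] per class; column 1 is the
# "dies within N days" score. Treat it as a ranking score, not as the
# true probability of death, unless the model is explicitly calibrated.
scores = clf.predict_proba(X)[:, 1]
```

If calibrated probabilities matter, scikit-learn's `CalibratedClassifierCV` can be wrapped around the classifier, but the caveat above still applies.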
[edit: example]
Let's say we have:
- data for years 2000 to 2005
- N=1, i.e. we look at whether a patient dies within the next year.
- a single indicator, for instance say cholesterol level. Of course in reality you would have many other features.
- for every time $t$, the features represent the "test value" for the past two years up to the current year $t$. This means we can iterate $t$ from 2002 (2000+2) to 2004 (2005-N).
Let's imagine the following data (to simplify I assume the time unit is year):
patientId  birthYear  year  indicator
1          1987       2000  26
1          1987       2001  34
1          1987       2002  18
1          1987       2003  43
1          1987       2004  31
1          1987       2005  36
2          1953       2000  47
2          1953       2001  67
2          1953       2002  56
2          1953       2003  69
2          1953       2004  -  DEATH
3          1969       2000  37
3          1969       2001  31
3          1969       2002  25
3          1969       2003  27
3          1969       2004  15
3          1969       2005  -  DEATH
4          1936       2000  41
4          1936       2001  39
4          1936       2002  43
4          1936       2003  43
4          1936       2004  40
4          1936       2005  38
That would be transformed into this:
patientId  yearT  age  indicatorT-2  indicatorT-1  indicatorT-0  label
1          2002   15   26            34            18            0
1          2003   16   34            18            43            0
1          2004   17   18            43            31            0
2          2002   49   47            67            56            0
2          2003   50   67            56            69            1
3          2002   33   37            31            25            0
3          2003   34   31            25            27            0
3          2004   35   25            27            15            1
4          2002   66   41            39            43            0
4          2003   67   39            43            43            0
4          2004   68   43            43            40            0
Note that I included the first two columns only to show how the data is calculated; they are not part of the features.
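The transformation above can be sketched in pandas as follows (the column names and the `death_year` lookup are assumptions made to mirror the toy example; in real EHR data the death event would come from the mortality flag):

```python
import pandas as pd

# Toy long-format data from the example (no indicator row in the death year).
raw = pd.DataFrame({
    "patientId": [1]*6 + [2]*4 + [3]*5 + [4]*6,
    "birthYear": [1987]*6 + [1953]*4 + [1969]*5 + [1936]*6,
    "year": (list(range(2000, 2006)) + list(range(2000, 2004))
             + list(range(2000, 2005)) + list(range(2000, 2006))),
    "indicator": [26, 34, 18, 43, 31, 36,
                  47, 67, 56, 69,
                  37, 31, 25, 27, 15,
                  41, 39, 43, 43, 40, 38],
})
death_year = {2: 2004, 3: 2005}  # patients 1 and 4 survive the study period

N = 1  # predict death within the next N years
rows = []
for pid, g in raw.groupby("patientId"):
    g = g.set_index("year")
    for t in range(2002, 2005):  # t-2 data must exist, t+N must be observable
        if not {t - 2, t - 1, t} <= set(g.index):
            continue  # patient already dead at time t
        rows.append({
            "patientId": pid, "yearT": t,
            "age": t - g["birthYear"].iloc[0],
            "indicatorT-2": g.loc[t - 2, "indicator"],
            "indicatorT-1": g.loc[t - 1, "indicator"],
            "indicatorT-0": g.loc[t, "indicator"],
            "label": int(death_year.get(pid, 10**9) <= t + N),
        })
instances = pd.DataFrame(rows)
```

Running this reproduces the eleven instances of the table above, with exactly two positive labels (patient 2 at $t=2003$ and patient 3 at $t=2004$).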
OTHER TIPS
To clarify the questions raised by the user in response to the correct solution given by Erwan: the solution proposes going back in time to prepare the data across a series of timestamps.
There will be multiple points in time $t$ where the input is all the various features on the patient's health, medication, reports, etc.; you need to see how best these can be converted to representational vectors. The label is binary and indicates whether the patient was alive after $t+N$ days, where N can be 30, 60, 240, etc. $t$ itself can be taken week by week or month by month.
Once the data is prepared this way, it becomes a binary classification exercise.
The only additional consideration is that there could be an element of RNNs here: the training instances are not independent of one another and may contain recurring data for the same patient over multiple timestamps, so there may be scope for capturing this structure to model the situation better.
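As a small illustration of what feeding this to a recurrent model would involve (a hedged sketch with NumPy only; the values are patient 1's instances from the example above):

```python
import numpy as np

# Hypothetical per-instance features (3 indicator values) for one patient
# across consecutive times t, as produced by the refactoring step.
patient_instances = np.array([
    [26, 34, 18],   # t = 2002
    [34, 18, 43],   # t = 2003
    [18, 43, 31],   # t = 2004
])

# Recurrent models typically consume arrays of shape
# (batch, timesteps, features), so each patient's instances are stacked
# into one sequence rather than treated as independent rows.
sequence = patient_instances[np.newaxis, :, :]
```

Each patient then contributes one variable-length sequence, which is where the shared structure between a patient's instances can be exploited.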