Question

I would like to run a survival regression on the time until the "death" of an individual. The end goal is to estimate, for a given individual, how long it will take before they most likely "die" (for instance, the time at which the survival function drops below 0.1).

My problem is that my training set contains a variable that strongly influences my target variable but is not available in the test set (and won't occur in real life).

Let's say my training data is the following:

id   status   poison_time  death_time     sex
 0     1          90           92          f
 1     0          90          150          f 
 2     1          90           91          f  
 3     1          60          130          m
 4     0          60          150          m
 5     1          60           62          m

With:

  • status = 1 for a dead individual and 0 for a censored observation
  • poison_time : time corresponding to the injection of a poison
  • death_time : time of the death or last follow-up
  • sex : sex of the individual (not relevant here, imagine a bunch of useful variables)
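To make the goal concrete, here is a minimal sketch in plain Python that encodes the toy table above, computes a Kaplan-Meier survival curve by hand, and looks up the first time at which the estimated survival drops below a threshold. The names `ROWS`, `kaplan_meier` and `first_time_below` are made up for this illustration. The sketch deliberately ignores poison_time and the other covariates, so it only shows what "time until S(t) < threshold" means and is not a solution to the problem.

```python
# Toy training rows copied from the table above:
# (id, status, poison_time, death_time, sex)
ROWS = [
    (0, 1, 90, 92, "f"),
    (1, 0, 90, 150, "f"),
    (2, 1, 90, 91, "f"),
    (3, 1, 60, 130, "m"),
    (4, 0, 60, 150, "m"),
    (5, 1, 60, 62, "m"),
]


def kaplan_meier(rows):
    """Return [(time, S(time))] evaluated at each observed death time."""
    death_times = sorted({r[3] for r in rows if r[1] == 1})
    surv = 1.0
    curve = []
    for t in death_times:
        # Individuals still under observation just before time t.
        at_risk = sum(1 for r in rows if r[3] >= t)
        # Deaths (status == 1) occurring exactly at time t.
        deaths = sum(1 for r in rows if r[3] == t and r[1] == 1)
        surv *= 1 - deaths / at_risk
        curve.append((t, surv))
    return curve


def first_time_below(curve, threshold):
    """First death time where the estimated survival is below threshold."""
    for t, s in curve:
        if s < threshold:
            return t
    return None  # the curve never drops below the threshold


curve = kaplan_meier(ROWS)
for t, s in curve:
    print(t, round(s, 3))
print(first_time_below(curve, 0.4))
print(first_time_below(curve, 0.1))
```

On these six rows the estimated curve never falls below 0.1, because two individuals are still alive at the last follow-up, so the second lookup returns `None`.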

I can't just ignore the influence of poison_time: it has a real impact on death_time, even though the poison is less effective for some individuals (the individual with id 3, or those who ended up right-censored).
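The pattern described above can be checked on the toy rows by computing the delay between the injection and the death or last follow-up. This is only a descriptive sketch; `ROWS` and `delays` are illustrative names, and the delay for a censored individual is a lower bound rather than an observed time to death.

```python
# Toy training rows copied from the table above:
# (id, status, poison_time, death_time, sex)
ROWS = [
    (0, 1, 90, 92, "f"),
    (1, 0, 90, 150, "f"),
    (2, 1, 90, 91, "f"),
    (3, 1, 60, 130, "m"),
    (4, 0, 60, 150, "m"),
    (5, 1, 60, 62, "m"),
]

# Map each id to (delay since poison injection, status).
delays = {}
for id_, status, poison_time, death_time, _sex in ROWS:
    delays[id_] = (death_time - poison_time, status)

for id_, (delay, status) in delays.items():
    label = "died" if status == 1 else "censored"
    print(f"id {id_}: {delay} time units after injection ({label})")
```

Individuals 0, 2 and 5 die within one or two time units of the injection, while individual 3 and the two censored individuals last far longer.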

In my test data the poison is not injected, but I would still like a good estimate of how long it should take before an individual most likely dies, given my other variables (sex, etc.).

Is it possible to still get relevant results with such a corrupted training set?

No correct solution

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange