In survival analysis, what is the correct way to introduce a variable that changes the survival rate but occurs at different times?

https://datascience.stackexchange.com/questions/66282

20-10-2020

Question

I am running a survival analysis using Cox proportional hazards regression. We want to analyze whether the introduction of a phenomenon influences the time until the death of an individual.

A similar example would be: we have patients who were given a medicine at different stages of the disease (and at different ages too; some were not given the medicine at all). How should the treatment variable be introduced so that we can measure the "strength" of the introduced change?

I have a few possible solutions:

Create dummy exogenous variables for the time range in which the medicine was introduced: "First 0-6 months", "6 months - 1 year", "1-2 years", "Never", measured from the birth of the patient.
Create the same exogenous variable, but not as dummies (for example, as a single numeric variable).
Change the time variable so that it is measured not from birth but from the introduction of the medicine (the problem would be how to define this variable for patients who never received the medicine).
Duplicate the rows of the individuals who received the medicine. In the first copy, the time would run from birth to the introduction of the medicine and end as a "censored death"; the second copy would cover the time between the introduction of the medicine and the actual death.

Currently, the model includes the introduction of the medicine as a dummy variable but does not take into account the time at which it was introduced.

Solution

The problem you'll run into if you are not careful is "immortal time bias". In short, a subject isn't "in" the "1-2 years" group until they have been under observation for at least 1 year. This 1-year period is called immortal because, by construction, patients in that group cannot have died during it. More concretely, if I naively partition my population into "First 0-6 months" vs "1-2 years" and measure survival, the latter group is going to look like it has much better survival in the first year, because in order to qualify for the latter group you need to live longer.
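To see the bias concretely, here is a minimal simulation sketch using only the standard library. It assumes a treatment with no effect at all on survival, and every name and number in it (`simulate_naive_comparison`, the 24-month mean lifetime, the uniform treatment schedule) is made up for illustration. It uses a simple treated/untreated split rather than the period groups, but the mechanism is the same: a subject only ends up in the treated group if they survive until their treatment date.

```python
import random


def simulate_naive_comparison(n=20000, seed=0):
    """Compare mean survival of 'treated' vs 'untreated' subjects
    when the treatment has NO effect on survival (illustrative only)."""
    rng = random.Random(seed)
    treated, untreated = [], []
    for _ in range(n):
        # Months from birth to death, exponential with mean 24 (assumed).
        death = rng.expovariate(1 / 24)
        # Month at which treatment is scheduled, independent of survival.
        scheduled = rng.uniform(0, 24)
        if scheduled < death:
            # Subject lived long enough to actually receive the treatment.
            treated.append(death)
        else:
            # Subject died before the scheduled treatment.
            untreated.append(death)
    mean_treated = sum(treated) / len(treated)
    mean_untreated = sum(untreated) / len(untreated)
    return mean_treated, mean_untreated


treated_mean, untreated_mean = simulate_naive_comparison()
print(f"naive mean survival, treated:   {treated_mean:.1f} months")
print(f"naive mean survival, untreated: {untreated_mean:.1f} months")
```

The treated group comes out with a longer mean survival even though the treatment does nothing, purely because membership in that group requires surviving until the treatment date.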

So what do you do? You need to model the time-varying nature of your data. Check out the "long" format for survival data. Below I have a Python example that uses lifelines. There are four individuals: one is never treated, and the other three are each treated in a different treatment period. We use multiple rows (with the same id) to denote different time periods for a subject (note that the start/stop intervals do not overlap). A dummy variable for each treatment period is provided.

import pandas as pd
from lifelines import CoxTimeVaryingFitter


# One row per (subject, interval). E marks whether death occurred at 'stop';
# t1/t2/t3 are dummies for the period in which treatment was received.
df = pd.DataFrame([
    {'id': 1, 'start': 0, 'stop': 12, 'E': 1, 't1': 0, 't2': 0, 't3': 0},  # never received treatment, died at t=12

    {'id': 2, 'start': 0, 'stop': 10, 'E': 1, 't1': 1, 't2': 0, 't3': 0},  # received treatment at the very start

    {'id': 3, 'start': 0, 'stop': 3,  'E': 0, 't1': 0, 't2': 0, 't3': 0},  # not yet treated; will receive treatment in the second "period"
    {'id': 3, 'start': 3, 'stop': 9,  'E': 1, 't1': 0, 't2': 1, 't3': 0},  # received treatment in the second "period"

    {'id': 4, 'start': 0, 'stop': 6,  'E': 0, 't1': 0, 't2': 0, 't3': 0},  # not yet treated; will receive treatment in the third "period"
    {'id': 4, 'start': 6, 'stop': 11, 'E': 1, 't1': 0, 't2': 0, 't3': 1},  # received treatment in the third "period"
])

# Toy data: with only four subjects the fit may warn about convergence,
# and the estimates are not meaningful. It only illustrates the data layout.
ctv = CoxTimeVaryingFitter().fit(df, id_col='id', start_col='start', stop_col='stop', event_col='E')
ctv.print_summary()
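If your data currently has one row per subject, the long format can be built by splitting each treated subject's follow-up at the treatment time. The following is a minimal sketch in plain Python, not part of lifelines: the function `split_at_treatment` and the input fields `end`, `died` and `treated_at` are hypothetical names chosen for illustration, and it uses a single `treated` indicator instead of the period dummies above.

```python
def split_at_treatment(subjects):
    """Turn one-record-per-subject data into start/stop rows.

    Each subject is a dict with hypothetical keys:
      id, end (time of death or censoring), died (1/0),
      treated_at (treatment time, or None if never treated).
    """
    rows = []
    for s in subjects:
        t = s["treated_at"]
        if t is None or t >= s["end"]:
            # Never treated during follow-up: one untreated interval.
            rows.append({"id": s["id"], "start": 0, "stop": s["end"],
                         "E": s["died"], "treated": 0})
        elif t <= 0:
            # Treated from the very start: one treated interval.
            rows.append({"id": s["id"], "start": 0, "stop": s["end"],
                         "E": s["died"], "treated": 1})
        else:
            # Untreated interval, which never ends in an event...
            rows.append({"id": s["id"], "start": 0, "stop": t,
                         "E": 0, "treated": 0})
            # ...followed by the treated interval carrying the real outcome.
            rows.append({"id": s["id"], "start": t, "stop": s["end"],
                         "E": s["died"], "treated": 1})
    return rows


# The same four individuals as in the example above.
subjects = [
    {"id": 1, "end": 12, "died": 1, "treated_at": None},
    {"id": 2, "end": 10, "died": 1, "treated_at": 0},
    {"id": 3, "end": 9,  "died": 1, "treated_at": 3},
    {"id": 4, "end": 11, "died": 1, "treated_at": 6},
]
long_rows = split_at_treatment(subjects)
for row in long_rows:
    print(row)
```

The resulting list can be wrapped in `pd.DataFrame(long_rows)` and passed to `CoxTimeVaryingFitter` with the same `id_col`, `start_col`, `stop_col` and `event_col` arguments. This is essentially the "duplicate the rows" idea from the question, with the untreated interval marked as not ending in an event.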

Another option, instead of using dummy variables, is to create an interaction with the time variable above, but I believe that will be harder to interpret.

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange