Question

I'm attempting to model customer lifetimes on subscriptions. As the data is censored I'll be using R's survival package to create a survival curve.

The original subscriptions dataset looks like this..

id  start_date  end_date
1   2013-06-01  2013-08-25
2   2013-06-01  NA
3   2013-08-01  2013-09-12

Which I manipulate to look like this..

id  tenure_in_months status(1=cancelled, 0=active)
1   2                1
2   ?                0
3   1                1

..in order to feed the survival model:

obj <- with(subscriptions, Surv(time=tenure_in_months, event=status, type="right"))
fit <- survfit(obj~1, data=subscriptions)
plot(fit)

What should I put in the tenure_in_months variable for the censored cases, i.e. the cases where the subscription is still active today: should it be the tenure up until today, or should it be NA?


Solution 2

If a missing end date means that the subscription is still active, then you need to use the time from the start date to the current date as the censoring time.

NA won't work with the survival object; I think those cases will be omitted. That is not what you want, because these cases contain important information about survival.

SQL code to get the time until the event (use it in the SELECT part of the query):

DATEDIFF(M, start_date, ISNULL(end_date, GETDATE())) AS tenure_in_months

BTW: I would use the difference in days for my analysis. It does not make sense to round the time off to months.
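The same preparation can be sketched directly in R. This is a minimal sketch, not a definitive implementation: the data frame copies the three rows from the question, and censor_date is a hypothetical extraction date chosen for illustration (Sys.Date() would be used for "today"). Tenure is measured in days, as suggested above.

```r
library(survival)

# Toy data copied from the question; an NA end_date means "still active".
subscriptions <- data.frame(
  id         = 1:3,
  start_date = as.Date(c("2013-06-01", "2013-06-01", "2013-08-01")),
  end_date   = as.Date(c("2013-08-25", NA, "2013-09-12"))
)

# Assumption: hypothetical date on which the data was extracted.
# Replace with Sys.Date() if the data is current as of today.
censor_date <- as.Date("2013-10-01")

# status: 1 = cancelled (event observed), 0 = still active (right-censored).
subscriptions$status <- as.integer(!is.na(subscriptions$end_date))

# Tenure runs to end_date when it is known, otherwise to the censor date.
last_seen <- subscriptions$end_date
last_seen[is.na(last_seen)] <- censor_date
subscriptions$tenure_in_days <-
  as.numeric(last_seen - subscriptions$start_date, units = "days")

# Kaplan-Meier fit on the right-censored data.
fit <- survfit(Surv(tenure_in_days, status) ~ 1, data = subscriptions)
summary(fit)
```

The censored row keeps a real, finite tenure and is only flagged through status = 0, so it stays in the risk set instead of being dropped as an NA would be.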

Other tips

First, I should say that I disagree with the previous answer. For a subscription that is still active today, the tenure should be recorded neither as simply the tenure up until today nor as NA. What exactly do we know about those subscriptions? We know they have lasted up until today. In other words, we don't know the exact tenure_in_months for those observations, but we do know it is longer than the tenure observed so far.

This situation is known as right censoring in survival analysis. See: http://en.wikipedia.org/wiki/Censoring_%28statistics%29

So your data would need to translate from

id  start_date  end_date
1   2013-06-01  2013-08-25
2   2013-06-01  NA
3   2013-08-01  2013-09-12

to:

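id  t1   t2
1   2    2
2   3    NA
3   1    1

Here t1 equal to t2 marks an exactly observed cancellation, and t2 = NA marks a right-censored subscription (NA stands in for infinity). No status column is needed, because type="interval2" infers the censoring from t1 and t2.

Then you will need to change your R Surv object, from:

Surv(time=tenure_in_months, event=status, type="right")

to:

Surv(t1, t2, type="interval2")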

See http://stat.ethz.ch/R-manual/R-devel/library/survival/html/Surv.html for more syntax details. A very good summary of the computational details can be found at: http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_lifereg_sect018.htm

The Surv documentation describes the two encodings as follows: "Interval censored data can be represented in two ways. For the first use type = interval and the codes shown above. In that usage the value of the time2 argument is ignored unless event=3. The second approach is to think of each observation as a time interval with (-infinity, t) for left censored, (t, infinity) for right censored, (t,t) for exact and (t1, t2) for an interval. This is the approach used for type = interval2, with NA taking the place of infinity. It has proven to be the more useful." (The codes it refers to are the event codes for interval data: 0 = right censored, 1 = event at time, 2 = left censored, 3 = interval censored.)
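To make the interval2 convention concrete, here is a small sketch that builds the three example subscriptions both ways. It assumes tenure in months as in the table above and passes only the two time bounds to Surv. The variable names s_interval and s_right are illustrative only.

```r
library(survival)

# Tenure bounds in months; t2 = NA means no observed upper bound,
# i.e. the subscription is right-censored.
t1 <- c(2, 3, 1)
t2 <- c(2, NA, 1)

# interval2 takes the two bounds and infers the censoring type:
# t1 == t2 -> exact event, t2 = NA -> right-censored.
s_interval <- Surv(t1, t2, type = "interval2")

# The same three subscriptions in the question's right-censored form.
s_right <- Surv(time = c(2, 3, 1), event = c(1, 0, 1), type = "right")

print(s_interval)
print(s_right)
```

For data that contains only exact and right-censored observations, as here, both objects carry the same information, so the plain time/status form from the question is sufficient.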

You need to know the date on which the data was collected. The tenure_in_months for id 2 should then be that date minus 2013-06-01.

Otherwise, I believe your encoding of the data is correct. The status of 0 for id 2 indicates that it is right-censored (meaning we have a lower bound on its lifetime, but not an upper bound).

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow