Question

I have a panel data set with 13 waves and roughly 10,000 individuals each year, with people entering and exiting at various time points. I am interested in what happens as people become diagnosed with a disease over time. Therefore I need to recode the time variable so that t = 0 is the wave of first diagnosis, t = 1 is the next year, and so on (and, I guess, t = -1 for the year before, etc.), so that all of my individuals are comparable. However, I am unsure how to go about this in Stata. Would anyone be able to advise? Many thanks


Solution 2

Simple but not optimal solution

Suppose diagnosis is 1 when diagnosed (at most once per person) and 0 otherwise. Then the time at diagnosis is at its simplest

 egen time_diagnosis = total(diagnosis * year), by(id) 

but you need to ignore any zeros, which arise for people who are never diagnosed. To spell that out,

 replace time_diagnosis = . if time_diagnosis == 0 

Better alternative

A more complicated but preferable alternative can handle multiple diagnoses if they occur:

 egen time_diagnosis = min(year / diagnosis), by(id) 

as year / diagnosis is year when diagnosis is 1 and missing otherwise. This yields missing values if there is no diagnosis, which is as it should be.

Then subtract that from year to get a new time variable.

 gen time2 = year - time_diagnosis 

In short, I think you can get this done in two statements, handling panel structure too.
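As a minimal sketch of the two statements on a toy panel (all data values below are made up for illustration): person 1 is diagnosed twice and person 2 is never diagnosed, so the example shows both the "first diagnosis" behaviour and the missing values for the undiagnosed.

```stata
* Toy panel: person 1 is diagnosed in 2003 and again in 2005,
* person 2 is never diagnosed. All values are made up.
clear
input id year diagnosis
1 2001 0
1 2002 0
1 2003 1
1 2004 0
1 2005 1
2 2001 0
2 2002 0
2 2003 0
end

* year / diagnosis is year where diagnosis == 1 and missing where it is 0,
* so the minimum within each id is the first diagnosis year
egen time_diagnosis = min(year / diagnosis), by(id)

* event time: 0 in the first diagnosis year, negative before, positive after
generate time2 = year - time_diagnosis

list, sepby(id)
```

For person 1 the new variable runs from -2 to 2 around 2003; for person 2 both new variables are missing.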

Update

@Richard Herron asks why use egen with by(), and not just

 gen time_diagnosis = year * diagnosis 

A limitation of that is that the "correct" value is held only in those observations for which diagnosis is 1; that value still has to be "spread" to the other observations for the same id. That is precisely what egen does here. In the simplest situation, with one diagnosis, the total of year * diagnosis is just year * 1, or year, as any zeros make no difference to the sum.
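The difference can be seen side by side in a small sketch, with year as the time variable. The three observations are made up, and product is a hypothetical variable name used only for this comparison.

```stata
* One person, diagnosed in 2002. Values are made up.
clear
input id year diagnosis
1 2001 0
1 2002 1
1 2003 0
end

* plain product: nonzero only in the diagnosis observation
generate product = year * diagnosis

* group total of the same product: the diagnosis year in every observation
egen time_diagnosis = total(year * diagnosis), by(id)

list
```

Here product is 0, 2002, 0, whereas time_diagnosis is 2002 in all three observations and can be subtracted from year directly.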

OTHER TIPS

The case of one diagnosis per person

clear all
set more off

*----- example data -----

set obs 100
set seed 2357

generate id = _n
generate year = floor(10 * runiform()) + 1990
expand 5

bysort id: replace year = year + _n
bysort id (year): generate diag = cond(_n == 3, 1, 0)

list in 1/20, sepby(id)

*----- what you seek -----

bysort id (diag): gen time = year - year[_N]

sort id year
list in 1/20

I assume the same data structure as @RichardHerron and use his example. diag is an indicator variable that takes on the value of 1 at the time of diagnosis and 0 otherwise (only one diagnosis per person is considered).

The sorting done by bysort is critical. Within each id group, the observation holding the year of diagnosis is sorted to the end, and then all that's left to do is subtract that reference year from every year. See help _variables for details on system variables like _N.

The case of multiple diagnoses per person

If several diagnoses are made per person, but we care only about the first occurrence (according to year), we could do:

gsort id diag -year
by id: gen time = year - year[_N]
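To see which observation ends up last, here is a small sketch on made-up data for one person diagnosed in 2003 and again in 2005. It reuses the two commands above unchanged; only the toy data are an addition.

```stata
* One person with two diagnoses (2003 and 2005). Values are made up.
clear
input id year diag
1 2001 0
1 2002 0
1 2003 1
1 2004 0
1 2005 1
end

* within id: diag == 0 first, then diag == 1 with the latest year first,
* so the last observation holds the earliest diagnosis year
gsort id diag -year
by id: generate time = year - year[_N]

sort id year
list
```

The result is time = -2, ..., 2 centred on 2003. One caveat, as far as I can tell: for a person with no diagnosis at all, year[_N] is simply that person's earliest year, so time would be nonmissing even though there is no diagnosis to anchor it; such ids would need to be set to missing separately.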

It is usually helpful to provide test data, but here they are easy enough to generate. The trick is to find the first year for each individual (my fyear), which I'll do with min() from egen. Then I'll subtract this first year fyear from the actual year to find the year relative to diagnosis ryear.

/* generate panel */
clear
set obs 10000
generate id = _n
generate year = floor(10 * runiform()) + 1990
expand 10
bysort id: replace year = year + _n
sort id year
list in 1/20

/* generate relative year */
bysort id: egen fyear = min(year)
generate ryear = year - fyear
list in 1/20

If the first year in the panel is not the diagnosis year, then construct fyear from the diagnosis criterion instead of from the minimum of year.
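One way this might look, as a sketch only: it assumes an indicator variable named diagnosis (1 in a diagnosis year, 0 otherwise), which is not part of the generated panel above, and uses made-up toy data.

```stata
* Toy panel with an assumed 0/1 diagnosis indicator. Values are made up.
clear
input id year diagnosis
1 2001 0
1 2002 0
1 2003 1
1 2004 0
2 2001 0
2 2002 1
2 2003 0
end

* first year in which diagnosis == 1; missing if never diagnosed
bysort id: egen fyear = min(cond(diagnosis == 1, year, .))
generate ryear = year - fyear

list, sepby(id)
```

Because min() ignores missing values, only the diagnosis years enter the minimum, so ryear is 0 in the first diagnosis year of each person.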


Edit: Thinking more on this, maybe it's the last part that you're having a hard time with (i.e., identifying the diagnosis year to subtract from the calendar year). Here's what I would do.

bysort id (year): generate diagnosis = cond(_n == 5, 1, 0)
preserve
tempfile diagnosis
keep if (diagnosis == 1)
rename year dyear
keep id dyear
save `diagnosis'
restore
merge m:1 id using `diagnosis', nogenerate
generate ryear2 = year - dyear
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow