Question

I have a list of people, registration times, and scores. In Stata I want to calculate a moving average of score based on a time window around each observation (not a window based on lagging/leading number of observations).

For example, assuming +/- 2 days on either side and not including the current observation, I'm trying to calculate something like this:

user_id   day     score  window_avg
A         1        1     1.5             = (avg of B and C)
B         1        2     1               = (avg of A and C)
C         3        1     2.25            = (avg of A, B, D, and E)
D         4        3     2               = (avg of C and E)
E         5        3     2.5             = (avg of C, D, F, and G)
F         7        1     4               = (avg of E and G)
G         7        5     2               = (avg of E and F)
H         10       3     .               = blank

I've attempted to define the dataset with tsset and then use tssmooth, but couldn't get it to work. Since there may be multiple observations for a given time period I'm not sure this is even the right approach. Also, in reality the day variable is a tc timestamp.

Solution

tsset can't help here, even if you made your times regularly spaced: you have some repeated values for time, yet your data do not qualify as panel data in Stata's sense. But the problem should yield to a loop over possibilities. First, let's take your example literally, using integer days.

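* start with all missing; values are filled in day by day
gen window_avg = .
su day, meanonly

* for each day d, summarize scores within d - 2 to d + 2,
* then remove each observation's own score from the sum and the count
qui forval d = `r(min)'/`r(max)' {
    su score if inrange(day, `d' - 2, `d' + 2), meanonly
    replace window_avg = (r(sum) - score) / (r(N) - 1) if day == `d'
}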

Here we assume no missing values. The principle to carry forward is

average of others = (sum of all - this value) / (number of values - 1)

which is discussed at http://www.stata.com/support/faqs/data-management/creating-variables-recording-properties/
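
To check the principle against the example in the question, here is a minimal sketch that rebuilds those eight observations with input and runs the same loop over integer days. The variable names are taken from the question; nothing else is assumed. The results should agree with the window_avg column in the question, with H left missing because no other observation falls in its window.

```stata
* rebuild the example data from the question
clear
input str1 user_id day score
A 1 1
B 1 2
C 3 1
D 4 3
E 5 3
F 7 1
G 7 5
H 10 3
end

gen window_avg = .
su day, meanonly

* average of others = (sum of all - this value) / (number of values - 1)
qui forval d = `r(min)'/`r(max)' {
    su score if inrange(day, `d' - 2, `d' + 2), meanonly
    replace window_avg = (r(sum) - score) / (r(N) - 1) if day == `d'
}

list, noobs
```

For H the window holds only H itself, so the calculation is 0/0, which Stata returns as missing.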

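In practice, you don't want to loop over all possible date-times in milliseconds. So, try a loop over observations of this form. Here date stands for your tc variable, and the window half-width of 2 days is expressed in milliseconds with cofd(2).

gen window_avg = .

* 2 days in milliseconds, to match the units of a tc variable
local w = cofd(2)

* for each observation, summarize scores within 2 days of its own date-time
qui forval i = 1/`=_N' {
    su score if inrange(date, date[`i'] - `w', date[`i'] + `w'), meanonly
    replace window_avg = (r(sum) - score) / (r(N) - 1) in `i'
}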

This paper is also relevant:

Cox, N.J. 2007. Events in intervals. Stata Journal 7: 440-443. http://www.stata-journal.com/sjpdf.html?articlenum=pr0033

If missings are possible, one line needs to be more complicated:

    replace window_avg = (r(sum) - cond(missing(score), 0, score)) / (r(N) - !missing(score)) in `i' 

meaning that if the current value is missing, we subtract 0 from the sum and 0 from the count of observations.
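
As an illustration of the missing-value version, here is a small sketch using the first five observations from the question on integer days, with C's score set to missing as a made-up test case. summarize ignores missing scores when forming r(sum) and r(N), so only the correction for the current observation needs care.

```stata
* first five observations from the question; C's score is deliberately missing
clear
input str1 user_id day score
A 1 1
B 1 2
C 3 .
D 4 3
E 5 3
end

gen window_avg = .

* loop over observations; subtract the current score only if it is not missing
qui forval i = 1/`=_N' {
    su score if inrange(day, day[`i'] - 2, day[`i'] + 2), meanonly
    replace window_avg = (r(sum) - cond(missing(score), 0, score)) / (r(N) - !missing(score)) in `i'
}

list, noobs
```

C still gets an average, namely that of A, B, D, and E, because its own missing score never entered the sum or the count.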

EDIT: For 2 days in milliseconds, exploit the inbuilt function and use cofd(2).
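
A sketch of how that might look with a tc timestamp, under stated assumptions: the variable date is hypothetical and is built here by placing each observation at midnight of its day with cofd(day), purely so that the example is reproducible. Real timestamps would carry times of day as well, and should likewise be stored as double.

```stata
* example data from the question
clear
input str1 user_id day score
A 1 1
B 1 2
C 3 1
D 4 3
E 5 3
F 7 1
G 7 5
H 10 3
end

* hypothetical tc timestamp: midnight at the start of each day, stored as double
gen double date = cofd(day)
format date %tc

* cofd(2) is 2 days in milliseconds (172,800,000)
local w = cofd(2)

gen window_avg = .
qui forval i = 1/`=_N' {
    su score if inrange(date, date[`i'] - `w', date[`i'] + `w'), meanonly
    replace window_avg = (r(sum) - score) / (r(N) - 1) in `i'
}

list user_id date score window_avg, noobs
```

Because every timestamp here sits exactly at midnight, the windows contain the same observations as in the integer-day version, so the averages should match the table in the question.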

Licensed under: CC-BY-SA with attribution