Question

I have a dataset like the one below, and I am trying to take a running total of events 2 and 3, with a slight twist. I only want to count these events when the Event_1_dt is less than the date in the current record. I'm currently using a macro %do loop to iterate through each record for that item type. While this produces the desired results, performance is slower than desirable. Each Item_Type may have up to 1250 records, and there are a couple thousand types. Is it possible to exit the loop before it cycles through all 1250 iterations? I am hesitant to try joins because there are some 30+ events to count up, but I'm open to suggestions. An additional complication is that even though Event_1_dt is always greater than Date, it does not have any other limitations.

Item_Type   Date        Event_1_dt  Event_2_flg Event_3_flg Desired_Event_2_Cnt Desired_Event_3_Cnt
   A        1/1/2014    1/2/2014    1           1           0                   0
   A        1/2/2014    1/2/2014    0           1           0                   0
   A        1/3/2014    1/8/2014    1           0           1                   2
   B        1/1/2014    1/2/2014    1           0           0                   0
   B        1/2/2014    1/5/2014    1           0           0                   0
   B        1/3/2014    1/4/2014    1           1           1                   0
   B        1/4/2014    1/5/2014    0           1           1                   0
   B        1/5/2014    .           1           1           2                   1
   B        1/6/2014    1/7/2014    1           1           3                   2

Corresponding Code:

%macro History;
data y;
set x;
Event_2_Cnt = 0;
Event_3_Cnt = 0;
%do i = 1 %to 1250;
    lag_Item_Type = lag&i(Item_Type);
    lag_Event_2_flg = lag&i(Event_2_flg);
    lag_Event_3_flg = lag&i(Event_3_flg);
    lag_Event_1_dt = lag&i(Event_1_dt);

    if Item_Type = lag_Item_Type and lag_Event_1_dt > . and lag_Event_1_dt < Date then do;

        if lag_Event_2_flg  = 1 then do;
            Event_2_Cnt = Event_2_cnt + 1;
        end;

        if lag_Event_3_flg = 1 then do;
            Event_3_Cnt = Event_3_cnt + 1;
        end;

    end;
%end;

run;

%mend;

%History;


Solution

Well, that's not a trivial task for SAS, but it can still be solved in one DATA step, without merging. You can use hash objects. The idea is as follows. Within each item type, going record by record, we 'collect' event flags into 'bins' in a hash object, where each bin is a certain date. All bins are ordered by date in ascending order. At the same time, we insert the Date of the current record into the same hash (at its place by date) and then iterate 'up' from that place, summing all the bins gathered so far (these have dates earlier than the date of the current record, since we are going up).

Here's the code:

data have;
    informat Item_Type $1. Date Event_1_dt mmddyy9. Event_2_flg Event_3_flg 8.; 
    infile datalines dsd dlm=',';
    format Date Event_1_dt date9.;
    input Item_Type Date Event_1_dt Event_2_flg Event_3_flg;
datalines;
A,1/1/2014,1/2/2014,1,1
A,1/2/2014,1/2/2014,0,1
A,1/3/2014,1/8/2014,1,0
B,1/1/2014,1/2/2014,1,0
B,1/2/2014,1/5/2014,1,0
B,1/3/2014,1/4/2014,1,1
B,1/4/2014,1/5/2014,0,1
B,1/5/2014,,1,1
B,1/6/2014,1/7/2014,1,1
;
run;
proc sort data=have; by Item_Type; run;

data want;
    set have;
    by Item_Type;
    if _N_=1 then do;
        declare hash h(ordered:'a');
        h.defineKey('Event_date','type');
        h.defineData('event2_cnt','event3_cnt');
        h.defineDone();
        declare hiter hi('h');
    end;

    /*for each new Item_type we clear the hash completely*/

    if FIRST.Item_Type then h.clear();

    /*now, if the date of Event 1 exists, we put it into its place  */
    /*(by date) in the ordered hash. If that date is already in the */
    /*hash, we increase the event counts for it by adding the values*/
    /*of the Event2 and Event3 flags. If not, we just assign the    */
    /*current values of these flags.*/

    if not missing(Event_1_dt) then do;
        Event_date=Event_1_dt;type=1;
        rc=h.find();
        event2_cnt=coalesce(event2_cnt,0)+Event_2_flg;
        event3_cnt=coalesce(event3_cnt,0)+Event_3_flg;
        h.replace();
    end;

    /*now we insert the Date of the record into the same ordered hash,*/
    /*setting type=0 to distinguish this item from items where the    */
    /*date is the date of Event1 (not the date of the record)*/

    Event_date=Date;
    event2_cnt=0; event3_cnt=0; type=0;
    h.replace();
    Desired_Event_2_Cnt=0;
    Desired_Event_3_Cnt=0;

    /*now we iterate 'up' from just inserted item, i.e. looping     */
    /*through all items that have date < the date of the record.    */
    /*Items with date = the date of the record will be 'below' since*/
    /*they have type=1 and our hash is ordered by dates first, and  */
    /*types afterwards (1's will be below 0's)*/

    hi.setcur(key:Date,key:0);
    rc=hi.prev();
    do while(rc=0);
        Desired_Event_2_Cnt+event2_cnt;
        Desired_Event_3_Cnt+event3_cnt;
        rc=hi.prev();
    end;
    drop Event_date type rc  event2_cnt event3_cnt;
run;

I can't test it with your real number of rows, but I believe it should be pretty fast, since we loop only through a small hash object, which is entirely in memory, and we do only as many loops for each record as necessary (only earlier events) and don't do any IF-checks.

OTHER TIPS

I don't think a hash is necessary for this; it seems like a simple DATA step will do the trick. This might save you (or the next programmer who comes across your code) from needing to 're-read and do research' in order to understand it.

I think the following will work:

data have;
    informat Item_Type $1. Date Event_1_dt mmddyy9. Event_2_flg Event_3_flg 8.; 
    infile datalines dsd dlm=',';
    format Date Event_1_dt date9.;
    input Item_Type Date Event_1_dt Event_2_flg Event_3_flg;
datalines;
A,1/1/2014,1/2/2014,1,1
A,1/2/2014,1/2/2014,0,1
A,1/3/2014,1/8/2014,1,0
B,1/1/2014,1/2/2014,1,0
B,1/2/2014,1/5/2014,1,0
B,1/3/2014,1/4/2014,1,1
B,1/4/2014,1/5/2014,0,1
B,1/5/2014,,1,1
B,1/6/2014,1/7/2014,1,1
;



data want2 (drop=_: );
    set have; 
    by ITEM_Type;

    length _Alldts_event2 _Alldts_event3 $20000;
    retain _Alldts_event2 _Alldts_event3;

    *Clear _ALLDTS for each ITEM_TYPE;
    if first.ITEM_type then Do;
        _Alldts_event2 = "";
        _Alldts_event3 = "";
    END;

    *If event is flagged, concatenate the Event_1_dt to the ALLDTS variable;
    if event_2_flg = 1 Then _Alldts_event2 = catx(" ", _Alldts_event2,Event_1_dt);
    if event_3_flg = 1 Then _Alldts_event3 = catx(" ", _Alldts_event3,Event_1_dt);
    _numWords2 = COUNTW(_Alldts_event2); 
    _numWords3 = COUNTW(_Alldts_event3);

    *Loop through alldates, count the number that are < the current records date;
    cnt2=0;
    do _i = 1 to _NumWords2;
        _tempDate =  input(scan(_Alldts_event2,_i),Best12.);
        if _tempDate < date Then cnt2=cnt2+1;
    end;

    cnt3=0;
    do _i = 1 to _NumWords3;
        _tempDate =  input(scan(_Alldts_event3,_i),Best12.);
        if _tempDate < date Then cnt3=cnt3+1;
    end;
run;

I believe the Hash may be faster, but you'll have to decide on what tradeoff of comprehensibility/performance is appropriate.
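A possible middle ground, offered only as an untested sketch: keep the plain DATA step, but hold the dates in temporary arrays rather than in a long character string, so nothing has to be parsed back with SCAN and INPUT. This is a different technique from either answer above. The names want3, _dt2, _dt3, _n2 and _n3 are made up for illustration, and the array size of 1250 assumes the per-type record limit stated in the question.

```sas
data want3 (drop=_:);
    set have;
    by Item_Type;

    /* Assumption: at most 1250 records per Item_Type (from the question). */
    /* A larger group would cause an array subscript out of range error.   */
    array _dt2[1250] _temporary_;
    array _dt3[1250] _temporary_;
    retain _n2 _n3 0;

    /* Start over for each Item_Type */
    if first.Item_Type then do;
        _n2 = 0;
        _n3 = 0;
    end;

    /* Store Event_1_dt for flagged records. Missing dates are skipped */
    /* because a missing value compares as less than any date.        */
    if Event_2_flg = 1 and not missing(Event_1_dt) then do;
        _n2 = _n2 + 1;
        _dt2[_n2] = Event_1_dt;
    end;
    if Event_3_flg = 1 and not missing(Event_1_dt) then do;
        _n3 = _n3 + 1;
        _dt3[_n3] = Event_1_dt;
    end;

    /* Count the stored dates that fall before the current record's Date */
    cnt2 = 0;
    do _i = 1 to _n2;
        if _dt2[_i] < Date then cnt2 = cnt2 + 1;
    end;

    cnt3 = 0;
    do _i = 1 to _n3;
        if _dt3[_i] < Date then cnt3 = cnt3 + 1;
    end;
run;

/* Optional check: compare against want2 from the step above */
proc compare base=want2 compare=want3;
    var cnt2 cnt3;
run;
```

Like the string version, this still loops over every stored date for each record, so it is unlikely to beat the hash on large groups; it only removes the text conversion. With 30+ events, one array per event would be needed.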

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow