Question

I have 2 tables as below:

Table 1, a user listing table:

Year  Month Id Type 
2010  3     1  A
2010  5     2  B
2010  10    1  A
2010  12    1  A

Table 2 describes user promotion history:

Promote Date Id
2/20/2010    1
5/20/2010    1     (the user was demoted in 4/2010 and promoted again one month later)

From these 2 tables, I need to produce a result table like table 1, but with an added column that classifies each type A user by whether, at the given date, the most recent promotion was within the last 3 months or more than 3 months ago. For example, the results would be:

Year  Month Id | Duration
2010  3     1  | A < 3 months
2010  10    1  | A > 3 months
2010  12    1  | A > 3 months    

General idea would be:

  • I need to convert the month and year columns from table 1 to a date format like 3/2010
  • subtract the nearest earlier promote date (2/2010 in this example) from the converted date to get the number of days since the user was promoted
  • compare that number to 90 days to classify the promoted duration

There are 2 issues that I'm currently stuck with.

I don't know the best way to convert the month and year columns to a month/year date format.

Assuming I have already converted the month/year columns from table1, I use the MAX function to get the nearest date from table2. As far as I know, MAX is not good for performance, so is there another solution instead of MAX? In MySQL this is easy to solve with LIMIT 1, but SAS PROC SQL does not support LIMIT. Is there an equivalent to LIMIT in PROC SQL? Below is the code I'm currently thinking of.

PROC SQL;
Create table Result as SELECT table1.Year, table1.Month, table1.Code, 
(Case When table1.Type = "B" then "B"
When table1.Type = "A" AND (table1.Date - (Select MAX(table2.Date) From table2 Where table2.Date <= table1.Date AND table2.Id = table1.Id ) < 90) THEN "A < 3 months"
When table1.Type = "A" AND (table1.Date - (Select MAX(table2.Date) From table2 Where table2.Date <= table1.Date AND table2.Id = table1.Id ) >= 90) THEN "A > 3 months"
When table1.Type = "C" then "C"
end) as NewType
From table1
LEFT JOIN
// .... 
;
QUIT;

As you can see, I need to left join table1 with other tables, so I use a subquery, which also performs badly, but I don't know whether there is another way. Help and advice are appreciated.


Solution

You can create a date value from the year and month columns using the mdy() function, like this:

data have;
input Year  Month Id Type $;
datalines;
2010  3     1  A
2010  5     2  B
2010  10    1  A
2010  12    1  A
;
run;

data have;
set have;
format date date9.;
date = mdy(Month, 1, Year);
run;

You don't have a day value, so I just used 1 (every date created is the first of the month).
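The join in the next step reads a second dataset, prom, that is not created in the code shown here. A minimal sketch of how it could be built from the promotion table in the question, assuming the promotion date is stored as a SAS date in a variable named Promote (the dataset and variable names are taken from the join; the informat and format choices are assumptions):

```sas
data prom;
    /* modified list input: read m/d/yyyy text into a SAS date value */
    input Promote :mmddyy10. Id;
    format Promote date9.;
datalines;
2/20/2010 1
5/20/2010 1
;
run;
```

With date9. applied, the promotion dates print as 20FEB2010 and 20MAY2010, the same style as the date column created above.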

Now you can join the two tables by id and calculate the difference between the date in the first table and the promotion date in the second table (here a dataset called prom, with the promotion date in a variable named promote):

proc sql;
    create table want as
    select *
          ,abs(date - promote) as diff
    from have as a
           left join
         prom as b
           on a.id = b.id;
quit;
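One caveat with abs(): it also lets a promotion dated after the listing date count as the nearest one, whereas the query in the question only considers promotions on or before that date (table2.Date <= table1.Date). A hedged variation of the join that keeps only earlier promotions, assuming the have and prom datasets described here; want_prior is a hypothetical name used only for illustration:

```sas
proc sql;
    create table want_prior as
    select a.*
          ,b.promote
          ,a.date - b.promote as diff   /* never negative, so abs() is not needed */
    from have as a
           left join
         prom as b
           on a.id = b.id
          and b.promote <= a.date;      /* ignore promotions after the listing date */
quit;
```

Because the date condition sits in the on clause rather than in a where clause, rows of have without any earlier promotion are still kept, with missing promote and diff. If this variant is used, the sort and data step that follow would read want_prior instead of want.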

After that you sort the resulting table by id, date and diff:

proc sort data=want;
by id date diff; 
run;

After sorting, the dataset looks like this:

Year  Month  Id  Type   date       Promote    diff
---------------------------------------------------
2010  3      1   A      01MAR2010  20FEB2010  9
2010  3      1   A      01MAR2010  20MAY2010  80
2010  10     1   A      01OCT2010  20MAY2010  134
2010  10     1   A      01OCT2010  20FEB2010  223
2010  12     1   A      01DEC2010  20MAY2010  195
2010  12     1   A      01DEC2010  20FEB2010  284
2010  5      2   B      01MAY2010  .          .

Last step: iterate through the dataset and check whether the first diff value for each id and date is below or above 3 months (I just checked against 90 days; you could also use the intck function). Because we sorted the dataset by id, date and diff, the first row of each group is the one nearest to the date, so you output only that row.

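data want2(keep = year month id type duration);
set want;
by id date;  /* must match the sort order: id, then date (then diff) */

/* first.date flags the row with the smallest diff for each id and date */
if first.date and Type = 'A' then do;
    if diff lt 90 then duration = 'A < 3 months';
    else duration = 'A > 3 months';  /* 90 days or more, as in the question */
    output want2;
end;
else if first.date then do;
    duration = type;  /* other types keep their own label */
    output want2;
end;

run;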

The output statements are used because we want to keep only some of the rows (the first one for each by group). The last output is there so that rows with a type value other than A also stay in the final result.

This is the final result:

Year    Month    Id    Type    duration
--------------------------------------------
2010    3        1     A       A < 3 months
2010    10       1     A       A > 3 months
2010    12       1     A       A > 3 months
2010    5        2     B       B
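The data step above compares diff against 90 days. As mentioned, intck could be used instead to count calendar months. A small sketch with two dates from the example, assuming a SAS release that supports the 'continuous' method argument of intck; the variable names are illustrative only:

```sas
data _null_;
    listed   = '01OCT2010'd;   /* listing date from the example */
    promoted = '20MAY2010'd;   /* nearest promotion date for that row */
    /* 'continuous' counts whole months elapsed since promoted,
       rather than the number of month boundaries crossed */
    months = intck('month', promoted, listed, 'continuous');
    put months=;               /* written to the SAS log */
run;
```

In the classification step, a condition such as months lt 3 could then take the place of diff lt 90. Note that the two rules can disagree for dates close to the 3-month mark, since three calendar months are not always exactly 90 days.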
Licensed under: CC-BY-SA with attribution