Question

I'm very new to SAS. I used it last year during my studies, but theoretical knowledge is not the same as hands-on experience.

Here is my problem.

I have tables in SAS that look like the example below:

table1

date    var_1   var_2   var_3   var_4   var_5
1957M1   .       .      .       .       .
1957M2   .       .      .       .       23.5
1957M3   .       1.2    .       .       23.6
1957M4   .       1.3    .       .       23.7
1957M5   .       1.4    .       0.123   23.8
1957M6   .       1.5    .       0.124   23.9
1957M7   .       1.6    3.0     0.125   23.10
1957M8   .       1.7    3.1     0.126   23.11
1957M9   .       1.8    3.2     0.127   23.12
1957M10  2.1     1.9    3.3     0.128   23.13
1957M11  2.2     1.10   3.4     0.129   23.14
1957M12  2.3     1.11   3.5     0.130   23.15

As you may have guessed, each variable is a time series in its own right, and date is a time series as well. All the columns are numeric except the date column, which is character.

My aim is to find, for each of these variables, its start date and its latest date.

var_1 would start in October 1957 (M10) and its latest date would be December 1957 (M12).

var_4 would start in May 1957 (M5) and its latest date would be December 1957 (M12).

I've tried the following in SAS on one column of one table as a test, but it takes a very long time and returns no results.

PROC SQL NOPRINT;
SELECT 
    MIN(input(substr(date,1,4),date4.)),
    MAX(input(substr(date,1,4),date4.))
FROM
table1
WHERE 
var_2 <> "."
quit;

In my query, the date column is text. I'm trying to convert it to a date format, but this way I only get the year, and having the months as well would be great.

My boss suggested PROC FREQ to get the results I want, but I have no idea how to use it for this.

If you have any clues, I'll take them.

Cheers.


Solution

The problem is that your data structure isn't well suited to this task.

A better fit is a more vertical (long) structure, with one row per DATE, VAR, VALUE. Then PROC MEANS is perfect for your needs.

data have;
input date $    var_1   var_2   var_3   var_4   var_5;
datalines;
1957M1   .       .      .       .       .
1957M2   .       .      .       .       23.5
1957M3   .       1.2    .       .       23.6
1957M4   .       1.3    .       .       23.7
1957M5   .       1.4    .       0.123   23.8
1957M6   .       1.5    .       0.124   23.9
1957M7   .       1.6    3.0     0.125   23.10
1957M8   .       1.7    3.1     0.126   23.11
1957M9   .       1.8    3.2     0.127   23.12
1957M10  2.1     1.9    3.3     0.128   23.13
1957M11  2.2     1.10   3.4     0.129   23.14
1957M12  2.3     1.11   3.5     0.130   23.15
;;;; 
run;

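data want;
set have;
array var_[5];   /* references var_1-var_5 */
/* build a real SAS date (first of the month) from text such as 1957M10; */
/* input() makes the character-to-numeric conversion explicit            */
date_num = mdy(input(substr(date,6),2.),1,input(substr(date,1,4),4.));
/* write one row per non-missing value: date, var, value */
do _iter= 1 to dim(var_);
  if not missing(var_[_iter]) then do;
   var = vname(var_[_iter]);
   value = var_[_iter];
   output;
  end;
end;
format date_num MONYY.;
drop _iter;
run;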

proc means data=want;
class var;
var date_num;
output out=edge_dates min= max= /autoname;
run;
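
For a quick check of a single column, the PROC SQL idea from the question can also be made to work. The sketch below is only an illustration, not part of the original answer: it assumes the 'have' dataset created above, builds the date with MDY() and explicit INPUT() conversions, and filters with IS NOT MISSING, since var_2 is numeric and comparing it to the string "." is a type mismatch. The column names start_date and end_date are made up for the example. On the sample data, var_2 should come out as March 1957 through December 1957.

```sas
proc sql;
  select
    /* earliest month in which var_2 has a value */
    min(mdy(input(substr(date, 6), 2.), 1, input(substr(date, 1, 4), 4.)))
      as start_date format=monyy.,
    /* latest month in which var_2 has a value */
    max(mdy(input(substr(date, 6), 2.), 1, input(substr(date, 1, 4), 4.)))
      as end_date format=monyy.
  from have
  where var_2 is not missing;   /* numeric missing test, not a string compare */
quit;
```

This needs one query per column, which is why the long structure plus PROC MEANS scales better to many variables.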

Other tips

If performance is an issue, this is the fastest way I know, as it only needs to read the data once. It relies on the rows already being in date order, as they are in the example. Use Joe's code to create the 'have' dataset:

data want;
  format date_num start1-start5 end1-end5 monyy.;

  set have end=eof;
  retain start1-start5 end1-end5 .;                  * RETAIN THE VALUES WE WILL BE CALCULATING AS WE ITERATE ACROSS ROWS IN THE DATASET;

  array arr_var  [*] var_1-var_5  ;                  * ARRAY FOR EXISTING VARIABLES;
  array arr_start[*] start1-start5;                  * ARRAY FOR NEW VARIABLES THAT WILL CONTAIN START DATE;
  array arr_end  [*] end1-end5    ;                  * ARRAY FOR NEW VARIABLES THAT WILL CONTAIN END DATE;

  date_num = mdy(input(substr(date,6),best.),1,input(substr(date,1,4),best.));

  do iter=1 to dim(arr_var);                         * LOOPING FOR THE NUMBER OF VARIABLES IN ARR_VAR;
    if arr_var[iter] ne . then do;                   * ONLY GOING TO PERFORM CALCS WHEN THE VARIABLE IS NOT MISSING;

      if arr_start[iter] eq . then do;
       arr_start[iter] = date_num;                   * ONLY UPDATE THE START DATE IF IT HASNT ALREADY BEEN SET;
      end;
      arr_end[iter] = date_num;                      * IF ITS NOT MISSING, ALWAYS UPDATE THE END DATE;

    end;
  end;

  if eof then do;
    output;                                          * ONLY OUTPUT THE CALCULATED VALUES ONCE WE HIT THE END OF THE DATASET;
  end;

  keep start: end:;                                  * KEEP ONLY VARS STARTING WITH START OR END;
run;

Normally I wouldn't recommend calculating start and end dates this way unless performance is a consideration, or if this code ends up being simpler than the alternative.

Most of the time you're better off prepping the data structure differently - although there are certainly circumstances where having the data in the above format is advantageous as well.
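
If the single wide row is awkward to read, it can be turned into one row per variable afterwards. This is a minimal sketch, assuming the 'want' dataset produced by the step above and that the original columns are named var_1 to var_5; the names edges_long, var, start_date and end_date are hypothetical.

```sas
data edges_long;
  set want;                              /* the single row of start/end values */
  array arr_start[*] start1-start5;
  array arr_end  [*] end1-end5;
  length var $ 32;
  do i = 1 to dim(arr_start);
    var        = cats('var_', i);        /* rebuild the original column name */
    start_date = arr_start[i];
    end_date   = arr_end[i];
    output;                              /* one row per variable */
  end;
  format start_date end_date monyy.;
  keep var start_date end_date;
run;
```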

Licensed under: CC-BY-SA with attribution