SAS name error when using a variable from a looping macro inside a hash dataset statement

StackOverflow https://stackoverflow.com/questions/23194721

Problem

Any advice on how I can modify this problem code line below to get my datasets named without error?

I have a dataset where I want to match treatment firms (4400) with about 100,000 control firms by 48 industries and 14 years, then nearest size without replacement. My method below may be a bit clunky, but I am learning as I go. I am splitting the treatment and control datasets into 48x14 groups, (after this I will try to run nearest match without replacement code through some type of loop).

I have already used a variant of the hash code below to make 48 datasets, out1...out48. Now I am attempting to further subset each of these 48 datasets into the 14 years using the code below. I am receiving an error on the dataset creation line, where I am trying to create 48x14 datasets named out12004, out22004, out32004, ..., out482013.

Problem code line (about 2/3 down the code below): hh.output (dataset: 'out'||&i||put(year, best.-L)) ;

The SAS errors look like this:

ERROR: The value OUT 12004 is not a valid SAS name.

ERROR: An error has occurred during instance method OM_OUTPUT(505) of "DATASTEP.HASH".

Here is the full code (modified from "Is it possible to loop over SAS datasets?" and SUGI30 paper 236-30):

 %MACRO process_datasets(mdataset);
      data &mdataset.;
      set &mdataset.;
        data _null_ ;
        dcl hash hoh (ordered: 'a') ;
        dcl hiter hih ('hoh' ) ;
        hoh.definekey ('year' ) ;
        hoh.definedata ('year', 'hh' ) ;
        hoh.definedone () ;
        dcl hash hh () ;
        do _n_ = 1 by 1 until ( eof ) ;
        set out&i. end = eof ;
        if hoh.find () ne 0 then do ;
        hh = _new_ hash (ordered: 'a') ;
        hh.definekey ('industry','cik', 'year', '_n_') ;
        hh.definedata ('industry','cik','year','eventdat', 'at', 'roa') ;
        hh.definedone () ;
        hoh.replace () ;
        end ;
        hh.replace() ;
        end ;
        do rc = hih.next () by 0 while ( rc = 0 ) ;
        hh.output (dataset: '_'||&i||put(year, best.-L)) ;
        rc = hih.next() ;
        end ;
        stop ;
     run;

     data _null_;
       file 'tmp.csv' mod dsd dlm=',';  *saving everything to the same file;
       set &mdataset.;
       put (_all_) (+0);
     run;

%MEND process_datasets;

%MACRO loop_through_all;
    %DO i = 1 %to 48;
       %process_datasets(out&i.);
     %END;
%MEND loop_through_all;

Solution

First, a few notes about the technical points in the other answer, i.e. where the problem is not directly coming from, although both are examples of poor coding.

&i is indeed accessible here, although I would suggest it is poor style to use it the way you do. Macro variables that are relied upon in interior macros should be defined as macro parameters; that makes it clear where they came from. Technically, however, this isn't wrong. See this:

%macro caller;
%do i=1 %to 5;
  %called;
%end;
%mend;
%macro called;
%put &i;
%mend called;

%caller;

However, it would be better to make i a parameter, such as %macro called(i=);, to make your macro clearer and more reusable.
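As a minimal sketch of that suggestion, the same caller/called pair can pass the loop index explicitly. The macro names are the illustrative ones from the example above, not anything from your program:

```sas
/* Inner macro: i is now a declared keyword parameter, local to %called */
%macro called(i=);
  %put i=&i;
%mend called;

/* Outer macro: passes its loop index explicitly on every call */
%macro caller;
  %do i = 1 %to 5;
    %called(i=&i)
  %end;
%mend caller;

%caller
```

Written this way, %called can also be run on its own, e.g. %called(i=3), which makes it easier to test a single iteration.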

Second, lack of quotes is in fact not the direct problem, although again it does point to an issue, and adding quotes is a solution in a way. SAS does convert the numerics in that expression to character values (otherwise you'd get a very different error message); however, it does so in a way that is not helpful. The problem is how SAS converts numerics to text: &i is a number (1 in your example). It needs to be converted to "1", and instead it is converted using best12. to "           1", right-aligned with eleven leading blanks. Those embedded blanks are what make the name invalid. The fix closest to what you wrote is to wrap the expression in compress:

hh.output (dataset: compress('_'||&i||put(year, best.-L))) ;

That works. A better implementation would be to intentionally convert to a character value. Macro parameters are very easy to convert: just add " " around them.

hh.output(dataset: cats('_',"&i.",year));

cats strips leading and trailing blanks from each argument, which makes the -L unnecessary. It would work just as well with an unquoted &i, although it's certainly better to add quotes.
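To see the difference outside the hash code, a small standalone step along these lines can help. It is only a sketch: the %let stands in for the macro loop index, and year is hard-coded for illustration.

```sas
%let i = 1;   /* stand-in for the %do loop index (assumption for this demo) */

data _null_;
  year = 2004;
  length bad good $32;
  /* implicit numeric-to-character conversion of &i pads it with leading blanks */
  bad  = '_' || &i || put(year, best.-L);
  /* cats strips leading and trailing blanks from every argument */
  good = cats('_', "&i.", year);
  put bad= / good=;
run;
```

In the log, bad should show blanks between the underscore and the 1, while good should contain no blanks. SAS also writes a note about numeric values being converted to character values for the first assignment.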

I would add that you might consider why you're subsetting these. I don't think there's anything conceptually wrong with doing so, but odds are if you are doing something by year, you can use by year and get away with not subsetting them - keeping them in one dataset per treatment group (and perhaps even by group year?). Further, you might be able to do this in fewer steps. What are you going to do, finally? Let's say you had one dataset for each group/year. What code would you then run? It may be that you can write that in one or a few steps without breaking out 48x14 datasets, which is probably not efficient. If you're interested in finding out, start a new question with the details of what you'd like to do with just a pair of datasets.
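As a rough illustration of the BY-group idea, the sketch below keeps everything in one dataset and processes each industry/year group in turn. The dataset name control is hypothetical, and counting firms per group is only a placeholder for whatever per-group work you actually need:

```sas
/* control is a hypothetical dataset holding all control firms */
proc sort data=control out=control_sorted;
  by industry year;
run;

data group_counts;
  set control_sorted;
  by industry year;
  /* first.year / last.year mark the edges of each industry-year group */
  if first.year then n_in_group = 0;
  n_in_group + 1;                 /* sum statement: value is retained across rows */
  if last.year then output;       /* one output row per industry-year group */
  keep industry year n_in_group;
run;
```

The same first./last. pattern applies to most per-group logic, so the 48x14 split may not be needed at all.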

Other tips

To address your specific point regarding the output line:-

hh.output (dataset: '_'||&i||put(year, best.-L)) ;

A few notes:-

  • The argument for dataset must be quoted, e.g. ..(dataset: "Out123"). Currently, yours is not.
  • The major macro does not have access to the macro variable &i. This is created in the calling macro and is available in that scope only. You could add another parameter to the major macro and send &i that way.

In truth, though, the code you're creating looks very difficult to maintain; a hash in a macro is a nightmare to debug. And if you're producing code that creates lots of output datasets, that should set alarm bells ringing: the by statement in SAS lets you apply criteria to separate groups within a dataset and is definitely preferable to many datasets.

Let's look at the problem: you're matching controls to treatment using variables that will yield multiple control groups per treatment? You would then choose a control per treatment based on some distance criteria?

For the first part, a Proc SQL merge sounds about right. You'll get a long dataset with each treatment firm repeated the number of times it was matched to a control. Then sort by treatment descending [distance criteria] and pick the first one per by group. And that should be that. Of course, I'm sure I've misunderstood something...
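A minimal sketch of that join-then-pick approach might look like the following. The table names treatment and control are hypothetical, the variables cik, industry, year and at are taken from the question, and absolute difference in at is assumed as the size distance. The sort here is ascending on distance so that the closest control comes first in each group.

```sas
/* Hypothetical inputs: treatment and control, each with cik, industry, year, at */
proc sql;
  create table candidates as
  select t.cik            as treat_cik,
         c.cik            as control_cik,
         t.industry,
         t.year,
         abs(t.at - c.at) as size_diff   /* assumed distance: gap in total assets */
  from treatment as t
       inner join control as c
         on  t.industry = c.industry
         and t.year     = c.year;
quit;

/* Closest candidate first within each treatment firm-year */
proc sort data=candidates;
  by treat_cik year size_diff;
run;

/* Keep only the first (nearest) control per treatment firm-year */
data nearest;
  set candidates;
  by treat_cik year;
  if first.year;
run;
```

Note that this picks the nearest control independently for each treatment firm, so the same control can be chosen more than once. Matching without replacement, as the question asks for, would need an extra step that tracks which controls have already been used.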

A final point; you could just look for matching algorithms in SAS, particularly 'optimal matching' or 'greedy matching'. In my experience, SAS is not great for matching, especially if it involves a random element (yours doesn't), but you should be able to find code more useful than what you're working with right now.

I think you may be able to simplify your above code greatly by using wildcards on your set statement instead of using multiple set statements to "loop over them".

For example, the below code would step through all of your datasets that begin with one of the prefixes so that you can work on them without multiple set statements.

data all;
  set out1:
      out2:
      out3:
      out4:
      ;
run;
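If you also need to know which dataset each row came from, one possible variation uses the indsname= option on the set statement. This is a sketch only; the variable names src and source are illustrative.

```sas
data all;
  length source $41;        /* room for libref.member */
  /* out: matches every dataset in the library whose name starts with "out" */
  set out: indsname=src;
  source = src;             /* src is a temporary variable, so copy it to keep it */
run;
```

Bear in mind that a prefix such as out1: also matches out10 through out19, so a single out: prefix is often simpler than listing several overlapping ones.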

That may even allow you to remove the need for a macro which in turn would simplify the code. That existing code looks very difficult to maintain/debug so I think simplifying it is the first step.

License: CC-BY-SA with attribution
Not affiliated with StackOverflow