Question

I have over 3,400 CSV files, with sizes varying between 10 KB and 3 MB. Each CSV file has the generic filename stockticker-Ret.csv, where stockticker is a stock ticker such as AAPL, GOOG, or YHOO, and contains stock returns for every minute of a given day. My SAS code first loads all the stock ticker names from the stockticker-Ret.csv filenames into a SAS dataset. I then loop over each ticker, load the corresponding .csv file into a SAS dataset called want, apply some data steps to want, and append the final want dataset for each ticker to a SAS dataset called global. As you can imagine, this process takes a long time. Is there a way to improve my DO loop code below to make this process go faster?

/*Record in a SAS dataset all the CSV filenames so the stock tickers can be extracted*/
    data yfiles;
    keep filename;
    length fref $8 filename $80;
    rc = filename(fref, 'F:\data\');
    if rc = 0 then do;
        did = dopen(fref);
        rc = filename(fref);
    end;
    else do;
        length msg $200.;
        msg = sysmsg(); put msg=; did = .;
    end;
    if did <= 0 then putlog 'ERR' 'OR: Unable to open directory.';
    dnum = dnum(did);
    do i = 1 to dnum;
        filename = dread(did, i);
        /* If this entry is a file, then output. */
        fid = mopen(did, filename);
        if fid > 0 then output;
    end;
    rc = dclose(did);
    run;

/*store in yfiles all the stock tickers*/
    data yfiles(drop=filename1 rename=(filename1=stock));
    set yfiles;
    filename1=tranwrd(filename,'-Ret.csv','');
    run;

    proc sql noprint;
    select stock into :name separated by '*' from work.yfiles;
    %let count2 = &sqlobs;
    quit;


    *Create the template of the desired GLOBAL SAS dataset;
    proc sql;
    create table global
    (stock char(8), time_gap num(5), avg_ret num(5));
    quit;

    proc sql;
    insert into global
    (stock, time_gap,avg_ret)
    values('',0,0);
    quit;

    %macro y1;
    %do i = 1 %to &count2;
    %let j = %scan(&name,&i,*);
    proc import out = want datafile="F:\data\&j-Ret.csv"
    dbms=csv replace;
    getnames = yes;
    run;


    data want;
    set want; ....

    ....[Here I do 5 Datasteps on the WANT sasfile] 


/*Append the want dataset for this ticker to the global dataset that accumulates all tickers*/

    data global;
    set global want; run;

    %end;
    %mend y1;
    %y1()

As you can see, the global SAS dataset grows with every want dataset that I append to it.


Solution

Assuming the files have a common layout, you should not import them one at a time with PROC IMPORT in a macro loop. Instead, read them all in with a single data step, e.g.:

data want;
length the_file $500;
infile "f:\data\*.csv" dlm=',' lrecl=32767 dsd truncover firstobs=2 filename=the_file;
input myvar1 myvar2 myvar3 myvar4;
stock_ticker=scan(the_file,-1,'\'); *or whatever gets you the ticker name;
run;

Now, if the files don't have identical layouts, or there is some complexity to the read-in, you may need a more complex INPUT statement than that, but you can almost always achieve it this way. Macro loops full of PROC IMPORT calls will always be inefficient because of the overhead of each IMPORT.
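
As a sketch tailored to this question: the actual layout of the -Ret.csv files isn't shown, so the column names time_of_day and ret are assumptions. The EOV= flag is used to drop the header row of every file (FIRSTOBS=2 applies once to the whole wildcard concatenation, not to each file), and the ticker is recovered by stripping the -Ret.csv suffix from the filename:

data want;
length the_file $500 stock $8;
infile "F:\data\*-Ret.csv" dlm=',' lrecl=32767 dsd truncover filename=the_file eov=newfile;
input @;                                               /* look at the record before reading fields  */
if _n_ = 1 or newfile then do;                         /* first record of each file is its header    */
    newfile = 0;
    delete;
end;
input time_of_day :time8. ret;                         /* assumed columns: a timestamp and a return  */
stock = tranwrd(scan(the_file, -1, '\'), '-Ret.csv', '');   /* ticker taken from the filename        */
run;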

If you don't want every .csv file in the folder (and can't write a filename mask for the ones you do want), or if the files have several different layouts, you can use the FILEVAR option to read the files listed in a common dataset. You could then branch into different INPUT statements if needed.

data want;
set yfiles;
infile a filevar=filename;
if filename [some rule] then do;
    input ... ;
end;
else if ... then do;
    input ... ;
end;
run;
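
A fuller version of that pattern might look like the sketch below. It assumes yfiles ends up with a stock variable as in the question's PROC SQL step, that the full path is F:\data\stock-Ret.csv, and that the columns are again the hypothetical time_of_day and ret; the DO WHILE loop with the END= option reads every record of one file before moving on to the next row of yfiles:

data want;
set yfiles;                                        /* one row per ticker, as built above          */
length filepath $200;
filepath = cats('F:\data\', stock, '-Ret.csv');    /* FILEVAR= needs the full path                */
infile dummy filevar=filepath dlm=',' dsd truncover end=done;
input;                                             /* discard this file's header row              */
do while(not done);
    input time_of_day :time8. ret;                 /* assumed columns: a timestamp and a return   */
    output;
end;
run;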
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow