Question

Currently I have a program that processes raw data in SAS, running queries like the following:

/*this code joins the details onto the spine, selecting the detail
 that has the lowest value2 that is greater than or equal to value, or if
 there are none of those, then the highest value2*/
/*our raw data*/
data spine; 
    input id value; 
    datalines;
    1 5
    2 10
    3 6
;
run;

data details;
    input id value2 detail :$16.; /*the default $8 length would truncate 'foobarfoo'*/
    datalines;
    1 6 foo
    1 8 bar
    1 4 foobar
    2 8 foofoo
    2 4 barbar
    3 6 barfoo
    3 2 foobarfoo
;
run;

/*merge details onto spine by id (one-to-many), split into highs and lows*/
proc sort data = spine;
    by id;
run;

proc sort data = details;
    by id;
run;

data lows highs;
    merge spine details;
    by id;
    if value2 ge value then output highs;
    else output lows;
run;

/*grab just the first/last of each set*/
proc sort data =lows;
by id value2;
run;

proc sort data = highs;
by id value2; 
run;

data lows_lasts;
    set lows;
    by id;
    if last.id;
run;

data highs_firsts;
    set highs;
    by id;
    if first.id;
run;

/*join the high value where you can*/
data join_highs;
    merge spine(in = a)
    highs_firsts ;
    by id;
    if a;
run;

/*split into missing and not missing*/
data not_missing still_missing;
    set join_highs;
    if missing(value2) then output still_missing;
    else output not_missing; 
run;

/*if it doesn't already have a detail, then attach a low*/
data join_lows;
    merge still_missing(in = a)
    lows_lasts ;
    by id; 
    if a;
run;

/*append the record that had a high joined, and the record that had a low joined, together*/ 
data results;
    set not_missing join_lows;
run; 

You get the picture. There are a lot of these kinds of data processing statements, which are run on a weekly basis on new records.

We also do data transformation (e.g. cleansing/parsing addresses).

Now, this kind of processing could be done using SQL.

The question is: would this be an appropriate use of a relational database? Or should databases only be used for data storage and retrieval?

Consider that we are talking about tables with up to 10 million rows.

Solution

SAS is designed for exactly the kind of operations you describe. You can certainly do most of this processing in an RDBMS, but the analytical capabilities of a typical RDBMS are limited compared to SAS functions. SAS also has some elegant constructs: for example, the SQL equivalent of SAS First. and Last. processing is more cumbersome. SAS code can also easily be organized into flows and scheduled to run weekly.
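To make the First/Last comparison concrete, here is a rough sketch of what `if first.id` (on data sorted by id and value2) looks like in plain SQL without window functions. It uses Python's built-in sqlite3 module purely as a convenient SQL engine, which is an assumption for illustration and not something from the question; the table and column names follow the `highs` data set above, and the rows are the ones the sample data would put there.

```python
import sqlite3

# In-memory SQLite database, used here only as a stand-in SQL engine.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE highs (id INTEGER, value INTEGER, value2 INTEGER, detail TEXT)"
)
conn.executemany(
    "INSERT INTO highs VALUES (?, ?, ?, ?)",
    [(1, 5, 6, "foo"), (1, 5, 8, "bar"), (3, 6, 6, "barfoo")],
)

# SAS: proc sort by id value2, then a data step with "by id; if first.id;".
# Plain SQL: compute the minimum value2 per id, then join back to the table
# to recover the remaining columns of that row.
first_per_id = conn.execute(
    """
    SELECT h.id, h.value, h.value2, h.detail
    FROM highs AS h
    JOIN (SELECT id, MIN(value2) AS min_value2
          FROM highs
          GROUP BY id) AS m
      ON h.id = m.id AND h.value2 = m.min_value2
    ORDER BY h.id
    """
).fetchall()

print(first_per_id)
```

Note one difference in behaviour: if two rows for the same id tie on the minimum value2, this query returns both of them, whereas `first.id` keeps exactly one.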

The main question is why you would want to use an RDBMS: what advantage would you gain from it? Two potential advantages come to mind:

  • Multiuser usage: an RDBMS's built-in locking and transaction management allow multiple users to access a database simultaneously. I don't think this applies to your situation.
  • Large volumes of records: with billions of records, SAS data sets can grow to a size that is difficult to handle even with the COMPRESS option switched on (speed can also be an issue). In such cases it can be good practice to store the data in an RDBMS and access it through the SAS/ACCESS interface. I don't think this applies to your situation either.

If you go with accessing an RDBMS from SAS, note that you might have to rewrite parts of your code, because some SAS functions do not work against an RDBMS when you connect with the LIBNAME method. FedSQL (the SAS implementation of the ANSI SQL:1999 core standard, introduced in SAS 9.4) addresses this problem.

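Also note that in some special cases you can get different results from SAS than from an RDBMS. For example, SAS treats a missing numeric value as smaller than any number in a comparison, whereas comparing against NULL in ANSI SQL evaluates to unknown.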

Other tips

As you have noted, this type of processing can be done in an RDBMS, but not in just any RDBMS. You would need an RDBMS that complies with a recent ANSI SQL standard, or at least supports window functions with the PARTITION BY clause, e.g. PostgreSQL, Teradata or SQL Server 2012. Note: MySQL is missing from this list because versions before 8.0 do not support window functions.
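As a minimal sketch of the window-function approach, the whole SAS pipeline from the question can be collapsed into one query using ROW_NUMBER() with PARTITION BY. The example below runs it through Python's built-in sqlite3 module, chosen only because it is easy to try out (an assumption, not part of the original answers; it needs SQLite 3.25 or later for window functions). The ordering expressions are one possible way to encode "lowest value2 at or above value, otherwise highest value2".

```python
import sqlite3

# In-memory SQLite database loaded with the sample data from the question.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE spine (id INTEGER, value INTEGER);
    CREATE TABLE details (id INTEGER, value2 INTEGER, detail TEXT);
    INSERT INTO spine VALUES (1, 5), (2, 10), (3, 6);
    INSERT INTO details VALUES
        (1, 6, 'foo'), (1, 8, 'bar'), (1, 4, 'foobar'),
        (2, 8, 'foofoo'), (2, 4, 'barbar'),
        (3, 6, 'barfoo'), (3, 2, 'foobarfoo');
    """
)

# Rank the details within each id:
#   1. rows with value2 >= value come first ("highs"), ordered by value2 ascending;
#   2. the remaining rows ("lows") follow, ordered by value2 descending.
# Keeping rank 1 gives the lowest high if one exists, else the highest low.
# The LEFT JOIN keeps spine rows that have no details at all.
query = """
SELECT id, value, value2, detail
FROM (
    SELECT s.id, s.value, d.value2, d.detail,
           ROW_NUMBER() OVER (
               PARTITION BY s.id
               ORDER BY
                   CASE WHEN d.value2 >= s.value THEN 0 ELSE 1 END,
                   CASE WHEN d.value2 >= s.value THEN d.value2 ELSE -d.value2 END
           ) AS rn
    FROM spine AS s
    LEFT JOIN details AS d ON d.id = s.id
) AS ranked
WHERE rn = 1
ORDER BY id
"""
results = conn.execute(query).fetchall()

for row in results:
    print(row)
```

For the sample data this selects foo for id 1, foofoo for id 2 (no value2 reaches 10, so the highest low is used) and barfoo for id 3, which is what the SAS steps in the question are meant to produce.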

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow