Question

Is it possible to do a SAS hash lookup on a partial substring?

For example, the hash table key will contain 'LongString', but my target table key has 'LongStr'.

(The length of the target table key string may vary.)

Solution

You can, but it's not pretty and you may not get the performance benefits you're looking for. Also, depending on the length of your strings and the size of your table, you may not be able to fit all of the hash table elements into memory.

The trick is to first generate all of the possible leading substrings of each key and then to use the 'multidata' option on the hash table, so that one abbreviation can map to several keys.

Create a dataset containing words we want to match against:

data keys;
  length key $10;
  input key;
  cards;
LongString
LongOther
;
run;

Generate all possible substrings:

data abbreviations;
  length abbrev $10;
  set keys;
  do cnt=1 to length(key);
    abbrev = substr(key,1,cnt);
    output;
  end;
run;
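
Since memory was mentioned as a concern, it may help to estimate the size of the abbreviations table before building it. The following is only a rough sketch: it assumes the keys dataset created above, and the column names n_keys and n_abbreviations are made up for illustration. Each key contributes one row per character, so the row count is the sum of the key lengths.

```sas
/* Sketch: estimate the number of rows in the abbreviations table. */
/* Assumes the keys dataset from above; column names are made up.  */
proc sql;
  select count(*)         as n_keys,
         sum(length(key)) as n_abbreviations
  from keys;
quit;
```

For the two sample keys this gives 10 + 9 = 19 abbreviation rows.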

Create a dataset containing terms we want to search for:

data match_attempts;
  length abbrev  $10;
  input abbrev ;
  cards;
L
Long
LongO
LongSt
LongOther
;
run;

Perform the lookup:

data match;
  length abbrev key $10;

  set match_attempts;

  if _n_ = 1 then do;
    declare hash h1(dataset:'abbreviations', multidata: 'y');
    h1.defineKey('abbrev');
    h1.defineData('abbrev', 'key');
    h1.defineDone();

    /* initialise key only: abbrev has already been read by the set */
    /* statement, so clearing it here would lose the first lookup   */
    call missing(key);
  end;

  /* find() returns the first match, find_next() walks the duplicates */
  if h1.find() eq 0 then do;
    output;
    h1.has_next(result: r);
    do while(r ne 0);
      h1.find_next();
      output;
      h1.has_next(result: r);
    end;
  end;

  drop r;
run;

Output (notice how 'L' and 'Long' each return 2 matches):

Obs abbrev    key
=== ========= ==========
1   L         LongString
2   L         LongOther
3   Long      LongString
4   Long      LongOther
5   LongO     LongOther
6   LongSt    LongString
7   LongOther LongOther

A few more notes. The hash table cannot support something like the like operator because it 'hashes' the key before inserting a record into the table. When a lookup is performed, the lookup value is 'hashed' as well and the match is made on the hashed values. When a value is hashed, even a small change in the value yields a completely different result. The example below uses md5() purely as an illustration (it is not necessarily the function the hash object uses internally): hashing 2 almost identical strings yields 2 completely different values:

data _null_;
  length hashed_value $16;
  hashed_value = md5("String");
  put hashed_value= $hex32.;  /* character value, so use $hex32. */
  hashed_value = md5("String1");
  put hashed_value= $hex32.;
run;

Output:

hashed_value=27118326006D3829667A400AD23D5D98
hashed_value=0EAB2ADFFF8C9A250BBE72D5BEA16E29

For this reason, the hash table cannot use the like operator.
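
The same point can be seen directly on a hash object. The snippet below is a minimal sketch, not part of the original answer: the one-entry table and the variable names rc_partial and rc_full are made up for illustration. It stores the full key and then calls check() with a partial key and with the full key.

```sas
data _null_;
  length key $10;

  /* hypothetical one-entry hash table keyed on the full string */
  declare hash h();
  h.defineKey('key');
  h.defineData('key');
  h.defineDone();

  key = 'LongString';
  rc = h.add();

  key = 'LongStr';         /* partial key */
  rc_partial = h.check();

  key = 'LongString';      /* full key */
  rc_full = h.check();

  put rc_partial= rc_full=;
run;
```

check() returns 0 only when the key is found, so rc_full should be 0 while rc_partial should be non-zero.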

Finally, thanks to @vasja for some sample data.

OTHER TIPS

You have to use a hash iterator (hiter) object to loop through the keys and do the matching yourself.

data keys;
  length key $10 value $1;
  input key value;
  cards;
LongString A
LongOther B
;
run;

proc sort data=keys;
  by key;
run;

data data;
  length short_key $10;
  input short_key;
  cards;
LongStr
LongSt
LongOther
LongOth
LongOt
LongO
LongSt
LongOther
;
run;

data match;
  set data;
  /* key uses the same length as in work.keys */
  length key $10 outvalue value $1;
  drop key value rc;
  if _N_ = 1 then do;
    call missing(key, value);
    declare hash h1(dataset:"work.keys", ordered: 'yes');
    declare hiter iter('h1');
    h1.defineKey('key');
    h1.defineData('key', 'value');
    h1.defineDone();
  end;
  rc = iter.first(); /* reset to beginning */
  do while (rc = 0); /* loop through the long keys and find a match */
    if index(key, trim(short_key)) > 0 then do;
      outvalue = value;
      iter.last(); /* jump to the last item so the next() below ends the loop */
    end;
    rc = iter.next();
  end;
run;
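
Note that index() matches the short key anywhere inside the long key, not only at the start. If only leading substrings should count, the =: (begins with) comparison could be used in the loop instead. A small self-contained sketch of the difference, using made-up values:

```sas
data _null_;
  length key short_key $10;
  key = 'LongString';
  short_key = 'String';

  /* index() finds the text anywhere inside key */
  anywhere = (index(key, trim(short_key)) > 0);

  /* =: compares only the leading characters, up to the shorter length */
  prefix = (key =: trim(short_key));

  put anywhere= prefix=;
run;
```

Here 'String' occurs inside 'LongString' but not at its start, so anywhere should be 1 and prefix should be 0.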
Licensed under: CC-BY-SA with attribution