Is it possible to do a case-insensitive DISTINCT with SAS (PROC SQL)?

https://stackoverflow.com/questions/924513

06-09-2019
|

Question

Is there a way to get the case-insensitive distinct rows from this SAS SQL query? ...

SELECT DISTINCT country FROM companies;

The ideal solution would consist of a single query.

Results now look like:

Australia
australia
AUSTRALIA
Hong Kong
HONG KONG

... where any of the 2 distinct rows is really required

One could upper-case the data, but this unnecessarily changes values in a manner that doesn't suit the purpose of this query.

Solution

If you have some primary int key (let's call it ID), you could use:

SELECT country FROM companies
WHERE id =
(
    SELECT Min(id) FROM companies
    GROUP BY Upper(country)
)

OTHER TIPS

Normalizing case does seem advisable -- if 'Australia', 'australia' and 'AUSTRALIA' all occur, which one of the three would you want as the "case-sensitively unique" answer to your query, after all? If you're keen on some specific heuristics (e.g. count how many times they occur and pick the most popular), this can surely be done but might be a huge amount of extra work -- so, how much is such persnicketiness worth to you?

A non-SQL method (really only a single step as the data step just creates a view) would be:


data companies_v /view=companies_v;
  set companies (keep=country);
  _upcase_country = upcase(country);
run;

proc sort data=companies_v out=companies_distinct_countries (drop=_upcase_country) nodupkey noequals;
  by _upcase_country;
run;

Maybe I'm missing something, but why not just:

data testZ;
    input Name $;
    cards4;
Bob
Zach
Tim
Eric
Frank
ZacH
BoB
eric
;;;;
run;

proc sql;
    create view distinctNames as
    select distinct Upper(Name) from testz;
quit;

This creates a view with only distinct names as row values.

I was thinking along the same lines as Zach, but thought I would look at the problem with a more elaborate example,

proc sql;
    CREATE TABLE contacts (
        line1 CHAR(30), line2 CHAR(30), pcode CHAR(4)
    );
    * Different versions of the same address - L23 Bass Plaza 2199;
    INSERT INTO contacts values('LEVEL 23 bass', 'plaza'  '2199');
    INSERT INTO contacts values('level 23 bass ', ' PLAZA'  '2199');

    INSERT INTO contacts values('Level 23', 'bass plaza'  '2199');
    INSERT INTO contacts values('level 23', 'BASS plaza'  '2199');

    *full address in line 1;
    INSERT INTO contacts values('Level 23 bass plaza', ''  '2199');
    INSERT INTO contacts values(' Level 23 BASS plaza  ', ''  '2199');

;quit;

Now we can output
i. One from each category? Ie three addresses ?
OR
ii. Or just one address ? if so which version should we prefer ?

Implementing case 1 can be as simple as :

proc sql;
    SELECT DISTINCT UPCASE(trim(line1)), UPCASE(trim(line2)), pcode 
    FROM contacts 
;quit;

Implementing case 2 can be as simple as:

proc sql;
    SELECT DISTINCT UPCASE( trim(line1) || ' ' || trim(line2) ) , pcode 
    FROM contacts 
;quit;

From SAS 9:

proc sort data=input_ds sortseq=linguistic(strengh=primary);

  by sort_vars;

run;

I think Regular expressions can help you out with the pattern you want to have in your search string.

For the regular expression you can define a UDF which can be prepared seeing the tutorial. www.sqlteam.com/article/regular-expressions-in-t-sql

Thanks.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow