Optimal ORACLE function parsing

Question

This isn't really the kind of direction you were asking for, but if possible I'd forget about the function, which will incur context switching at the very least and probably quite a lot of other overhead, and try to do something in native SQL. I'd also change your look-up table, as having multiple columns isn't very flexible. I'd have a single standard-abbreviation pair in each row:

STD_VALUE            ABBREVIATION
-------------------- ------------
Bar                  Ba           
Company              Co           
Company              Cmpy         
Foo                  Fo           
Limited              Ltd          
Limited              Lim          
Limited              LMT
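
If it helps, here is a minimal sketch of how that look-up table could be created and populated. The table and column names match the listing above; the column sizes and the primary key on abbreviation are my assumptions (the key just guards against one abbreviation mapping to two standard values), so adjust them to suit your data.

```sql
-- Sketch only: column sizes and the constraint are assumptions, not taken from your schema.
create table std_gbr_co_suffix (
  std_value    varchar2(20) not null,
  abbreviation varchar2(12) not null,
  -- one row per abbreviation, so each abbreviation maps to exactly one standard value
  constraint std_gbr_co_suffix_pk primary key (abbreviation)
);

-- The same seven standard-abbreviation pairs shown in the listing above
insert into std_gbr_co_suffix (std_value, abbreviation) values ('Bar', 'Ba');
insert into std_gbr_co_suffix (std_value, abbreviation) values ('Company', 'Co');
insert into std_gbr_co_suffix (std_value, abbreviation) values ('Company', 'Cmpy');
insert into std_gbr_co_suffix (std_value, abbreviation) values ('Foo', 'Fo');
insert into std_gbr_co_suffix (std_value, abbreviation) values ('Limited', 'Ltd');
insert into std_gbr_co_suffix (std_value, abbreviation) values ('Limited', 'Lim');
insert into std_gbr_co_suffix (std_value, abbreviation) values ('Limited', 'LMT');

commit;
```

Adding a new abbreviation is then a single insert rather than a change to the table structure.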

This seems to work, at least for some basic test cases, but you would need to be on 11gR2 so that you have both the listagg function and recursive subquery factoring (a.k.a. recursive CTE or recursive with). I very much applaud the sentiment that you don't want someone to just write the solution, but I had fun figuring it out and it seems a shame to waste it. I've tried to explain what's going on below; hopefully you can still learn from that. And it will probably need some modification, so you'll need to understand it...

with p as (
  select ' $'|| upper( std_value) || '$ ' as replace_string,
    listagg('(^|\s)' || abbreviation || '(\s|$)', '|')
      within group (order by length(abbreviation) desc) as pattern,
    rank() over (order by std_value) as rn
  from std_gbr_co_suffix
  group by std_value
),
r (company_name, replacements, rn, std_company_name) as (
  select company_name, 0, 0, company_name from very_large_table
  union all
  select r.company_name, r.replacements + 1, p.rn, 
    regexp_replace(r.std_company_name, p.pattern, p.replace_string)
  from r
  join p on p.rn > r.rn and regexp_like(r.std_company_name, p.pattern)
),
t as (
  select company_name, std_company_name,
    rank() over (partition by company_name order by replacements desc) as rn
  from r
)
select company_name, std_company_name
from t
where rn = 1;

SQL Fiddle demo with a few test cases; I'd be interested to know how it fares against your 1000-odd test items, both in terms of correctness and performance.
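
To try the query outside the fiddle, something like this sketch would set up the dummy names I used. The very_large_table name and the five values are the ones that appear in the results further down; the column size is an assumption.

```sql
-- Sketch only: a small stand-in for your real table; the column size is an assumption.
create table very_large_table (
  company_name varchar2(30)
);

-- The five dummy names used in the result listings
insert into very_large_table (company_name) values ('Oracle Co');
insert into very_large_table (company_name) values ('Oracle Ltd');
insert into very_large_table (company_name) values ('Oracle Ltd.');
insert into very_large_table (company_name) values ('Oracle Co Ltd.');
insert into very_large_table (company_name) values ('Oracle Co Ltd Cmpy LMT');

commit;
```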

Breaking it down a bit, the first CTE p:

p as (
  select ' $'|| upper( std_value) || '$ ' as replace_string,
    listagg('(^|\s)' || abbreviation || '(\s|$)', '|')
      within group (order by length(abbreviation) desc) as pattern,
    rank() over (order by std_value) as rn
  from std_gbr_co_suffix
  group by std_value
)

... generates the pattern and replacement strings for each distinct std_value (which is column1 in your original table):

REPLACE_STRING                   RN PATTERN                                          
------------------------ ---------- --------------------------------------------------
 $BAR$                            1 (^|\s)Ba(\s|$)                                     
 $COMPANY$                        2 (^|\s)Cmpy(\s|$)|(^|\s)Co(\s|$)                    
 $FOO$                            3 (^|\s)Fo(\s|$)                                     
 $LIMITED$                        4 (^|\s)LMT(\s|$)|(^|\s)Lim(\s|$)|(^|\s)Ltd(\s|$)

The second CTE r is the recursive one:

r (company_name, replacements, rn, std_company_name) as (
  select company_name, 0, 0, company_name from very_large_table
  union all
  select r.company_name, r.replacements + 1, p.rn, 
    regexp_replace(r.std_company_name, p.pattern, p.replace_string)
  from r
  join p on p.rn > r.rn and regexp_like(r.std_company_name, p.pattern)
)

It starts with the original values from your table as the anchor member, and then recursively applies each pattern from p. (The p.rn > r.rn condition in the join makes it apply patterns in one direction only, in case an original name matches more than one; without it you'd do more work than necessary and end up with duplicate results.) For my dummy data that gives:

COMPANY_NAME                   REPLACEMENTS         RN STD_COMPANY_NAME                                 
------------------------------ ------------ ---------- --------------------------------------------------
Oracle Co                                 0          0 Oracle Co                                          
Oracle Ltd                                0          0 Oracle Ltd                                         
Oracle Ltd.                               0          0 Oracle Ltd.                                        
Oracle Co Ltd.                            0          0 Oracle Co Ltd.                                     
Oracle Co Ltd Cmpy LMT                    0          0 Oracle Co Ltd Cmpy LMT                             
Oracle Co                                 1          2 Oracle $COMPANY$                                   
Oracle Ltd                                1          4 Oracle $LIMITED$                                   
Oracle Co Ltd Cmpy LMT                    1          2 Oracle $COMPANY$ Ltd $COMPANY$ LMT                 
Oracle Co Ltd Cmpy LMT                    1          4 Oracle Co $LIMITED$ Cmpy $LIMITED$                 
Oracle Co Ltd.                            1          2 Oracle $COMPANY$ Ltd.                              
Oracle Co Ltd Cmpy LMT                    2          4 Oracle $COMPANY$ $LIMITED$ $COMPANY$ $LIMITED$

You can see that an original name can appear several times: once with its original value, and again for each set of substitutions applied; specifically here:

Oracle Co Ltd Cmpy LMT                    0          0 Oracle Co Ltd Cmpy LMT                             
Oracle Co Ltd Cmpy LMT                    1          2 Oracle $COMPANY$ Ltd $COMPANY$ LMT                 
Oracle Co Ltd Cmpy LMT                    1          4 Oracle Co $LIMITED$ Cmpy $LIMITED$                 
Oracle Co Ltd Cmpy LMT                    2          4 Oracle $COMPANY$ $LIMITED$ $COMPANY$ $LIMITED$

The replacements value tracks how many regexes matched, so if several did we can pick out the row with the most. For an original value that has both Company and Limited patterns, there's one row for each of the individual replacements, and one that has both applied (again, only one, because of the rn check).

The final CTE t (yes, I struggled to give these meaningful names) just determines which row has the most replacements for each; each original gets one row ranked as 1:

t as (
  select company_name, std_company_name,
    rank() over (partition by company_name order by replacements desc) as rn
  from r
)

... and finally we exclude everything that wasn't ranked first, which leaves:

COMPANY_NAME                   STD_COMPANY_NAME                                 
------------------------------ --------------------------------------------------
Oracle Co                      Oracle $COMPANY$                                   
Oracle Co Ltd Cmpy LMT         Oracle $COMPANY$ $LIMITED$ $COMPANY$ $LIMITED$     
Oracle Co Ltd.                 Oracle $COMPANY$ Ltd.                              
Oracle Ltd                     Oracle $LIMITED$                                   
Oracle Ltd.                    Oracle Ltd.

You might be able to do something with the model clause instead, but I'm not very familiar with that...

If you do want to stick with a function, if only to learn how this could be done, then I don't think you want a ref cursor, but a bulk collect wouldn't hurt. You can declare a table type within your package that you populate once (per session), and then refer to that when you apply the regexp_replace calls in your function.

Again, this is probably more complete than you really wanted, but there isn't that much to strip out... unless you stop now and just go and read up on PL/SQL collections, and what makes a package stateful, in the PL/SQL Language Reference.

Your package specification can declare a record type to hold a regex pattern and replacement string, and a table of those records. And, crucially, a package-level variable of that table type. That will make the package stateful, and the state will be associated with your session - different people running this would have their own copy of that table variable.

create or replace package p42 as
  type regex_rec_type is record(pattern varchar2(4000),
    replace_string varchar2(50));
  type regex_tab_type is table of regex_rec_type;

  regexes regex_tab_type;

  function standardise_company_suffix(company_name in varchar2)
  return varchar2;
end p42;
/

The package body is in two parts; your function and an initialisation block:

create or replace package body p42 as
  function standardise_company_suffix(company_name in varchar2)
  return varchar2 is
    std_company_name varchar2(4000);
  begin
    std_company_name := company_name;
    -- 1..count is safe if the look-up table is empty; first..last would have null bounds
    for i in 1..regexes.count loop
      std_company_name := regexp_replace(std_company_name,
        regexes(i).pattern, regexes(i).replace_string);
    end loop;
    return std_company_name;
  end standardise_company_suffix;

begin
  select listagg('(^|\s)' || abbreviation || '(\s|$)', '|')
      within group (order by length(abbreviation) desc),
    ' $'|| upper( std_value) || '$ '
  bulk collect into regexes
  from std_gbr_co_suffix
  group by std_value;
end p42;
/

The block first: it runs the same listagg query I used before, against the same value-pair table I used in place of yours, so it gets the same patterns and replacement strings. These are put into the regexes variable declared in the package specification. That happens once per session, when the package is first referenced.

The function starts with the original company name and loops over the collection of regex records, applying each one in turn, and then returns the final result. Pretty straightforward, but I'll leave you to look in the PL/SQL reference to see exactly what it's doing. I haven't bothered with the regexp_instr check, as I think it will be more expensive than just applying the replacement, but I'm not sure, so it's probably worth experimenting with that.
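
If you do want to experiment with a guard, this is a rough sketch of what the loop might look like, written as an anonymous block against the public p42.regexes collection so it can be run on its own once the package exists. I've used regexp_like rather than regexp_instr here purely because it reads more simply; the l_name variable and its sample value are hypothetical, and I haven't measured whether the check helps or hurts.

```sql
-- Sketch only: assumes the p42 package above has been compiled.
-- Run with serveroutput enabled to see the result.
declare
  l_name varchar2(4000) := 'Oracle Co Ltd';  -- hypothetical sample value
begin
  for i in 1..p42.regexes.count loop
    -- Only call regexp_replace when the pattern is actually present
    if regexp_like(l_name, p42.regexes(i).pattern) then
      l_name := regexp_replace(l_name,
        p42.regexes(i).pattern, p42.regexes(i).replace_string);
    end if;
  end loop;
  dbms_output.put_line(l_name);
end;
/
```

Each matching pattern is now evaluated twice, once for the check and once for the replacement, so this is only likely to pay off if most names match few of the patterns; timing both versions against your real data is the only way to know.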

With the same dummy data:

select company_name,
  p42.standardise_company_suffix(company_name) as std_company_name
from very_large_table;

COMPANY_NAME                   STD_COMPANY_NAME                                 
------------------------------ --------------------------------------------------
Oracle Co                      Oracle $COMPANY$                                   
Oracle Ltd                     Oracle $LIMITED$                                   
Oracle Ltd.                    Oracle Ltd.                                        
Oracle Co Ltd.                 Oracle $COMPANY$ Ltd.                              
Oracle Co Ltd Cmpy LMT         Oracle $COMPANY$ $LIMITED$ $COMPANY$ $LIMITED$

Although the code for this is arguably much simpler, I would expect it to be considerably slower over a large data set. It should at least minimise the overhead, as you were aiming to do: you'll still have context switches, but you're no longer switching back to SQL within the PL/SQL, which will help. Again I'd be interested in a performance comparison against what you have now, and against the recursive CTE version.