CASE with regexp: “set-valued function called in context that cannot accept a set”

https://stackoverflow.com/questions/12795306

06-07-2021
|

Question

I'm trying to do a slightly complicated string conversion in PostgreSQL 9.0.1. The values in my_col are long strings in the format of:

'12345_sometext_X12B_1'
'12345_sometext_optionaltext_Y09B_1'
'12345_sometext_optionaltext_X12A_1'

I need to translate the 'X12' portion to a known numeric value, there are a few different known values (up to 5).

I expect to be able to determine this within one query without needing a subquery. However, the following isn't working for me. The last column is the one throwing the exception. It seems I cannot do the CASE statement using the output of these functions combined for some reason. I've included the proceeding columns for demonstrational purposes only.

select
          regexp_matches(my_col, E'^.*_([^_]*)[A-Z]{1}_\\d*$'), -- returns {'X12'}
         (regexp_matches(my_col, E'^.*_([^_]*)[A-Z]{1}_\\d*$'))[1], -- returns 'X12'
    case (regexp_matches(my_col, E'^.*_([^_]*)[A-Z]{1}_\\d*$'))[1]
        when 'X12' then '1200'
        when 'Y09' then '950'
        else '?' end -- should return '1200' but throws error
from my_table;

Instead I get the error:

ERROR: set-valued function called in context that cannot accept a set
SQL state: 0A000

Can someone advise me?

Solution

Given data:

create table my_table(my_col text);
insert into my_table(my_col) values
('12345_sometext_X12B_1'),
('12345_sometext_optionaltext_Y09B_1'),
('12345_sometext_optionaltext_X12A_1'),
('nomatch');

the above query does produce the error you report. Very odd, since:

SELECT pg_typeof((regexp_matches(my_col, E'^.*_([^_]*)[A-Z]{1}_\\d*$'))[1]);

returns 'text'. It should really say setof text though, and that's the trap: regex_matches is a set returning function. Those have ... interesting ... behaviour when called outside the FROM clause in PostgreSQL.

From pattern matching:

The regexp_matches function returns a text array of all of the captured substrings resulting from matching a POSIX regular expression pattern. It has the syntax regexp_matches(string, pattern [, flags ]). The function can return no rows, one row, or multiple rows

Try reformulating your query to use a subquery to call the SRF. This'll fail if more than one row is returned by the matcher, though:

SELECT 
  CASE (SELECT x[1] FROM regexp_matches(my_col, E'^.*_([^_]*)[A-Z]{1}_\\d*$') x)
    WHEN 'X12' THEN '1200'
    WHEN 'Y09' THEN '950'
    ELSE '?'
  END
FROM my_table;

Want to see how weird SRFs in SELECT are in Pg? Compare the results of these queries:

SELECT generate_series(1,10), generate_series(1,15);

and:

SELECT generate_series(1,10), generate_series(1,20);

The 1st produces 30 rows. The 2nd produces 20. Have fun explaining why. More than one SRF in a SELECT list in Pg produces crazy if occasionally useful results.

PostgreSQL 9.3 supports the SQL-standard LATERAL clause thanks to Tom Lane, which provides a sane and well defined alternative to the current behaviour.

OTHER TIPS

regexp_matches() returns SETOF text[] (a set of text arrays), which is useful for multiple matches for one pattern in the same string. But it is simply the wrong tool for this task.

Use substring() with a regular expression instead. It returns text. Reusing the demo table in @Craig's answer:

SELECT CASE substring(my_col, '^.*_([^_]*)[A-Z]_\d*$')
         WHEN 'X12' THEN '1200'
         WHEN 'Y09' THEN '950'
         ELSE            '?'
       END As result
FROM   my_table;

Returns:

 result
--------
 1200
 950
 1200
 ?

Also slightly simplified the regular expression. {1} was just noise.

If you need to optimize performance try to make do without regular expressions, which are powerful but comparatively expensive. Something like:

reverse(right(split_part(reverse(my_col), '_', 2), -1))

Looks more complex, but is still faster in my tests.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow