Question

  • I'm using amazon redshift as my data warehouse
  • I have a field (field1)of type string. Some of the strings start with four numbers and others with letters:

'test alpha'
'1382 test beta'

  • I want to filter out rows where the string does not start with four numbers
  • Looking at the redshift documentation, I don't believe isnumber or isnumeric are functions. It seems that the 'like' function is the best possibility.
  • I tried

    where left(field1, 4) like '[0-9][0-9][0-9][0-9]'

this did not work and from the link below seems like redshift may not support that:

https://forums.aws.amazon.com/message.jspa?messageID=439850

is there an error in the 'where' clause? if not and that clause isn't supported in redshift, is there a way to filter? I was thinking of using cast

cast(left(field1,4) as integer) 

and then passing over the row if it generated an error, but not sure how to do this in amazon redshift. or is there some other proxy for the isnumeric filter.

thanks

Was it helpful?

Solution 4

It seems that redshift doesn't support any of the following:

where left(field1,4) like '[0-9][0-9][0-9][0-9]' 
where left(field1,4) ~ '^[0-9]{4}'
where left(field1,4) like '^[0-9]{4}'

what does seem to work is:

where left(field1,4) between 0 and 9999

this returns all rows that start with four numeric characters.

it seems that even though field1 is type string, the 'between' function interprets left(field1,4) as a single integer when the string characters are numeric (and does not give an error when they are not numeric). I'll follow up if I find a problem. For instance I don't deal with anything less than 1000, so I assume, but am not sure, that 0001 is interpreted as 1.

OTHER TIPS

Try something like:

where field1 ~ '^[0-9]{4}'

It will match any string, that starts with 4 digits.

Although long time has passed since this question was asked I have not found an adequate response. So I feel obliged to share my solution which works fine on my Redshift cluster today (March 2016).

The UDF function is:

create or replace function isnumeric (aval VARCHAR(20000))
  returns bool
IMMUTABLE 
as $$
    try:
       x = int(aval);
    except:
       return (1==2);
    else:
       return (1==1);
$$ language plpythonu;

Usage would be:

select isnumeric(mycolumn), * from mytable
    where isnumeric(mycolumn)=false

It looks like the code you are looking for the is the similar to function:

where left(field,4) similar to '[0-9]{4}'

Redshift doc

Per Amazon, the posix style ~regex style expressions are slow... https://docs.aws.amazon.com/redshift/latest/dg/pattern-matching-conditions.html

Using their own REGEXP_* functions seems to be faster. https://docs.aws.amazon.com/redshift/latest/dg/String_functions_header.html

For checking just a true/false for integers I've been using the following with success. REGEXP_COUNT(my_field_to_check, '^[0-9]+$') > 0

this returns 1 if only numeric, 0 if anything else

where regexp_instr(field1,'^[0-9]{4}') = 0

will remove rows starting with 4 digits (the above regexp_instr will return 1 for the rows with field1 starting with 4 digits)

We have tried the following and worked for most of our scenarios:

columnn ~ '^[-]{0,1}[0-9]{1,}[.]{0,1}[0-9]{0,}$'

This will positive, negative, integer and float numbers.

redshift should support similar to.

WHERE field1 SIMILAR TO '[0-9]{4}%'

This reads as where field1 starts with 4 characters in the range of 0-9, then anything else.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top