Question

I am trying to separate a city/state/zip field into the city, state, and zip. Normally I would do this with charindex of ',' to get the city and state, and isnumeric and right() for the zip.

This will work fine for the zip, but most of the rows in the data I am working with now have no commas (for example, City ST Zip). Is there a way to identify the index of two upper-case characters?

If not, does anybody have a better idea than just a case statement checking for each state individually?

EDIT: I found the PATINDEX/COLLATE option to work fairly intermittently. See my answer below.


Solution 5

I found the PATINDEX/COLLATE option to work fairly intermittently. Here is what I ended up doing:

--get rid of the sparsely used commas
--get rid of the duplicate spaces
update MyTable set
    CityStZip= 
        replace(
            replace(
                replace(CityStZip,'   ',' '),
                '  ',' '),
            ',','')

select
    --check if state and zip are there and then grab the city
    case when isNumeric(right(CityStZip,1))=1
            then left(CityStZip,len(CityStZip)-charindex(' ',reverse(CityStZip),
                                        charindex(' ',reverse(CityStZip))+1))
        --no zip. check for state
        when left(right(CityStZip,3),1) = ' '
            then left(CityStZip,len(CityStZip)-charIndex(' ',reverse(CityStZip)))
        else CityStZip
        end as City,
    --check if zip is there and then grab the state
    case when isNumeric(right(CityStZip,1))=1
            then substring(CityStZip,
                    len(CityStZip)-charindex(' ',reverse(CityStZip),
                                                charindex(' ',reverse(CityStZip))+1)+2,
                    2)
        --no zip. check if 3rd to last char is a space and grab the last two chars
        when left(right(CityStZip,3),1) = ' '
            then right(CityStZip,2)
        end as [State],
    --grab everything after the last space if the last character is numeric
    case when isNumeric(right(CityStZip,1))=1
            then substring(CityStZip,
                    len(CityStZip)-charindex(' ',reverse(CityStZip))+2,
                    charindex(' ',reverse(CityStZip)))
        end as Zip
from MyTable

OTHER TIPS

PATINDEX should work for you:

PATINDEX('% [A-Z][A-Z] %', A COLLATE Latin1_general_cs_as)

So your full extract would be something like:

WITH CTE AS
(   SELECT  i = PATINDEX('% [A-Z][A-Z] %', A COLLATE Latin1_general_cs_as) + 1,
            A
    FROM    (VALUES 
                ('City ST Zip'),
                ('Another City ST Zip'),
                ('City, with comma ST Zip')
            ) t (A)
)
SELECT  City = LEFT(A, i - 2),
        State = SUBSTRING(A, i, 2),
        Zip = SUBSTRING(A, i + 3, LEN(A))
FROM    CTE;


PATINDEX appears to work intermittently because a character range (i.e. A-Z) cannot perform a case-sensitive search, even with a case-sensitive collation. Character ranges follow the collation's sort order, and case-sensitive sorting groups each upper-case letter with its lower-case equivalent, just as a dictionary would. The range order is really: a,A,b,B,c,C,d,D,etc. Or, depending on the collation, it might be: A,a,B,b,C,c,D,d,etc (there are 31 Collations that sort upper-case first). So a range from A to Z also takes in the lower-case letters that sort between those two endpoints. Case sensitivity merely keeps all A entries together, separate from the a entries, whereas in a case-insensitive sort they would be intermixed.

But if you specify each of the letters individually (hence not using a range), then it will work as expected:

PATINDEX(N'%[ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ]%',
     [CityStZip] COLLATE Latin1_General_100_CS_AS)
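
To see the difference side by side, the range and the explicit list can be run against the same strings under the same collation. This is only a minimal sketch: the sample values and the column aliases (RangeMatch, ExplicitMatch) are made up for illustration, and it assumes the Latin1_General_100_CS_AS collation is available on your instance.

```sql
-- Sample strings are invented for illustration only.
-- PATINDEX returns the 1-based position of the first match, or 0 if none.
SELECT  v.Sample,
        -- Range under a case-sensitive collation: can also match lower-case letters.
        RangeMatch    = PATINDEX('%[A-Z][A-Z]%',
                                 v.Sample COLLATE Latin1_General_100_CS_AS),
        -- Explicit list under the same collation: only the listed upper-case letters.
        ExplicitMatch = PATINDEX('%[ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ]%',
                                 v.Sample COLLATE Latin1_General_100_CS_AS)
FROM    (VALUES ('Springfield IL 62701'),
                ('springfield il 62701'),
                ('Springfield 62701')) AS v (Sample);
```

If the explanation above holds, RangeMatch should report a hit near the start of each string (on letters that are not a pair of capitals), while ExplicitMatch should only point at the "IL" in the first row and return 0 for the other two.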

The reason that PATINDEX and LIKE (both of which accept a bracketed range such as [A-Z]) work this way is that the [start-end] syntax is not a Regular Expression. Many people claim that PATINDEX and LIKE support "limited" RegEx because they support this syntax, but that is not true. It is merely a confusingly similar syntax to RegEx, where [A-Z] would normally not match any lower-case letters.

Of course, if you are guaranteed to only be searching on the US-English letters of A-Z, then a binary collation (i.e. one ending in _BIN2; don't use ones ending in _BIN as they have been deprecated since SQL Server 2005 was introduced, I believe) should work.

PATINDEX(N'%[A-Z][A-Z]%', [CityStZip] COLLATE Latin1_General_100_BIN2)
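
Putting the binary collation together with the space-delimited pattern from the earlier answer, a full split might look roughly like the sketch below. The sample rows are invented, and the CASE guards are an addition of mine: without them, a row with no match makes PATINDEX return 0 and LEFT would be called with a negative length.

```sql
WITH CTE AS
(   SELECT  A,
            -- Position of the first state letter; equals 1 when PATINDEX found nothing (0 + 1).
            i = PATINDEX('% [A-Z][A-Z] %', A COLLATE Latin1_General_100_BIN2) + 1
    FROM    (VALUES ('City ST 12345'),
                    ('Another City ST 12345'),
                    ('no state here')          -- hypothetical row with no match
            ) AS t (A)
)
SELECT  City  = CASE WHEN i > 1 THEN LEFT(A, i - 2) ELSE A END,
        State = CASE WHEN i > 1 THEN SUBSTRING(A, i, 2) END,
        Zip   = CASE WHEN i > 1 THEN SUBSTRING(A, i + 3, LEN(A)) END
FROM    CTE;
```

Note that this pattern requires a space after the state, so a row that ends in the state with no zip would fall into the unmatched branch.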

For more details about case-sensitive matching, especially in regards to including Unicode / NVARCHAR data, please see my related answer on DBA.StackExchange:

How to find values with multiple consecutive upper case characters

If you have zip code and state at the end of the string, then this might work:

select right(address, 5) as zip,
       left(right(address, 8), 2) as state,
       left(address, len(address) - 9) as city

You can start by removing the commas and double spaces from the address.
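
As a rough sketch of that cleanup followed by the fixed-position split, assuming every row ends in a two-letter state, one space, and a five-digit zip (the sample rows and the Cleaned alias are made up for illustration):

```sql
SELECT  Zip   = RIGHT(c.Cleaned, 5),
        State = LEFT(RIGHT(c.Cleaned, 8), 2),
        City  = LEFT(c.Cleaned, LEN(c.Cleaned) - 9)
FROM    (VALUES ('Dallas, TX 75201'),
                ('Salt Lake City  UT 84101')   -- note the double space
        ) AS t (address)
-- Strip commas first, then collapse double spaces into single ones.
CROSS APPLY (SELECT Cleaned = REPLACE(REPLACE(t.address, ',', ''), '  ', ' ')) AS c;
```

Rows shorter than nine characters, or with ZIP+4 codes, would need extra handling, since LEFT fails on a negative length and the fixed offsets no longer line up.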

If you have a table of states (which you should) with a column of the abbreviations you can do things like this:

SELECT a.* FROM Addresses a
INNER JOIN States s ON
a.CityStateZip Like '% ' + s.UpperCaseAbbreviation + ' %' --space on either side of abbreviation

You can make it work for both commas and spaces:

SELECT a.* FROM Addresses a
INNER JOIN States s ON
Replace(a.CityStateZip, ',' , ' ') Like '% ' + s.UpperCaseAbbreviation + ' %'
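
The same join can also be used to pull the three parts apart, not just to find matching rows. The following is a sketch under assumptions: the table variables stand in for the Addresses and States tables, the sample rows are invented, and the Cleaned and Pos aliases are hypothetical helpers.

```sql
DECLARE @States TABLE (UpperCaseAbbreviation char(2) PRIMARY KEY);
INSERT INTO @States VALUES ('TX'), ('UT'), ('IL');

DECLARE @Addresses TABLE (CityStateZip varchar(100));
INSERT INTO @Addresses VALUES ('Dallas, TX 75201'), ('Salt Lake City UT 84101');

SELECT  City  = RTRIM(LEFT(x.Cleaned, p.Pos - 1)),
        State = s.UpperCaseAbbreviation,
        Zip   = LTRIM(SUBSTRING(x.Cleaned, p.Pos + 4, LEN(x.Cleaned)))
FROM    @Addresses AS a
-- Turn commas into spaces so the abbreviation is always space-delimited.
CROSS APPLY (SELECT Cleaned = REPLACE(a.CityStateZip, ',', ' ')) AS x
INNER JOIN @States AS s
        ON x.Cleaned LIKE '% ' + s.UpperCaseAbbreviation + ' %'
-- Pos is the position of the space just before the abbreviation.
CROSS APPLY (SELECT Pos = CHARINDEX(' ' + s.UpperCaseAbbreviation + ' ', x.Cleaned)) AS p;
```

Keep in mind that LIKE and CHARINDEX follow the column's collation, so under a case-insensitive default a lower-case word such as "in" could match an abbreviation, and a city name containing a two-letter token that is also a state code could produce more than one row per address.
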
Licensed under: CC-BY-SA with attribution