Question

I need to filter out junk data in a SQL Server 2008 table. I need to identify these records and pull them out.

  • Char[0] = A..Z, a..z
  • Char[1] = 0..9
  • Char[2] = 0..9
  • Char[3] = 0..9
  • Char[4] = 0..9

{No blanks allowed}

Basically, a clean record will look like this:

  • T1234, U2468, K123, P50054 (4 record examples)

Junk data looks like this:

  • T12.., .T12, MARK, TP1, SP2, BFGL, BFPL (7 record examples)

Can someone please help with a SQL query that uses LEFT and RIGHT to extract those characters and then checks them with LIKE, IN, or something similar?

A function would be great though!


Solution

The following should work in a few different systems:

SELECT * 
FROM TheTable
WHERE Data LIKE '[A-Za-z][0-9][0-9][0-9][0-9]%'
AND Data NOT LIKE '% %'

This matches P2343 but also P23423JUNK and similar text, because it only requires the value to start with a letter followed by four digits (A0000*).

Now, if the OP means that the first position is a letter and all succeeding positions are numeric, as in A0+, then use the following (in SQL Server and a good many other database systems):

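SELECT *
FROM TheTable
-- first character is a letter
WHERE SUBSTRING(Data, 1, 1) LIKE '[A-Za-z]'
-- no non-digit from the 2nd character onwards (LEN(Data) as the length avoids a negative length on empty strings)
AND SUBSTRING(Data, 2, LEN(Data)) NOT LIKE '%[^0-9]%'
-- one letter plus at least four digits
AND LEN(Data) >= 5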

To incorporate this into a SQL Server 2008 function, since this appears to be what you'd like most, you can write:

CREATE FUNCTION dbo.ufn_IsProperFormat(@Data VARCHAR(50))
RETURNS BIT
AS
BEGIN
    -- 1 when @Data is one letter followed by at least four digits, otherwise 0
    RETURN
        CASE
            WHEN SUBSTRING(@Data, 1, 1) LIKE '[A-Za-z]'
                AND SUBSTRING(@Data, 2, LEN(@Data)) NOT LIKE '%[^0-9]%'
                AND LEN(@Data) >= 5 THEN 1
            ELSE 0
        END
END

...and call into it like so:

SELECT * 
FROM TheTable
WHERE dbo.ufn_IsProperFormat(Data) = 1
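
As a quick check, the sketch below runs the question's sample values through the function. The table variable @Samples is made up for illustration, and it assumes dbo.ufn_IsProperFormat has already been created as shown. Filtering on = 0 lists the junk rows, which is what the question asks to pull out.

```sql
-- Hypothetical sample data, copied from the question's examples.
DECLARE @Samples TABLE (Data VARCHAR(50));

INSERT INTO @Samples (Data)
VALUES ('T1234'), ('U2468'), ('K123'), ('P50054'),
       ('T12..'), ('.T12'), ('MARK'), ('TP1'), ('SP2'), ('BFGL'), ('BFPL');

-- Rows the function rejects, i.e. the junk records.
SELECT Data
FROM @Samples
WHERE dbo.ufn_IsProperFormat(Data) = 0;
```

Note that K123 is reported as junk too, because this version insists on at least four digits; lower the LEN check to 4 if three-digit values should count as clean.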

...this query needs to change for Oracle, because Oracle's LIKE does not support bracket notation; REGEXP_LIKE can be used instead:

SELECT *
FROM TheTable
WHERE REGEXP_LIKE(Data, '^[A-Za-z]\d{4,}$')

This is the expansion gbn is doing in his answer, but these versions allow for varying string lengths without the OR conditions.

EDIT: Updated to support examples in SQL Server and Oracle for ensuring the format A0+, so that A1324, A2342388, and P2342 match but A2342JUNK and A234 do not.

The Oracle REGEXP_LIKE code was borrowed from Mark's post but updated to support 4 or more numeric digits.

Added a custom SQL Server 2008 approach which implements these techniques.

OTHER TIPS

It depends on your database. Many have regex functions (note: these examples are not tested, so check them).

e.g. Oracle

SELECT x
 FROM table
 WHERE REGEXP_LIKE(x, '^[A-Za-z][[:digit:]]{4}$')

Sybase uses LIKE

Given that you're allowing between 3 and 6 digits for the number in your examples then it's probably better to use the ISNUMERIC() function on the 2nd character onwards:

SELECT *
FROM TheTable
-- start with a letter
WHERE Data LIKE '[A-Za-z]%'
    -- everything from 2nd character onwards is a number
    AND ISNUMERIC( SUBSTRING( Data, 2, 50 ) ) = 1
    -- number doesn't have a decimal place
    AND Data NOT LIKE '%.%'

For more information look at the ISNUMERIC function on MSDN.

Also note that:

  • I've limited the 2nd part with the number to 50 characters maximum, change this to suit your needs.
  • Strictly speaking, you should also check for currency symbols and the like, because ISNUMERIC accepts them, as well as +/- and some other characters.

A better option might be to create a function that checks that each character after the first is between 0 and 9 (or between 48 and 57 if you're comparing ASCII codes).
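
One way to sketch that stricter check without writing a loop is a negated character class: NOT LIKE '%[^0-9]%' is true only when no character outside 0-9 appears. This swaps ISNUMERIC for LIKE rather than inspecting ASCII codes one by one. TheTable and Data are the same placeholder names used above, and the 3-to-5 digit bound is an assumption read off the question's clean examples.

```sql
SELECT *
FROM TheTable
-- start with a letter
WHERE Data LIKE '[A-Za-z]%'
    -- nothing but digits from the 2nd character onwards
    AND SUBSTRING(Data, 2, 50) NOT LIKE '%[^0-9]%'
    -- assumed bound: one letter plus 3 to 5 digits
    AND LEN(Data) BETWEEN 4 AND 6
```

Because only the characters 0 through 9 pass, the separate checks for decimal points, signs and currency symbols are no longer needed.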

SQL Server has no built-in regular expression support, so you have to use OR. Correcting David Andres' answer...

WHERE
    (
    Data LIKE '[A-Za-z][0-9][0-9][0-9]'
    OR
    Data LIKE '[A-Za-z][0-9][0-9][0-9][0-9]'
    OR
    Data LIKE '[A-Za-z][0-9][0-9][0-9][0-9][0-9]'
    )

David's answer allows "D1234junk" through

You also only need "[A-Z]" if you don't have case sensitivity
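
If the list of OR branches grows, one possible way to collapse it is to build the pattern from the value's own length with REPLICATE. This is only a sketch: TheTable and Data are the placeholder names from the earlier answers, and the 3-to-5 digit range mirrors the three patterns above.

```sql
SELECT *
FROM TheTable
-- one letter, then exactly LEN(Data) - 1 digits
WHERE Data LIKE '[A-Za-z]' + REPLICATE('[0-9]', LEN(Data) - 1)
    -- same 3 to 5 digit range as the OR version
    AND LEN(Data) BETWEEN 4 AND 6
    -- no blanks allowed
    AND Data NOT LIKE '% %'
```

Widening the accepted range then only means changing the BETWEEN bounds instead of adding another LIKE branch.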

Licensed under: CC-BY-SA with attribution