Question

Our application has a CustomerNumber field. We have hundreds of different people using the system (each has their own login and their own list of CustomerNumbers). An individual user might have at most 100,000 customers. Many have less than 100.

Some people only put actual numbers into their customer number fields, while others use a mixture of things. The system allows 20 characters which can be A-Z, 0-9 or a dash, and stores these in a VARCHAR2(20). Anything lowercase is made uppercase before being stored.

Now, let's say we have a simple report that lists all the customers for a particular user, sorted by Customer Number. e.g.

SELECT CustomerNumber,CustomerName
FROM Customer
WHERE User = ?
ORDER BY CustomerNumber;

This is a naive solution as the people that only ever use numbers do not want to see a plain alphabetic sort (where "10" comes before "9").

I do not wish to ask the user any unnecessary questions about their data.

I'm using Oracle, but I think it would be interesting to see some solutions for other databases. Please include which database your answer works on.

What do you think the best way to implement this is?

Was it helpful?

Solution

In Oracle 10g:

SELECT  cust_name
FROM    t_customer c 
ORDER BY
    REGEXP_REPLACE(cust_name, '[0-9]', ''), TO_NUMBER(REGEXP_SUBSTR(cust_name, '[0-9]+'))

This will sort by the first occurence of number, not regarding it's position, i. e.:

  1. customer1 < customer2 < customer10
    • cust1omer ? customer1
    • cust8omer1 ? cust8omer2

, where a ? means that the order is undefined.

That suffices for most cases.

To force sort order on case 2, you may add a REGEXP_INSTR(cust_name, '[0-9]', n) to ORDER BY list n times, forcing order on the first appearance of n-th (2nd, 3rd etc.) group of digits.

To force sort order on case 3, you may add a TO_NUMBER(REGEXP_SUBSTR(cust_name, '[0-9]+', n)) to ORDER BY list n times, forcing order of n-th. group of digits.

In practice, the query I wrote is enough.

You may create a function based index on these expressions, but you'll need to force it with a hint, and a one-pass SORT ORDER BY will be performed anyway, as the CBO doesn't trust function-base indexes enough to allow an ORDER BY on them.

OTHER TIPS

Probably your best bet is to pre-calculate a separate column and use that for ordering and use the customer number for display. This would probably involve 0-padding any internal integers to a fixed length.

The other possibility is to do your sorting post-select on the returned results.

Jeff Atwood has put together a blog posting about how some people calculate human friendly sort orders.

You could have a numeric column [CustomerNumberInt] that is only used when the CustomerNumber is purely numeric (NULL otherwise[1]), then

ORDER BY CustomerNumberInt, CustomerNumber

[1] depending on how your SQL version handles NULLs in ORDER BY you might want to default it to zero (or infinity!)

I have a similar horrible situation and have developed a suitably horrible function to deal with it (SQLServer)

In my situation I have a table of "units" (this is a work-tracking system for students, so unit in this context represents a course they're doing). Units have a code, which for the most part is purely numeric, but for various reasons it was made a varchar and they decided to prefix some by up to 5 characters. So they expect 53,123,237,356 to sort normally, but also T53, T123, T237, T356

UnitCode is a nvarchar(30)

Here's the body of the function:

declare @sortkey nvarchar(30)

select @sortkey = 
    case
        when @unitcode like '[^0-9][0-9]%' then left(@unitcode,1) + left('000000000000000000000000000000',30-(len(@unitcode))) + right(@unitcode,len(@unitcode)-1)
        when @unitcode like '[^0-9][^0-9][0-9]%' then left(@unitcode,2) + left('000000000000000000000000000000',30-(len(@unitcode))) + right(@unitcode,len(@unitcode)-2)
        when @unitcode like '[^0-9][^0-9][^0-9][0-9]%' then left(@unitcode,3) + left('000000000000000000000000000000',30-(len(@unitcode))) + right(@unitcode,len(@unitcode)-3)
        when @unitcode like '[^0-9][^0-9][^0-9][^0-9][0-9]%' then left(@unitcode,4) + left('000000000000000000000000000000',30-(len(@unitcode))) + right(@unitcode,len(@unitcode)-4)
        when @unitcode like '[^0-9][^0-9][^0-9][^0-9][^0-9][0-9]%' then left(@unitcode,5) + left('000000000000000000000000000000',30-(len(@unitcode))) + right(@unitcode,len(@unitcode)-5)
        when @unitcode like '%[^0-9]%' then @unitcode
        else left('000000000000000000000000000000',30-len(@unitcode)) + @unitcode
    end 

return @sortkey

I wanted to shoot myself in the face after writing that, however it works and seems not to kill the server when it runs.

I used this in SQL SERVER and working great: Here the solution is to pad the numeric values with a character in front so that all are of the same string length.

Here is an example using that approach:

select MyCol
from MyTable
order by 
    case IsNumeric(MyCol) 
        when 1 then Replicate('0', 100 - Len(MyCol)) + MyCol
        else MyCol
    end

The 100 should be replaced with the actual length of that column.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top