ORDER BY and comparison of mixed strings of letters and numbers
-
29-09-2020 - |
Question
We need to do some reporting on values that are usually mixed strings of numbers and letters that need to be sorted 'naturally'. Things like, e.g. "P7B18" or "P12B3". @The strings will mostly be sequences of letters then numbers alternating. The number of these segments and the length of each could vary, though.
We'd like the numeric portions of these to be sorted in numeric order. Obviously, if I just handle those string values directly with ORDER BY
, then "P12B3" is going to come before "P7B18", since "P1" is earlier than "P7", but I'd like the reverse, as "P7" naturally precedes "P12".
I'd also like to be able to do range comparisons, e.g. @bin < 'P13S6'
or some such. I don't have to handle floating point or negative numbers; these will strictly be non-negative integers that we're dealing with. String lengths and number of segments could potentially be arbitrary, with no fixed upper bounds.
In our case, string casing isn't important, though if there's a way to do this in a collation-aware fashion, others might find that useful. The ugliest part of all this is I'd like to be able to do both ordering, and range filtering in the WHERE
clause.
If I were doing this in C#, it would be a pretty simple task: do some parsing to separate the alpha from the numeric, implement IComparable, and you're basically done. SQL Server, of course, doesn't appear to offer any similar functionality, at least as far as I'm aware.
Anybody know any good tricks to make this work? Is there some little-publicized ability to create custom CLR types that implement IComparable and have this behave as expected? I'm also not opposed to Stupid XML Tricks (see also: list concatenation), and I've got CLR regex matching/extracting/replacement wrapper functions available on the server as well.
EDIT: As a slightly more detailed example, I'd want the data to behave something like this.
SELECT bin FROM bins ORDER BY bin
bin
--------------------
M7R16L
P8RF6JJ
P16B5
PR7S19
PR7S19L
S2F3
S12F0
i.e. break the strings into tokens of all letters or all numbers, and sort them either alphabetically or numerically respectively, with the leftmost tokens being the most significant sorting term. Like I mentioned, piece of cake in .NET if you implement IComparable, but I don't know how (or if) you can do that sort of thing in SQL Server. It's certainly not something I've ever come across in 10 or so years of working with it.
Solution
Want a sensible, efficient means of sorting numbers in strings as actual numbers? Consider voting for my Microsoft Connect suggestion: Support "natural sorting" / DIGITSASNUMBERS as a Collation option
There is no easy, built-in means of doing this, but here is a possibility:
Normalize the strings by reformatting them into fixed-length segments:
- Create a sort column of type
VARCHAR(50) COLLATE Latin1_General_100_BIN2
. The max length of 50 might need to be adjusted based on the max number of segments and their potential maximum lengths. - While the normalization could be done in the app layer more efficiently, handling this in the database using a T-SQL UDF would allow for placing the scalar UDF into an
AFTER [or FOR] INSERT, UPDATE
Trigger such that you are guaranteed of properly setting the value for all records, even those coming in via ad hoc queries, etc. Of course, that scalar UDF can also be handled via SQLCLR, but it would need to be tested to determine which one was actually more efficient. ** - The UDF (regardless of being in T-SQL or SQLCLR) should:
- Process an unknown number of segments by reading each character and stopping when the type switches from alpha to numeric or numeric to alpha.
- Per each segment it should return a fixed-length string set to the maximum possible characters/digits of any segment (or maybe max + 1 or 2 to account for future growth).
- Alpha segments should be left-justified and right-padded with spaces.
- Numeric segments should be right-justified and left-padded with zeroes.
- If alpha characters can come in as mixed-case but the ordering needs to be case-insensitive, apply the
UPPER()
function to the final result of all segments (so that it only needs to be done once and not per segment). This will allow for proper sorting given the binary collation of the sort column.
- Create an
AFTER INSERT, UPDATE
Trigger on the table that calls the UDF to set the sort column. To improve performance, use theUPDATE()
function to determine if this code column is even in theSET
clause of theUPDATE
statement (simplyRETURN
if false), and then join theINSERTED
andDELETED
pseudo-tables on the code column to only process rows that have changes in the code value. Be sure to specifyCOLLATE Latin1_General_100_BIN2
on that JOIN condition to ensure accuracy in determining if there is a change. - Create an Index on the new sort column.
Example:
P7B18 -> "P 000007B 000018"
P12B3 -> "P 000012B 000003"
P12B3C8 -> "P 000012B 000003C 000008"
In this approach, you can sort via:
ORDER BY tbl.SortColumn
And you can do range filtering via:
WHERE tbl.SortColumn BETWEEN dbo.MyUDF('P7B18') AND dbo.MyUDF('P12B3')
or:
DECLARE @RangeStart VARCHAR(50),
@RangeEnd VARCHAR(50);
SELECT @RangeStart = dbo.MyUDF('P7B18'),
@RangeEnd = dbo.MyUDF('P12B3');
WHERE tbl.SortColumn BETWEEN @RangeStart AND @RangeEnd
Both the ORDER BY
and the WHERE
filter should use the binary collation defined for SortColumn
due to Collation Precedence.
Equality comparisons would still be done on the original value column.
Other thoughts:
Use a SQLCLR UDT. This might could work, though it is unclear if it presents a net-gain as compared to the approach described above.
Yes, a SQLCLR UDT can have its comparison operators overridden with custom algorithms. This handles situations where the value is being compared to either another value that is already the same custom type, or one that needs to be implicitly converted. This should handle the range filter in a
WHERE
condition.With regards to sorting the UDT as a regular column type (not a computed column), this is only possible if the UDT is "byte ordered". Being "byte ordered" means that the binary representation of the UDT (which can be defined in the UDT) naturally sorts in the appropriate order. Assuming that the binary representation is handled similarly to the approach described above for the VARCHAR(50) column that has fixed-length segments that are padded, that would qualify. Or, if it was not easy to ensure that the binary representation would naturally be ordered in the proper way, you could expose a method or property of the UDT that outputs a value that would be properly ordered, and then create a
PERSISTED
computed column on that method or property. The method needs to be deterministic and marked asIsDeterministic = true
.Benefits of this approach are:
- No need for an "original value" field.
- No need to call a UDF to insert the data or to compare values. Assuming that the
Parse
method of the UDT takes in theP7B18
value and converts it, then you should be able to simply insert the values naturally asP7B18
. And with the implicit conversion method set in the UDT, the WHERE condition would also allow for using simply P7B18`.
Consequences of this approach are:
- Simply selecting the field will return the binary representation, if using the byte ordered UDT as the column datatype. Or if using a
PERSISTED
computed column on a property or method of the UDT, then you would get the representation returned by the property or method. If you want the originalP7B18
value, then you need to call a method or property of the UDT that is coded to return that representation. Since you have to override theToString
method anyway, that is a good candidate for providing this. It is unclear (at least to me right now since I have not tested this part) how easy/difficult it would be to make any changes to the binary representation. Changing the stored, sortable representation might require dropping and re-adding the field. Also, dropping the Assembly containing the UDT would fail if used in either manner, so you would want to make sure that there was nothing else in the Assembly besides this UDT. You can
ALTER ASSEMBLY
to replace the definition, but there are some restrictions on that.On the other hand, the
VARCHAR()
field is data that is disconnected from the algorithm so it would only require updating the column. And if there are tens of millions of rows (or more) then that can be done in a batched approach.
Implement the ICU library which actually allows for doing this alphanumeric sorting. While highly functional, the library only comes in two languages: C/C++ and Java. Which means you might need to either do some tweaks to get it to work in Visual C++, or there is the off chance that the Java code can be converted to MSIL using IKVM. There are one or two .NET side projects linked on that site that provide a COM interface that can be accessed in managed code, but I believe they have not been updated in a while and I have not tried them. The best-bet here would be to handle this in the app layer with the goal of generating sort keys. The sort keys would then be saved into a new sort column.
This might not be the most practical approach. However, it is still very cool that such an ability exists. I provided a more detailed walk-through of an example of this in the following answer:
Is there a collation to sort the following strings in the following order 1,2,3,6,10,10A,10B,11?
But the pattern being dealt with in that question is a bit simpler. For an example showing that the type of pattern being dealt with in this Question also works, please go to the following page:
Under "Settings", set the "numeric" option to "on" and all of the others should be set to "default". Next, to the right of the "sort" button, uncheck the option for "diff strengths" and check the option for "sort keys". Then replace the list of items in the "Input" text area with the following list:
P12B22 P7B18 P12B3 as456456hgjg6786867 P7Bb19 P7BA19 P7BB19 P007B18 P7Bb20 P7Bb19z23
Click the "sort" button. The "Output" text area should display the following:
as456456hgjg6786867 29 4D 0F 7A EA C8 37 35 3B 35 0F 84 17 A7 0F 93 90 , 0D , , 0D . P7B18 47 0F 09 2B 0F 14 , 08 , FD F1 , DC C5 DC 05 . P007B18 47 0F 09 2B 0F 14 , 08 , FD F1 , DC C5 DC 05 . P7BA19 47 0F 09 2B 29 0F 15 , 09 , FD FF 10 , DC C5 DC DC 05 . P7Bb19 47 0F 09 2B 2B 0F 15 , 09 , FD F2 , DC C5 DC 06 . P7BB19 47 0F 09 2B 2B 0F 15 , 09 , FD FF 10 , DC C5 DC DC 05 . P7Bb19z23 47 0F 09 2B 2B 0F 15 5B 0F 19 , 0B , FD F4 , DC C5 DC 08 . P7Bb20 47 0F 09 2B 2B 0F 16 , 09 , FD F2 , DC C5 DC 06 . P12B3 47 0F 0E 2B 0F 05 , 08 , FD F1 , DC C5 DC 05 . P12B22 47 0F 0E 2B 0F 18 , 08 , FD F1 , DC C5 DC 05 .
Please note that the sort keys are structure in multiple fields, separated by commas. Each field needs to be sorted independently, so that presents another small problem to solve if needing to implement this in SQL Server.
** If there is any concern about performance regarding the use of User-Defined Functions, please note that the proposed approaches make minimal use of them. In fact, the main reason for storing the normalized value was to avoid calling a UDF per each row of each query. In the primary approach, the UDF is used to set the value of SortColumn
, and that is only done upon INSERT
and UPDATE
via the Trigger. Selecting values is much more common than inserting and updating, and some value are never updated. Per each SELECT
query that uses the SortColumn
for a range filter in the WHERE
clause, the UDF is only needed one time per each of the range_start and range_end values to get the normalized values; the UDF is not called per-row.
With regards to the UDT, the usage is actually the same as with the scalar UDF. Meaning, inserting and updating would call the normalization method once per each row to set the value. Then, the normalization method would be called once per query per each range_start and range_value in a range filter, but not per row.
A point in favor of handling the normalization entirely in a SQLCLR UDF is that given it is not doing any data access and is deterministic, if it is marked as IsDeterministic = true
, then it can participate in parallel plans (which might help the INSERT
and UPDATE
operations) whereas a T-SQL UDF will prevent a parallel plan from being used.