Obtaining numeric/normalized representation of strings to aid in 'natural sort ordering' of titles in DB

StackOverflow https://stackoverflow.com/questions/591256

Question

I'd like to store an additional column in a table as a 'sort value', which is a numeric representation of the title column, such that the order of such values represents the string's natural alphabetical sort order. Ie, so that I can retrieve rows ordered by the sort value, and they'll be in natural sort order - and when I insert a new row, I can generate the numeric value and know that value relative to others will represent the string's position in an alphabetic search, accurate to the first X letters or so.

A couple of reasons for this: firstly, I would like a more natural ordering than a plain ordering offered by a DB server, where things like "The" and "A" and punctuation are ignored at the start, and numbers are treated 'naturally'.

Secondly, this is for an index with a lot of permutations - it will save space, and perhaps time when traversing an index with many rows.

What I am after for is the algorithm to translate the string to that numeric value, or just, I suppose, a normalised string value.

I am using PHP and MySQL.

I'm afraid that "pull everything from the DB and sort in PHP using natcasesort()" is not a solution for this particular situation, as I'd like to retrieve rows (using order by and group by) in sorted order before they get to a join or limit clause. Thanks.

Edit:

Thanks for answers so far. It's just occurred to me that the fact my application uses UTF-8 is quite relevant. With that said, I think the practicality of representing the initial part of a string in a packed/numeric form is a stretch, maybe just some sort of normalised form (everything case-folded, numbers zero-padded, and as many characters as possible normalised to their root ie ã to a) would be appropriate.

Was it helpful?

Solution 2

Thanks for the answers so far. I just wanted to update people with the solution I'm going with. I've taken an approach that is different from that which I envisaged in my question.

To recap, I wanted to store a representations of strings such that when retrieved in binary order, whatever I stored for "8 Mile" would be sorted before whatever I stored for "101 Dalmations".

For each number in the string, which is essentially a sequence of digits, I insert a digit before them that describes how many digits the number is.

So, "8" becomes "18", and "101" becomes "3101". It adds some redundancy to the number, in that you are using more digits than you need and some values won't exist, but they now have the property that a binary sort will sort the numbers into numerical order. "101" would have sorted before "8" beforehand, which was undesired. After adding that extra digit, "18" sorts before "3101".

Note: if the number is 9 or more digits long, I add two digits to the start: the number of digits in the number minus 9, then a 9, then the number. This allows for numbers up to 18 digits: good enough for me.

I'm also normalising the string in other ways too - everything to lower case, Unicode characters will be translated into the closest ascii equivalent, and 'a', 'an', and 'the' will be stripped if they are the first word.

I gave up on making the string into one big numeric value; it is still a string, it's just that it's not designed for humans to read.

OTHER TIPS

The part "accurate to the first X letters or so" is crucial, since a completely accurate assignment of numbers is impossible. To see this, suppose for concreteness that your title column is varchar(50) and you want to use a 32-bit integer sort_order column. Then you could store (255^51 - 1) different titles, each of which would require a different sort_order value -- but there are only 2^32 different sort_order values to go around. Even if you said you would never add more than 2^32 rows, you would need to know in advance which titles they would have in order to come up with a scheme that avoided having to reassign all sort_order values every time a row was inserted.

Although a "theoretically perfect" solution is impossible, it's still possible to get a practical "approximate" system that should work with perfect accuracy for up to many millions of rows. The simplest way would be to use a floating-point type. Initially, list out the rows in sorted order and assign the first row a sort_order value of 1.0, the second a value of 2.0 and so on. Then, whenever a row is inserted, set its sort_order to the midpoint (that is, the average) of the rows on either side in sorted order. If the newly added row comes before (or after) all existing rows, just set it to 1 less than (or more than) the previous minimum (or maximum) sort_order value.

It's a good idea to reassign numbers from scratch (as in the initial build step) to "smooth out" the values periodically, or after a large number of updates. Particularly if the table starts small and then gets big, you may find some "bunching" of numbers at the ends.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top