Natural Sort in MySQL
-
03-07-2019 - |
Question
Is there an elegant way to have performant, natural sorting in a MySQL database?
For example if I have this data set:
- Final Fantasy
- Final Fantasy 4
- Final Fantasy 10
- Final Fantasy 12
- Final Fantasy 12: Chains of Promathia
- Final Fantasy Adventure
- Final Fantasy Origins
- Final Fantasy Tactics
Any other elegant solution than to split up the games' names into their components
- Title: "Final Fantasy"
- Number: "12"
- Subtitle: "Chains of Promathia"
to make sure that they come out in the right order? (10 after 4, not before 2).
Doing so is a pain in the a** because every now and then there's another game that breaks that mechanism of parsing the game title (e.g. "Warhammer 40,000", "James Bond 007")
Solution
I think this is why a lot of things are sorted by release date.
A solution could be to create another column in your table for the "SortKey". This could be a sanitized version of the title which conforms to a pattern you create for easy sorting or a counter.
OTHER TIPS
Here is a quick solution:
SELECT alphanumeric,
integer
FROM sorting_test
ORDER BY LENGTH(alphanumeric), alphanumeric
Just found this:
SELECT names FROM your_table ORDER BY games + 0 ASC
Does a natural sort when the numbers are at the front, might work for middle as well.
Same function as posted by @plalx, but rewritten to MySQL:
DROP FUNCTION IF EXISTS `udf_FirstNumberPos`;
DELIMITER ;;
CREATE FUNCTION `udf_FirstNumberPos` (`instring` varchar(4000))
RETURNS int
LANGUAGE SQL
DETERMINISTIC
NO SQL
SQL SECURITY INVOKER
BEGIN
DECLARE position int;
DECLARE tmp_position int;
SET position = 5000;
SET tmp_position = LOCATE('0', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
SET tmp_position = LOCATE('1', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
SET tmp_position = LOCATE('2', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
SET tmp_position = LOCATE('3', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
SET tmp_position = LOCATE('4', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
SET tmp_position = LOCATE('5', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
SET tmp_position = LOCATE('6', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
SET tmp_position = LOCATE('7', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
SET tmp_position = LOCATE('8', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
SET tmp_position = LOCATE('9', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
IF (position = 5000) THEN RETURN 0; END IF;
RETURN position;
END
;;
DROP FUNCTION IF EXISTS `udf_NaturalSortFormat`;
DELIMITER ;;
CREATE FUNCTION `udf_NaturalSortFormat` (`instring` varchar(4000), `numberLength` int, `sameOrderChars` char(50))
RETURNS varchar(4000)
LANGUAGE SQL
DETERMINISTIC
NO SQL
SQL SECURITY INVOKER
BEGIN
DECLARE sortString varchar(4000);
DECLARE numStartIndex int;
DECLARE numEndIndex int;
DECLARE padLength int;
DECLARE totalPadLength int;
DECLARE i int;
DECLARE sameOrderCharsLen int;
SET totalPadLength = 0;
SET instring = TRIM(instring);
SET sortString = instring;
SET numStartIndex = udf_FirstNumberPos(instring);
SET numEndIndex = 0;
SET i = 1;
SET sameOrderCharsLen = CHAR_LENGTH(sameOrderChars);
WHILE (i <= sameOrderCharsLen) DO
SET sortString = REPLACE(sortString, SUBSTRING(sameOrderChars, i, 1), ' ');
SET i = i + 1;
END WHILE;
WHILE (numStartIndex <> 0) DO
SET numStartIndex = numStartIndex + numEndIndex;
SET numEndIndex = numStartIndex;
WHILE (udf_FirstNumberPos(SUBSTRING(instring, numEndIndex, 1)) = 1) DO
SET numEndIndex = numEndIndex + 1;
END WHILE;
SET numEndIndex = numEndIndex - 1;
SET padLength = numberLength - (numEndIndex + 1 - numStartIndex);
IF padLength < 0 THEN
SET padLength = 0;
END IF;
SET sortString = INSERT(sortString, numStartIndex + totalPadLength, 0, REPEAT('0', padLength));
SET totalPadLength = totalPadLength + padLength;
SET numStartIndex = udf_FirstNumberPos(RIGHT(instring, CHAR_LENGTH(instring) - numEndIndex));
END WHILE;
RETURN sortString;
END
;;
Usage:
SELECT name FROM products ORDER BY udf_NaturalSortFormat(name, 10, ".")
MySQL doesn't allow this sort of "natural sorting", so it looks like the best way to get what you're after is to split your data set up as you've described above (separate id field, etc), or failing that, perform a sort based on a non-title element, indexed element in your db (date, inserted id in the db, etc).
Having the db do the sorting for you is almost always going to be quicker than reading large data sets into your programming language of choice and sorting it there, so if you've any control at all over the db schema here, then look at adding easily-sorted fields as described above, it'll save you a lot of hassle and maintenance in the long run.
Requests to add a "natural sort" come up from time to time on the MySQL bugs and discussion forums, and many solutions revolve around stripping out specific parts of your data and casting them for the ORDER BY
part of the query, e.g.
SELECT * FROM table ORDER BY CAST(mid(name, 6, LENGTH(c) -5) AS unsigned)
This sort of solution could just about be made to work on your Final Fantasy example above, but isn't particularly flexible and unlikely to extend cleanly to a dataset including, say, "Warhammer 40,000" and "James Bond 007" I'm afraid.
I've written this function for MSSQL 2000 a while ago:
/**
* Returns a string formatted for natural sorting. This function is very useful when having to sort alpha-numeric strings.
*
* @author Alexandre Potvin Latreille (plalx)
* @param {nvarchar(4000)} string The formatted string.
* @param {int} numberLength The length each number should have (including padding). This should be the length of the longest number. Defaults to 10.
* @param {char(50)} sameOrderChars A list of characters that should have the same order. Ex: '.-/'. Defaults to empty string.
*
* @return {nvarchar(4000)} A string for natural sorting.
* Example of use:
*
* SELECT Name FROM TableA ORDER BY Name
* TableA (unordered) TableA (ordered)
* ------------ ------------
* ID Name ID Name
* 1. A1. 1. A1-1.
* 2. A1-1. 2. A1.
* 3. R1 --> 3. R1
* 4. R11 4. R11
* 5. R2 5. R2
*
*
* As we can see, humans would expect A1., A1-1., R1, R2, R11 but that's not how SQL is sorting it.
* We can use this function to fix this.
*
* SELECT Name FROM TableA ORDER BY dbo.udf_NaturalSortFormat(Name, default, '.-')
* TableA (unordered) TableA (ordered)
* ------------ ------------
* ID Name ID Name
* 1. A1. 1. A1.
* 2. A1-1. 2. A1-1.
* 3. R1 --> 3. R1
* 4. R11 4. R2
* 5. R2 5. R11
*/
CREATE FUNCTION dbo.udf_NaturalSortFormat(
@string nvarchar(4000),
@numberLength int = 10,
@sameOrderChars char(50) = ''
)
RETURNS varchar(4000)
AS
BEGIN
DECLARE @sortString varchar(4000),
@numStartIndex int,
@numEndIndex int,
@padLength int,
@totalPadLength int,
@i int,
@sameOrderCharsLen int;
SELECT
@totalPadLength = 0,
@string = RTRIM(LTRIM(@string)),
@sortString = @string,
@numStartIndex = PATINDEX('%[0-9]%', @string),
@numEndIndex = 0,
@i = 1,
@sameOrderCharsLen = LEN(@sameOrderChars);
-- Replace all char that has to have the same order by a space.
WHILE (@i <= @sameOrderCharsLen)
BEGIN
SET @sortString = REPLACE(@sortString, SUBSTRING(@sameOrderChars, @i, 1), ' ');
SET @i = @i + 1;
END
-- Pad numbers with zeros.
WHILE (@numStartIndex <> 0)
BEGIN
SET @numStartIndex = @numStartIndex + @numEndIndex;
SET @numEndIndex = @numStartIndex;
WHILE(PATINDEX('[0-9]', SUBSTRING(@string, @numEndIndex, 1)) = 1)
BEGIN
SET @numEndIndex = @numEndIndex + 1;
END
SET @numEndIndex = @numEndIndex - 1;
SET @padLength = @numberLength - (@numEndIndex + 1 - @numStartIndex);
IF @padLength < 0
BEGIN
SET @padLength = 0;
END
SET @sortString = STUFF(
@sortString,
@numStartIndex + @totalPadLength,
0,
REPLICATE('0', @padLength)
);
SET @totalPadLength = @totalPadLength + @padLength;
SET @numStartIndex = PATINDEX('%[0-9]%', RIGHT(@string, LEN(@string) - @numEndIndex));
END
RETURN @sortString;
END
GO
So, while I know that you have found a satisfactory answer, I was struggling with this problem for awhile, and we'd previously determined that it could not be done reasonably well in SQL and we were going to have to use javascript on a JSON array.
Here's how I solved it just using SQL. Hopefully this is helpful for others:
I had data such as:
Scene 1 Scene 1A Scene 1B Scene 2A Scene 3 ... Scene 101 Scene XXA1 Scene XXA2
I actually didn't "cast" things though I suppose that may also have worked.
I first replaced the parts that were unchanging in the data, in this case "Scene ", and then did a LPAD to line things up. This seems to allow pretty well for the alpha strings to sort properly as well as the numbered ones.
My ORDER BY
clause looks like:
ORDER BY LPAD(REPLACE(`table`.`column`,'Scene ',''),10,'0')
Obviously this doesn't help with the original problem which was not so uniform - but I imagine this would probably work for many other related problems, so putting it out there.
Add a Sort Key (Rank) in your table.
ORDER BY rank
Utilise the "Release Date" column.
ORDER BY release_date
When extracting the data from SQL, make your object do the sorting, e.g., if extracting into a Set, make it a TreeSet, and make your data model implement Comparable and enact the natural sort algorithm here (insertion sort will suffice if you are using a language without collections) as you'll be reading the rows from SQL one by one as you create your model and insert it into the collection)
Regarding the best response from Richard Toth https://stackoverflow.com/a/12257917/4052357
Watch out for UTF8 encoded strings that contain 2byte (or more) characters and numbers e.g.
12 南新宿
Using MySQL's LENGTH()
in udf_NaturalSortFormat
function will return the byte length of the string and be incorrect, instead use CHAR_LENGTH()
which will return the correct character length.
In my case using LENGTH()
caused queries to never complete and result in 100% CPU usage for MySQL
DROP FUNCTION IF EXISTS `udf_NaturalSortFormat`;
DELIMITER ;;
CREATE FUNCTION `udf_NaturalSortFormat` (`instring` varchar(4000), `numberLength` int, `sameOrderChars` char(50))
RETURNS varchar(4000)
LANGUAGE SQL
DETERMINISTIC
NO SQL
SQL SECURITY INVOKER
BEGIN
DECLARE sortString varchar(4000);
DECLARE numStartIndex int;
DECLARE numEndIndex int;
DECLARE padLength int;
DECLARE totalPadLength int;
DECLARE i int;
DECLARE sameOrderCharsLen int;
SET totalPadLength = 0;
SET instring = TRIM(instring);
SET sortString = instring;
SET numStartIndex = udf_FirstNumberPos(instring);
SET numEndIndex = 0;
SET i = 1;
SET sameOrderCharsLen = CHAR_LENGTH(sameOrderChars);
WHILE (i <= sameOrderCharsLen) DO
SET sortString = REPLACE(sortString, SUBSTRING(sameOrderChars, i, 1), ' ');
SET i = i + 1;
END WHILE;
WHILE (numStartIndex <> 0) DO
SET numStartIndex = numStartIndex + numEndIndex;
SET numEndIndex = numStartIndex;
WHILE (udf_FirstNumberPos(SUBSTRING(instring, numEndIndex, 1)) = 1) DO
SET numEndIndex = numEndIndex + 1;
END WHILE;
SET numEndIndex = numEndIndex - 1;
SET padLength = numberLength - (numEndIndex + 1 - numStartIndex);
IF padLength < 0 THEN
SET padLength = 0;
END IF;
SET sortString = INSERT(sortString, numStartIndex + totalPadLength, 0, REPEAT('0', padLength));
SET totalPadLength = totalPadLength + padLength;
SET numStartIndex = udf_FirstNumberPos(RIGHT(instring, CHAR_LENGTH(instring) - numEndIndex));
END WHILE;
RETURN sortString;
END
;;
p.s. I would have added this as a comment to the original but I don't have enough reputation (yet)
To order:
0
1
2
10
23
101
205
1000
a
aac
b
casdsadsa
css
Use this query:
SELECT column_name FROM table_name ORDER BY column_name REGEXP '^\d*[^\da-z&\.\' \-\"\!\@\#\$\%\^\*\(\)\;\:\\,\?\/\~\`\|\_\-]' DESC, column_name + 0, column_name;
If you do not want to reinvent the wheel or have a headache with lot of code that does not work, just use Drupal Natural Sort ... Just run the SQL that comes zipped (MySQL or Postgre), and that's it. When making a query, simply order using:
... ORDER BY natsort_canon(column_name, 'natural')
Another option is to do the sorting in memory after pulling the data from mysql. While it won't be the best option from a performance standpoint, if you are not sorting huge lists you should be fine.
If you take a look at Jeff's post, you can find plenty of algorithms for what ever language you might be working with. Sorting for Humans : Natural Sort Order
Add a field for "sort key" that has all strings of digits zero-padded to a fixed length and then sort on that field instead.
If you might have long strings of digits, another method is to prepend the number of digits (fixed-width, zero-padded) to each string of digits. For example, if you won't have more than 99 digits in a row, then for "Super Blast 10 Ultra" the sort key would be "Super Blast 0210 Ultra".
You can also create in a dynamic way the "sort column" :
SELECT name, (name = '-') boolDash, (name = '0') boolZero, (name+0 > 0) boolNum
FROM table
ORDER BY boolDash DESC, boolZero DESC, boolNum DESC, (name+0), name
That way, you can create groups to sort.
In my query, I wanted the '-' in front of everything, then the numbers, then the text. Which could result in something like :
-
0
1
2
3
4
5
10
13
19
99
102
Chair
Dog
Table
Windows
That way you don't have to maintain the sort column in the correct order as you add data. You can also change your sort order depending on what you need.
I have tried several solutions but the actually it is very simple:
SELECT test_column FROM test_table ORDER BY LENGTH(test_column) DESC, test_column DESC
/*
Result
--------
value_1
value_2
value_3
value_4
value_5
value_6
value_7
value_8
value_9
value_10
value_11
value_12
value_13
value_14
value_15
...
*/
If you're using PHP you can do the the natural sort in php.
$keys = array();
$values = array();
foreach ($results as $index => $row) {
$key = $row['name'].'__'.$index; // Add the index to create an unique key.
$keys[] = $key;
$values[$key] = $row;
}
natsort($keys);
$sortedValues = array();
foreach($keys as $index) {
$sortedValues[] = $values[$index];
}
I hope MySQL will implement natural sorting in a future version, but the feature request (#1588) is open since 2003, So I wouldn't hold my breath.
A simplified non-udf version of the best response of @plaix/Richard Toth/Luke Hoggett, which works only for the first integer in the field, is
SELECT name,
LEAST(
IFNULL(NULLIF(LOCATE('0', name), 0), ~0),
IFNULL(NULLIF(LOCATE('1', name), 0), ~0),
IFNULL(NULLIF(LOCATE('2', name), 0), ~0),
IFNULL(NULLIF(LOCATE('3', name), 0), ~0),
IFNULL(NULLIF(LOCATE('4', name), 0), ~0),
IFNULL(NULLIF(LOCATE('5', name), 0), ~0),
IFNULL(NULLIF(LOCATE('6', name), 0), ~0),
IFNULL(NULLIF(LOCATE('7', name), 0), ~0),
IFNULL(NULLIF(LOCATE('8', name), 0), ~0),
IFNULL(NULLIF(LOCATE('9', name), 0), ~0)
) AS first_int
FROM table
ORDER BY IF(first_int = ~0, name, CONCAT(
SUBSTR(name, 1, first_int - 1),
LPAD(CAST(SUBSTR(name, first_int) AS UNSIGNED), LENGTH(~0), '0'),
SUBSTR(name, first_int + LENGTH(CAST(SUBSTR(name, first_int) AS UNSIGNED)))
)) ASC
Also there is natsort. It is intended to be a part of a drupal plugin, but it works fine stand-alone.
I know this topic is ancient but I think I've found a way to do this:
SELECT * FROM `table` ORDER BY
CONCAT(
GREATEST(
LOCATE('1', name),
LOCATE('2', name),
LOCATE('3', name),
LOCATE('4', name),
LOCATE('5', name),
LOCATE('6', name),
LOCATE('7', name),
LOCATE('8', name),
LOCATE('9', name)
),
name
) ASC
Scrap that, it sorted the following set incorrectly (It's useless lol):
Final Fantasy 1 Final Fantasy 2 Final Fantasy 5 Final Fantasy 7 Final Fantasy 7: Advent Children Final Fantasy 12 Final Fantasy 112 FF1 FF2