Humanized or natural number sorting of mixed word-and-number strings

https://stackoverflow.com/questions/12965463

09-07-2021
|

Question

Following up on this question by Sivaram Chintalapudi, I'm interested in whether it's practical in PostgreSQL to do natural - or "humanized" - sorting " of strings that contain a mixture of multi-digit numbers and words/letters. There is no fixed pattern of words and numbers in the strings, and there may be more than one multi-digit number in a string.

The only place I've seen this done routinely is in the Mac OS's Finder, which sorts filenames containing mixed numbers and words naturally, placing "20" after "3", not before it.

The collation order desired would be produced by an algorithm that split each string into blocks at letter-number boundaries, then ordered each part, treating letter-blocks with normal collation and number-blocks as integers for collation purposes. So:

'AAA2fred' would become ('AAA',2,'fred') and 'AAA10bob' would become ('AAA',10,'bob'). These can then be sorted as desired:

regress=# WITH dat AS ( VALUES ('AAA',2,'fred'), ('AAA',10,'bob') )
regress-# SELECT dat FROM dat ORDER BY dat;
     dat      
--------------
 (AAA,2,fred)
 (AAA,10,bob)
(2 rows)

as compared to the usual string collation ordering:

regress=# WITH dat AS ( VALUES ('AAA2fred'), ('AAA10bob') )
regress-# SELECT dat FROM dat ORDER BY dat;
    dat     
------------
 (AAA10bob)
 (AAA2fred)
(2 rows)

However, the record comparison approach doesn't generalize because Pg won't compare ROW(..) constructs or records of unequal numbers of entries.

Given the sample data in this SQLFiddle the default en_AU.UTF-8 collation produces the ordering:

1A, 10A, 2A, AAA10B, AAA11B, AAA1BB, AAA20B, AAA21B, X10C10, X10C2, X1C1, X1C10, X1C3, X1C30, X1C4, X2C1

but I want:

1A, 2A, 10A, AAA1BB, AAA10B, AAA11B, AAA20B, AAA21B, X1C1, X1C3, X1C4, X1C10, X1C30, X2C1, X10C10, X10C2

I'm working with PostgreSQL 9.1 at the moment, but 9.2-only suggestions would be fine. I'm interested in advice on how to achieve an efficient string-splitting method, and how to then compare the resulting split data in the alternating string-then-number collation described. Or, of course, on entirely different and better approaches that don't require splitting strings.

PostgreSQL doesn't seem to support comparator functions, otherwise this could be done fairly easily with a recursive comparator and something like ORDER USING comparator_fn and a comparator(text,text) function. Alas, that syntax is imaginary.

Update: Blog post on the topic.

Solution

Building on your test data, but this works with arbitrary data. This works with any number of elements in the string.

CREATE TYPE ai AS (a text, i int);

The trick is to form an array of ai from each value in the column.

regexp_matches() with the pattern (\D*)(\d*) and the g option returns one row for every combination of letters and numbers. Plus one irrelevant dangling row with two empty strings '{"",""}' Filtering or suppressing it would just add cost. Aggregate this into an array, after replacing empty strings ('') with 0 in the integer component (as '' cannot be cast to integer).

NULL values sort first - or you have to special case them - or use the whole shebang in a STRICT function like @Craig proposes.

Postgres 9.4 or later

SELECT data
FROM   alnum
ORDER  BY ARRAY(SELECT ROW(x[1], CASE x[2] WHEN '' THEN '0' ELSE x[2] END)::ai
                FROM regexp_matches(data, '(\D*)(\d*)', 'g') x)
        , data;

db<>fiddle here

Postgres 9.1 (original answer)

Tested with PostgreSQL 9.1.5, where regexp_replace() had a slightly different behavior.

SELECT data
FROM  (
    SELECT ctid, data, regexp_matches(data, '(\D*)(\d*)', 'g') AS x
    FROM   alnum
    ) x
GROUP  BY ctid, data   -- ctid as stand-in for a missing pk
ORDER  BY regexp_replace (left(data, 1), '[0-9]', '0')
        , array_agg(ROW(x[1], CASE x[2] WHEN '' THEN '0' ELSE x[2] END)::ai)
        , data         -- for special case of trailing 0

Add regexp_replace (left(data, 1), '[1-9]', '0') as first ORDER BY item to take care of leading digits and empty strings.

~~If special characters like {}()"', can occur, you'd have to escape those accordingly.~~
@Craig's suggestion to use a ROW expression takes care of that.

~~BTW, this won't execute in sqlfiddle, but it does in my db cluster. JDBC is not up to it. sqlfiddle complains:~~

~~Method org.postgresql.jdbc3.Jdbc3Array.getArrayImpl(long,int,Map) is not yet implemented.~~

This has since been fixed: http://sqlfiddle.com/#!17/fad6e/1

OTHER TIPS

I faced this same problem, and I wanted to wrap the solution in a function so I could re-use it easily. I created the following function to achieve a 'human style' sort order in Postgres.

CREATE OR REPLACE FUNCTION human_sort(text)
  RETURNS text[] AS
$BODY$   
  /* Split the input text into contiguous chunks where no numbers appear,
     and contiguous chunks of only numbers. For the numbers, add leading 
     zeros to 20 digits, so we can use one text array, but sort the 
     numbers as if they were big integers.

       For example, human_sort('Run 12 Miles') gives
            {'Run ', '00000000000000000012', ' Miles'}
  */
  select array_agg(
    case
      when a.match_array[1]::text is not null 
        then a.match_array[1]::text         
      else lpad(a.match_array[2]::text, 20::int, '0'::text)::text                                      
    end::text)
    from (
      select regexp_matches(
        case when $1 = '' then null else $1 end, E'(\\D+)|(\\d+)', 'g'
      ) AS match_array      
    ) AS a  
$BODY$
  LANGUAGE sql IMMUTABLE;

tested to work on Postgres 8.3.18 and 9.3.5

No recursion, should be faster than recursive solutions
Can be used in just the order by clause, don't have to deal with primary key or ctid
Works for any select (don't even need a PK or ctid)
Simpler than some other solutions, should be easier to extend and maintain
Suitable for use in a functional index to improve performance
Works on Postgres v8.3 or higher
Allows an unlimited number of text/number alternations in the input
Uses just one regex, should be faster than versions with multiple regexes
Numbers longer than 20 digits are ordered by their first 20 digits

Here's an example usage:

select * from (values 
  ('Books 1', 9),
  ('Book 20 Chapter 1', 8),
  ('Book 3 Suffix 1', 7),
  ('Book 3 Chapter 20', 6),
  ('Book 3 Chapter 2', 5),
  ('Book 3 Chapter 1', 4),
  ('Book 1 Chapter 20', 3),
  ('Book 1 Chapter 3', 2),
  ('Book 1 Chapter 1', 1),
  ('', 0),
  (null::text, 0)
) as a(name, sort)
order by human_sort(a.name)
-----------------------------
|name               |  sort |
-----------------------------
|                   |   0   |
|                   |   0   |
|Book 1 Chapter 1   |   1   |
|Book 1 Chapter 3   |   2   |
|Book 1 Chapter 20  |   3   |
|Book 3 Chapter 1   |   4   |
|Book 3 Chapter 2   |   5   |
|Book 3 Chapter 20  |   6   |
|Book 3 Suffix 1    |   7   |
|Book 20 Chapter 1  |   8   |
|Books 1            |   9   |
-----------------------------

Adding this answer late because it looked like everyone else was unwrapping into arrays or some such. Seemed excessive.

CREATE FUNCTION rr(text,int) RETURNS text AS $$
SELECT regexp_replace(
    regexp_replace($1, '[0-9]+', repeat('0',$2) || '\&', 'g'), 
    '[0-9]*([0-9]{' || $2 || '})', 
    '\1', 
    'g'
)
$$ LANGUAGE sql;

SELECT t,rr(t,9) FROM mixed ORDER BY t;
      t       |             rr              
--------------+-----------------------------
 AAA02free    | AAA000000002free
 AAA10bob     | AAA000000010bob
 AAA2bbb03boo | AAA000000002bbb000000003boo
 AAA2bbb3baa  | AAA000000002bbb000000003baa
 AAA2fred     | AAA000000002fred
(5 rows)

(reverse-i-search)`OD': SELECT crypt('richpass','$2$08$aJ9ko0uKa^C1krIbdValZ.dUH8D0R0dj8mqte0Xw2FjImP5B86ugC');
richardh=> 
richardh=> SELECT t,rr(t,9) FROM mixed ORDER BY rr(t,9);
      t       |             rr              
--------------+-----------------------------
 AAA2bbb3baa  | AAA000000002bbb000000003baa
 AAA2bbb03boo | AAA000000002bbb000000003boo
 AAA2fred     | AAA000000002fred
 AAA02free    | AAA000000002free
 AAA10bob     | AAA000000010bob
(5 rows)

I'm not claiming two regexps are the most efficient way to do this, but rr() is immutable (for fixed length) so you can index it. Oh - this is 9.1

Of course, with plperl you could just evaluate the replacement to pad/trim it in one go. But then with perl you've always got just-one-more-option (TM) than any other approach :-)

The following function splits a string into an array of (word,number) pairs of arbitrary length. If the string begins with a number then the first entry will have a NULL word.

CREATE TYPE alnumpair AS (wordpart text,numpart integer);

CREATE OR REPLACE FUNCTION regexp_split_numstring_depth_pairs(instr text)
RETURNS alnumpair[] AS $$
WITH x(match) AS (SELECT regexp_matches($1, '(\D*)(\d+)(.*)'))
SELECT
  ARRAY[(CASE WHEN match[1] = '' THEN '0' ELSE match[1] END, match[2])::alnumpair] || (CASE 
  WHEN match[3] = '' THEN
    ARRAY[]::alnumpair[]
  ELSE 
    regexp_split_numstring_depth_pairs(match[3]) 
  END)
FROM x;$$ LANGUAGE 'sql' IMMUTABLE;

allowing PostgreSQL's composite type sorting to come into play:

SELECT data FROM alnum ORDER BY regexp_split_numstring_depth_pairs(data);

and producing the expected result, as per this SQLFiddle. I've adopted Erwin's substitution of 0 for the empty string in all strings beginning with a number so that numbers sort first; it's cleaner than using ORDER BY left(data,1), regexp_split_numstring_depth_pairs(data).

While the function is probably horrifically slow it can at least be used in an expression index.

That was fun!

create table dat(val text)
insert into dat ( VALUES ('BBB0adam'), ('AAA10fred'), ('AAA2fred'), ('AAA2bob') );

select 
  array_agg( case when z.x[1] ~ E'\\d' then lpad(z.x[1],10,'0') else z.x[1] end ) alnum_key
from (
  SELECT ctid, regexp_matches(dat.val, E'(\\D+|\\d+)','g') as x
  from dat
) z
group by z.ctid
order by alnum_key;

       alnum_key       
-----------------------
 {AAA,0000000002,bob}
 {AAA,0000000002,fred}
 {AAA,0000000010,fred}
 {BBB,0000000000,adam}

Worked on this for almost an hour and posted without looking -- I see Erwin arrived at a similar place. Ran into the same "could not find array type for data type text[]" trouble as @Clodoaldo. Had a lot of trouble getting the cleanup exercise to not agg all the rows until I thought of grouping by the ctid (which feels like cheating really -- and doesn't work on a psuedo table as in the OP example WITH dat AS ( VALUES ('AAA2fred'), ('AAA10bob') ) ...). It would be nicer if array_agg could accept a set-producing subselect.

I'm not a RegEx guru, but I can work it to some extent. Enough to produce this answer.

It will handle up to 2 numeric values within the content. I don't think OSX goes further than that, if it even handles 2.

WITH parted AS (
  select data,
         substring(data from '([A-Za-z]+).*') part1,
         substring('a'||data from '[A-Za-z]+([0-9]+).*') part2,
         substring('a'||data from '[A-Za-z]+[0-9]+([A-Za-z]+).*') part3,
         substring('a'||data from '[A-Za-z]+[0-9]+[A-Za-z]+([0-9]+).*') part4
    from alnum
)
  select data
    from parted
order by part1,
         cast(part2 as int),
         part3,
         cast(part4 as int),
         data;

SQLFiddle

The following solution is a combination of various ideas presented in other answers, as well as some ideas from the classic solution:

create function natsort(s text) returns text immutable language sql as $$
  select string_agg(r[1] || E'\x01' || lpad(r[2], 20, '0'), '')
  from regexp_matches(s, '(\D*)(\d*)', 'g') r;
$$;

The design goals of this function were simplicity and pure string operations (no custom types and no arrays), so it can easily be used as a drop-in solution, and is trivial to be indexed over.

Note: If you expect numbers with more than 20 digits, you'll have to replace the hard-coded maximum length 20 in the function with a suitable larger length. Note that this will directly affect the length of the resulting strings, so don't make that value larger than needed.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow