UCA + Natural Sorting

https://stackoverflow.com/questions/5056586

15-11-2019

Question

I recently learnt that PHP already supports the Unicode Collation Algorithm via the intl extension:

$array = array
(
    'al', 'be',
    'Alpha', 'Beta',
    'Álpha', 'Àlpha', 'Älpha',
    'かたかな',
    'img10.png', 'img12.png',
    'img1.png', 'img2.png',
);

if (extension_loaded('intl') === true)
{
    collator_asort(collator_create('root'), $array);
}

Array
(
    [0] => al
    [2] => Alpha
    [4] => Álpha
    [5] => Àlpha
    [6] => Älpha
    [1] => be
    [3] => Beta
    [11] => img1.png
    [9] => img10.png
    [8] => img12.png
    [10] => img2.png
    [7] => かたかな
)

As you can see, this seems to work perfectly, even with mixed-case strings! The only drawback I've encountered so far is the lack of support for natural sorting, and I'm wondering what the best way to work around that would be, so that I can get the best of both worlds.

I tried specifying the Collator::SORT_NUMERIC sort flag, but the result is much messier:

collator_asort(collator_create('root'), $array, Collator::SORT_NUMERIC);

Array
(
    [8] => img12.png
    [7] => かたかな
    [9] => img10.png
    [10] => img2.png
    [11] => img1.png
    [6] => Älpha
    [5] => Àlpha
    [1] => be
    [2] => Alpha
    [3] => Beta
    [4] => Álpha
    [0] => al
)

However, if I run the same test with only the img*.png values, I get the ideal output:

Array
(
    [3] => img1.png
    [2] => img2.png
    [1] => img10.png
    [0] => img12.png
)

Can anyone think of a way to preserve the Unicode sorting while adding natural sorting capabilities?

Solution

After digging a little more into the documentation, I found the solution:

if (extension_loaded('intl') === true)
{
    if (is_object($collator = collator_create('root')) === true)
    {
        $collator->setAttribute(Collator::NUMERIC_COLLATION, Collator::ON);
        $collator->asort($array);
    }
}

Output:

Array
(
    [0] => al
    [3] => Alpha
    [5] => Álpha
    [6] => Àlpha
    [7] => Älpha
    [1] => be
    [4] => Beta
    [10] => img1.png
    [11] => img2.png
    [8] => img10.png
    [9] => img12.png
    [2] => かたかな
)
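
For reuse, the same idea can be wrapped in a small function. This is only a sketch: the name naturalCollatorSort is made up for illustration, and the strnatcasecmp fallback for installations without intl is an assumption of mine (it compares bytes, so it is not Unicode-aware). Note that Collator::NUMERIC_COLLATION is an attribute set on the collator itself, whereas Collator::SORT_NUMERIC is a flag passed to asort() that the manual describes as comparing items numerically.

```php
<?php
// Hypothetical helper (not part of intl): natural, collation-aware sort that keeps keys.
function naturalCollatorSort(array $items, string $locale = 'root'): array
{
    if (extension_loaded('intl')) {
        $collator = new Collator($locale);
        // Treat runs of digits as numbers, so "img2" sorts before "img10".
        $collator->setAttribute(Collator::NUMERIC_COLLATION, Collator::ON);
        $collator->asort($items);

        return $items;
    }

    // Assumed fallback without intl: byte-wise natural order, not Unicode-aware.
    uasort($items, 'strnatcasecmp');

    return $items;
}

$files = array('img10.png', 'img12.png', 'img1.png', 'img2.png');
print_r(naturalCollatorSort($files));
```

Because asort() and uasort() both preserve keys, the result can still be matched back to the original array positions.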

OTHER TIPS

This is easy to do: preprocess the list so that the numbers are zero-padded. For example, take this list of filenames:

% cat /tmp/numfiles
img4.png
img1.png
img2.png
img12.png
img21.png
img10.png
img20.png
img3.png
img22.png

My ucsort script, which supports the UCA, produces the desired output when the Unicode::Collate module’s --preprocess hook is used to transform runs of digits into zero-padded ones:

% ucsort --preprocess='s/(\d+)/sprintf "%020d", $1/ge' /tmp/numfiles
img1.png
img2.png
img3.png
img4.png
img10.png
img12.png
img20.png
img21.png
img22.png
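
To make the padding step concrete, here is a rough PHP rendering of that substitution. It is a sketch under stated assumptions: padDigitRuns is a hypothetical name, the width of 20 simply mirrors the %020d above, and a plain byte-wise asort() stands in for the collator so that the example runs without any extension.

```php
<?php
// Hypothetical helper: left-pad every run of ASCII digits to a fixed width.
function padDigitRuns(string $value, int $width = 20): string
{
    return preg_replace_callback(
        '/[0-9]+/',
        function (array $matches) use ($width) {
            // Pad the digit string itself, so long runs are never converted to an integer.
            return str_pad($matches[0], $width, '0', STR_PAD_LEFT);
        },
        $value
    );
}

$names = array('img10.png', 'img2.png', 'img1.png');

// Sort the padded keys, then read the original names back in that order.
$keys = array_map('padDigitRuns', $names);
asort($keys, SORT_STRING); // byte-wise comparison stands in for the collator here

$sorted = array();
foreach (array_keys($keys) as $index) {
    $sorted[] = $names[$index];
}

print_r($sorted); // img1.png, img2.png, img10.png
```

The padded strings are used only as sort keys; the values that come back are the original, unpadded names.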

Looking at the PHP documentation you cite, the PHP library does not appear to support the full range of UCA tailoring that the Perl Unicode::Collate module offers. In fact, it looks more like Perl’s Unicode::Collate::Locale module, except that the PHP library does not seem to support the inherited collation options that the Perl code does.

I suppose that if all else fails, you could call Perl code to do what needs to be done.

Based on @tchrist's answer, I came up with this:

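function sortIntl($array, $natural = true)
{
    $data = $array;

    if ($natural === true)
    {
        // Zero-pad every run of digits so that collation order matches numeric order.
        $data = preg_replace_callback('~[0-9]+~', 'natsortIntl', $data);
    }

    collator_asort(collator_create('root'), $data);

    // $data is now in sorted order; map each key back to its original, unpadded value.
    // (array_intersect_key() would keep the order of $array, i.e. the unsorted input.)
    $result = array();

    foreach (array_keys($data) as $key)
    {
        $result[$key] = $array[$key];
    }

    return $result;
}

function natsortIntl($matches)
{
    // The callback receives the match array; $matches[0] is the run of digits.
    // Note: digit runs beyond PHP_INT_MAX cannot be represented exactly by %d.
    return sprintf('%020d', $matches[0]);
}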

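Output, for a test array that also contains '1', '100' and 'img20.png' (the keys are listed in their original order):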

Array
(
    [0] => 1
    [1] => 100
    [2] => al
    [3] => be
    [4] => Alpha
    [5] => Beta
    [6] => Álpha
    [7] => Àlpha
    [8] => Älpha
    [9] => かたかな
    [10] => img1.png
    [11] => img2.png
    [12] => img10.png
    [13] => img20.png
)

Still hoping for a better solution though.

Licensed under: CC-BY-SA with attribution
