Are the results of the STRING_SPLIT() function returned in a deterministic order?

https://dba.stackexchange.com/questions/282790

13-03-2021
|

Question

I need to split a comma delimited string, manipulate it, and then concatenate it back into a single string retaining the original order of the data (if possible).

For example, take a column definition list of a CREATE TABLE statement (as a string) like so 'BrentOzarColumn INTEGER, PaulWhiteColumn DATETIME, ErikDarlingColumn VARCHAR(100)'. I'd like to split the comma delimited list to a result set, such as using SQL Server's built in function STRING_SPLIT() like so: SELECT TRIM([Value]) AS CoolDataPeople FROM STRING_SPLIT('BrentOzarColumn INTEGER, PaulWhiteColumn DATETIME, ErikDarlingColumn VARCHAR(100)', ',').

Without specifying an ORDER BY clause, this repeatedly yields (by coincedence?) the following results that appear to be ordered by their same ordering as they are in the string:

Once I have the result set above, I want to apply some additional string manipulations to each row (such as append some constant text), and then concatenate every row back with a function like STRING_AGG() (goodbye days of STUFF ... FOR XML PATH :) in the same order as the original string. So an example of my final result could be 'BrentOzarColumn INTEGER SQLROX, PaulWhiteColumn DATETIME SQLROX, ErikDarlingColumn VARCHAR(100) SQLROX'.

Ultimately my question is: Are the results of the STRING_SPLIT() function returned in a deterministic order? I know without an ORDER BY clause, ordering is not guaranteed when selecting from a dataset like a Table or View, but was wondering if there's a difference with functions?

As I type this out, I have a hunch the answer is no, the ordering is not deterministic therefor I'm not guaranteed the order of the results. Furthermore, I'm betting there's possible additional nondeterminism added for each function I run on top of the results, especially when I combine them back together with STRING_AGG(). (Regardless of the answer, I appreciate your help, and you're all cool data people. ;)

Solution

No, they are not returned in a deterministic order.

While you are unlikely to see them returned in a different order, that doesn't make the current behavior deterministic or reliable. Tricks like applying ROW_NUMBER() OVER (ORDER BY (SELECT 1)) to the output in an intermediate place like a CTE are similarly not guaranteed to work.

It's one of the features notably missing from STRING_SPLIT(), which I've blogged about:

And so has Andy Mallon:

When STRING_SPLIT() falls short

Vote and comment here:

Add row position column to STRING_SPLIT

(This item specifically asks for an additional column to be returned to indicate position within the original string but, due to backward compatibility issues, would likely need to be delivered via a new function, similar to how CONCAT_WS came to be.)

As an aside, the documentation originally stated:

The sort order of the output rows matches the order of the substrings in the input string.

That was deliberately changed in this commit to remove any notion that the return order is promised. Now the documentation states:

The output rows might be in any order. The order is not guaranteed to match the order of the substrings in the input string.

Why would they need to remove that promise?

I am not familiar with the internal implementation, but I believe the original version of the documentation was written by observers, not the function authors. That statement was likely written that way because that's what behavior they observed when they used the function. Many of us do the same thing when we tell people we don't need ORDER BY when selecting rows from a clustered table: "they will always come out in this order." Pretty reliable, until the optimizer chooses a different index.

My guess is they are covering their butts for future behavior. Think about all the changes they've been making to how functions work, and also ongoing changes to the optimizer.

Currently, with the simple case:

SELECT * FROM STRING_SPLIT('cow,dog,dinosaur','');

There isn't much opportunity or reason for that to be anything but sequential. But what about when you OUTER APPLY against an nvarchar(max) column between two huge, partitioned tables, and the data starts getting processed in batch mode and/or in parallel? Would they have to write additional logic in the code to ensure all that output came back in the right order? Would it be worth it? Would they have to revisit it for every future function processing / optimizer change to maintain that promise?

Licensed under: CC-BY-SA with attribution

Not affiliated with dba.stackexchange