In-Place Quicksort in matlab

https://stackoverflow.com/questions/7228613

15-01-2021
|

Вопрос

I wrote a small quicksort implementation in matlab to sort some custom data. Because I am sorting a cell-array and I need the indexes of the sort-order and do not want to restructure the cell-array itself I need my own implementation (maybe there is one available that works, but I did not find it).

My current implementation works by partitioning into a left and right array and then passing these arrays to the recursive call. Because I do not know the size of left and and right I just grow them inside a loop which I know is horribly slow in matlab.

I know you can do an in place quicksort, but I was warned about never modifying the content of variables passed into a function, because call by reference is not implemented the way one would expect in matlab (or so I was told). Is this correct? Would an in-place quicksort work as expected in matlab or is there something I need to take care of? What other hints would you have for implementing this kind of thing?

Решение

Implementing a sort on complex data in user M-code is probably going to be a loss in terms of performance due to the overhead of M-level operations compared to Matlab's builtins. Try to reframe the operation in terms of Matlab's existing vectorized functions.

Based on your comment, it sounds like you're sorting on a single-value key that's inside the structs in the cells. You can probably get a good speedup by extracting the sort key to a primitive numeric array and calling the builtin sort on that.

%// An example cell array of structs that I think looks like your input
c = num2cell(struct('foo',{'a','b','c','d'}, 'bar',{6 1 3 2}))
%// Let's say the "bar" field is what you want to sort on.
key = cellfun(@(s)s.bar, c) %// Extract the sort key using cellfun
[sortedKey,ix] = sort(key) %// Sort on just the key using fast numeric sort() builtin
sortedC = c(ix); %// ix is a reordering index in to c; apply the sort using a single indexing operation
reordering = cellfun(@(s)s.foo, sortedC)  %// for human readability of results

If you're sorting on multiple field values, extract all the m key values from the n cells to an n-by-m array, with columns in descending order of precedence, and use sortrows on it.

%// Multi-key sort
keyCols = {'bar','baz'};
key = NaN(numel(c), numel(keyCols));
for i = 1:numel(keyCols)
    keyCol = keyCols{i};
    key(:,i) = cellfun(@(s)s.(keyCol), c);
end
[sortedKey,ix] = sortrows(key);
sortedC = c(ix);
reordering = cellfun(@(s)s.foo, sortedC)

One of the keys to performance in Matlab is to get your data in primitive arrays, and use vectorized operations on those primitive arrays. Matlab code that looks like C++ STL code with algorithms and references to comparison functions and the like will often be slow; even if your code is good in O(n) complexity terms, the fixed cost of user-level M-code operations, especially on non-primitives, can be a killer.

Also, if your structs are homogeneous (that is, they all have the same set of fields), you can store them directly in a struct array instead of a cell array of structs, and it will be more compact. If you can do more extensive redesign, rearranging your data structures to be "planar-organized" - where you have a struct of arrays, reading across the ith elemnt of all the fields as a record, instead of an array of structs of scalar fields - could be a good efficiency win. Either of these reorganizations would make constructing the sort key array cheaper.

Другие советы

In this post, I only explain MATLAB function-calling convention, and am not discussing the quick-sort algorithm implementation.

When calling functions, MATLAB passes built-in data types by-value, and any changes made to such arguments are not visible outside the function.

function y = myFunc(x)
    x = x .* 2;         %# pass-by-value, changes only visible inside function
    y = x;
end

This could be inefficient for large data especially if they are not modified inside the functions. Therefore MATLAB internally implements a copy-on-write mechanism: for example when a vector is copied, only some meta-data is copied, while the data itself is shared between the two copies of the vector. And it is only when one of them is modified, that the data is actually duplicated.

function y = myFunc(x)
    %# x was never changed, thus passed-by-reference avoiding making a copy
    y = x .* 2;
end

Note that for cell-arrays and structures, only the cells/fields modified are passed-by-value (this is because cells/fields are internally stored separately), which makes copying more efficient for such data structures. For more information, read this blog post.

In addition, versions R2007 and upward (I think) detects in-place operations on data and optimizes such cases.

function x = myFunc(x)
    x = x.*2;
end

Obviously when calling such function, the LHS must be the same as the RHS (x = myFunc(x);). Also in order to take advantage of this optimization, in-place functions must be called from inside another function.

In MEX-functions, although it is possible to change input variables without making copies, it is not officially supported and might yield unexpected results...

For user-defined types (OOP), MATLAB introduced the concept of value object vs. handle object supporting reference semantics.

Лицензировано под: CC-BY-SA с атрибуция

Не связан с StackOverflow