How to apply an aggregating function to groups of table (or dataset) rows?

https://stackoverflow.com/questions/21616825

08-10-2022
|

Question

Suppose I have some table (or dataset) object A that includes some factor X (having n distinct values, or "levels") among its variables. Also suppose I have some custom aggregating function agg that takes as input a sequence of rows, and returns a single summary number (aka "aggregate") as output.

An example of an extremely common operation in statistically data analysis would be to group the rows of A according to the value of the X factor, and to apply agg to the arrays consisting of the rows in the group. The result of such an operation should be a new table (or dataset, depending on the class of A), with n rows and two variables. The first one of these variables should be named X, and should contain the n distinct values of the X factor in A, and the second should have some suitable (preferably user-specified) name, and should contain the result of applying agg to the row groups for the corresponding X levels.

As I said, this is a very standard operation to perform on data structures such as MATLAB's table and dataset objects, so I was expecting there'd be a built-in way to do it, but I'm not finding it.

For example, let A be as defined below:

% "data" table
A = cell2table({
                'even', 'red', 'spades', 38, 0.9837;
                'even', 'red', 'hearts', 19, 0.5695;
                'even', 'red', 'diamonds', 89, 0.2629;
                'even', 'red', 'diamonds', 98, 0.3578;
                'even', 'red', 'diamonds', 92, 0.2596;
                'even', 'red', 'diamonds', 69, 0.5751;
                'even', 'red', 'diamonds', 77, 0.6318;
                'even', 'yellow', 'clubs', 22, 0.6917;
                'even', 'green', 'spades', 35, 0.6674;
                'even', 'green', 'hearts', 67, 0.7896;
                'even', 'green', 'hearts', 49, 0.5025;
                'even', 'green', 'hearts', 64, 0.5318;
                'odd', 'red', 'spades', 22, 0.5587;
                'odd', 'red', 'hearts', 51, 0.9122;
                'odd', 'red', 'diamonds', 74, 0.3343;
                'odd', 'red', 'diamonds', 69, 0.2911;
                'odd', 'yellow', 'spades', 33, 0.2653;
                'odd', 'yellow', 'spades', 38, 0.2549;
                'odd', 'yellow', 'diamonds', 1, 0.2064;
                'odd', 'yellow', 'diamonds', 25, 0.8257;
                'odd', 'green', 'spades', 64, 0.4348;
                'odd', 'green', 'hearts', 59, 0.8644;
                'odd', 'green', 'hearts', 4, 0.6374;
                'odd', 'green', 'hearts', 11, 0.3354
               }, 'VariableNames', ...
               {'Parity', 'TrafficLight', 'Suit', 'order', 'prevalence'});

Also, let X be TrafficLight, and agg be

agg = @(t) size(t, 1);

(Of course, I'm using this agg here just to keep the example as simple as possible. In practice agg will be far less simple-minded.)

The typical group_aggregate function I'm thinking of usually takes as input arguments (in some order) an aggregating function, the name of the column in the output for the computed aggregates, a table (or dataset), and the names of one or more variables to group by. Therefore, in this example, the call to such a function, and its output, would look something like this

>> group_aggregate(agg, 'nrows', A, {'TrafficLight'})

ans =

    TrafficLight    nrows
    ____________    _____

    'green'          8   
    'red'           11   
    'yellow'         5

BTW, to get the result above, I conjured up this desperate little beast:

>> tmp = cellfun(@(s) {s agg(A(strcmp(A.TrafficLight, s), :))}, ...
unique(A.TrafficLight), 'un', 0);
>> cell2table(cat(1, tmp{:}), 'VariableNames', {'TrafficLight' 'nrows'})

I hope that a built-in solution is more robust to, for example, different classes of values for the X variable, etc.

Solution

I don't know of a table method just for this purpose, but for aggregation, look to accumarray:

>> [lights,ia,ic]=unique(A.TrafficLight);
>> nrows = accumarray(ic,1);
>> cell2table([lights num2cell(nrows)],'VariableNames', {'TrafficLight' 'nrows'})
ans = 
    TrafficLight    nrows
    ____________    _____
    'green'          8   
    'red'           11   
    'yellow'         5

OTHER TIPS

I know the question is 1 year old...

I use a different way that accumarray. I use tabulate http://www.mathworks.com/help/stats/tabulate.html?s_tid=gn_loc_drop

I also still have an older version of Matlab and I use therefore dataset. In this case (but I think it might also work for 'table'), your code would be:

TrafficLight_stats=tabulate(A.TrafficLight);
Agg_table=mat2dataset(TrafficLight_stats,'VarNames',{'TrafficLight','nrows','Perc'});

Agg_table = 

TrafficLight    nrows       Perc         
'red'           [11]        [45.8333]
'yellow'        [ 5]        [20.8333]
'green'         [ 8]        [33.3333]

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow