Haskell: surprising behavior of “groupBy”

https://stackoverflow.com/questions/1316365

19-09-2019
|

Question

I'm trying to figure out the behavior of the library function groupBy (from Data.List), which purports to group elements of a list by an "equality test" function passed in as the first argument. The type signature suggests that the equality test just needs to have type

(a -> a -> Bool)

However, when I use (<) as the "equality test" in GHCi 6.6, the results are not what I expect:

ghci> groupBy (<) [1, 2, 3, 2, 4, 1, 5, 9]
[[1,2,3,2,4],[1,5,9]]

Instead I'd expect runs of strictly increasing numbers, like this:

[[1,2,3],[2,4],[1,5,9]]

What am I missing?

Solution

Have a look at the ghc implementation of groupBy:

groupBy                 :: (a -> a -> Bool) -> [a] -> [[a]]
groupBy _  []           =  []
groupBy eq (x:xs)       =  (x:ys) : groupBy eq zs
                           where (ys,zs) = span (eq x) xs

Now compare these two outputs:

Prelude List> groupBy (<) [1, 2, 3, 2, 4, 1, 5, 9]
[[1,2,3,2,4],[1,5,9]]
Prelude List> groupBy (<) [8, 2, 3, 2, 4, 1, 5, 9]
[[8],[2,3],[2,4],[1,5,9]]

In short, what happens is this: groupBy assumes that the given function (the first argument) tests for equality, and thus assumes that the comparison function is reflexive, transitive and symmetric (see equivalence relation). The problem here is that the less-than relation is not reflexive, nor symmetric.

Edit: The following implementation only assumes transitivity:

groupBy' :: (a -> a -> Bool) -> [a] -> [[a]]
groupBy' _   []                        = []
groupBy' _   [x]                       = [[x]]
groupBy' cmp (x:xs@(x':_)) | cmp x x'  = (x:y):ys
                           | otherwise = [x]:r
  where r@(y:ys) = groupBy' cmp xs

OTHER TIPS

The fact that "<" isn't an equality test.

You might expect some behavior because you'd implement it differently, but that isn't what it promises.

An example of why what it outputs is a reasonable answer is if it sweeps through it, doing

[1, 2, 3, 2, 4, 1, 5, 9] ->
[[1,2,3], [2,4], [1,5,9]]

Now has 3 groups of equal elements. So it checks if any of them are in fact the same:

Since it knows all elements in each group is equal, it can just look at the first element in each, 1, 2 and 1.

1 > 2? Yes! So it merges the first two groups.

1 > 1? No! So it leaves the last group be.

And now it's compared all elements for equality.

...only, you didn't pass it the kind of function it expected.

In short, when it wants an equality test, give it an equality test.

The problem is that the reference implementation of groupBy in the Haskell Report compares elements against the first element, so the groups are not strictly increasing (they just have to be all bigger than the first element). What you want instead is a version of groupBy that tests on adjacent elements, like the implementation here.

I'd just like to point out that the groupBy function also requires your list to be sorted before being applied.

For example:

equalityOp :: (a, b1) -> (a, b2) -> Bool
equalityOp x y = fst x == fst y

testData = [(1, 2), (1, 4), (2, 3)]

correctAnswer = groupBy equalityOp testData == [[(1, 2), (1, 4)], [(2, 3)]]

otherTestData = [(1, 2), (2, 3), (1, 4)]

incorrectAnswer = groupBy equalityOp otherTestData == [[(1, 2)], [(2, 3)], [(1, 4)]]

This behaviour comes about because groupBy is using span in its definition. To get reasonable behaviour which doesn't rely on us having the underlying list in any particular order we can define a function:

groupBy' :: (a -> a -> Bool) -> [a] -> [[a]]
groupBy' eq []     = []
groupBy' eq (x:xs) = (x:similarResults) : (groupBy' eq differentResults)
    where similarResults   = filter (eq x) xs
          differentResults = filter (not . eq x) xs

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow