XQuery: how to get distinct values of nodes

https://stackoverflow.com//questions/12693171

12-12-2019

Question

I'd like to know if there is an XQuery function similar to distinct-values that returns nodes instead of atomic values.

To be clearer: I have a bibliography, and for each author in it I want to list all the books that author wrote. In my case the author element looks like this:

<author>
  <last> Shakespear </last>
  <first> William </first>
</author>

Using distinct-values on author returns ShakespearWilliam, so as far as I can see it does not help. I'd like a function that preserves the structure of the author element while discarding duplicates.

If there is another way to write the query, please let me know. Does anyone have an idea?

Solution

XQuery 3.0 has a "group by" construct, which lets you, for example, group authors by the value of (first name, last name). Once you have grouped the nodes, you essentially have your answer: nodes are distinct if and only if they are in different groups, so one node taken from each group gives the distinct authors.

There are quite a few products around that implement this part of the XQuery 3.0 draft; Saxon 9.4 is one of them.
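
As a rough illustration of the grouping idea, here is a minimal sketch. It assumes an XQuery 3.0 processor; the $bib variable, the book and title elements, and the entry wrapper are hypothetical and not part of the question. Only the author structure is taken from it.

```xquery
xquery version "3.0";

(: Hypothetical sample data; only the author structure comes from the question. :)
declare variable $bib :=
  <bib>
    <book>
      <title>Hamlet</title>
      <author><last>Shakespear</last><first>William</first></author>
    </book>
    <book>
      <title>Macbeth</title>
      <author><last>Shakespear</last><first>William</first></author>
    </book>
    <book>
      <title>Emma</title>
      <author><last>Austen</last><first>Jane</first></author>
    </book>
  </bib>;

for $book in $bib/book, $author in $book/author
(: The grouping keys are the whitespace-normalized name parts. :)
group by $last := normalize-space($author/last),
         $first := normalize-space($author/first)
order by $last, $first
(: After grouping, $author and $book hold every member of the group. :)
return
  <entry>{ $author[1], $book/title }</entry>
```

Each entry should then contain one representative author element followed by the titles of all books in that author's group. Using normalize-space on the keys is a design choice so that " Shakespear " and "Shakespear" fall into the same group.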

OTHER TIPS

A problem with getting distinct nodes is deciding when two nodes count as distinct, which is a complex topic in XML. If the duplicates share the same node identity (i.e., they are references to the same node), you can use a function like functx:distinct-nodes(). Otherwise, you need some kind of hash to decide whether nodes are "equal enough" to be treated as equal, or you can compare them with deep-equal(), which will perform poorly on large datasets.
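
For small inputs, the deep-equal() route can be sketched as follows. The function name local:distinct-deep and the sample data are assumptions made for illustration. The comparison is quadratic in the number of nodes, which is why it does not scale.

```xquery
xquery version "1.0";

(: Keeps a node only if no earlier node in the sequence is deep-equal to it. :)
declare function local:distinct-deep($nodes as node()*) as node()*
{
  for $n at $i in $nodes
  where empty(subsequence($nodes, 1, $i - 1)[deep-equal(., $n)])
  return $n
};

(: Hypothetical sample data. :)
let $xml :=
  <authors>
    <author><last>Shakespear</last><first>William</first></author>
    <author><last>Austen</last><first>Jane</first></author>
    <author><last>Shakespear</last><first>William</first></author>
  </authors>
return local:distinct-deep($xml/author)
```

Note that deep-equal() compares text content exactly, so an author written as <last> Shakespear </last> with surrounding spaces would not match one written without them.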

If two <author>s are considered equal when their last and first names are the same, then you could use something as simple as concat(last,first) as a hash and get the distinct nodes using XPath:

$xml/author[index-of($xml/author/concat(last,first), concat(last,first))[1]]

This still isn't ideal, because the hashes are recomputed every time the predicate is evaluated, so it slows down on large datasets. One way to improve performance is to precompute the hashes in your data, for example:

<author hash="ShakespearWilliam">
  <last>Shakespear</last>
  <first>William</first>
</author>

and:

$xml/author[index-of($xml/author/@hash, @hash)[1]]
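
To see why the index-of expressions above keep exactly one node per hash, here is a self-contained sketch. The sample data and the $hashes variable are assumptions for illustration. Because the predicate evaluates to a number, it acts as a positional test: an author is kept only when its own position equals the first position at which its hash occurs.

```xquery
xquery version "1.0";

(: Hypothetical sample data with precomputed hash attributes. :)
let $xml :=
  <authors>
    <author hash="ShakespearWilliam"><last>Shakespear</last><first>William</first></author>
    <author hash="AustenJane"><last>Austen</last><first>Jane</first></author>
    <author hash="ShakespearWilliam"><last>Shakespear</last><first>William</first></author>
  </authors>
(: Bind the key sequence once so it is not rebuilt for every author. :)
let $hashes := $xml/author/@hash/string()
return
  (: index-of(...)[1] is the first position of this author's hash. :)
  $xml/author[index-of($hashes, @hash)[1]]
```

With this data the first and second authors should be kept and the third dropped. Binding $hashes in a let clause is only a hint; whether a processor would hoist the expression by itself is implementation-dependent, and index-of still scans the sequence once per author.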

If you can efficiently get ordered nodes by hash (ideally using an ordered database index), then there is a more efficient method of removing duplicates:

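(: Removes adjacent duplicates from a sequence already sorted by @hash.
   $first is the node currently held back (empty on the initial call);
   $rest holds the nodes not yet examined. :)
declare function local:nodupes($first, $rest)
{
    if (empty($rest)) then $first
    (: Same hash as the next node: drop $first and carry the next one forward. :)
    else if ($first/@hash eq $rest[1]/@hash)
    then local:nodupes($rest[1], subsequence($rest,2))
    (: Different hash: emit $first, then continue with the next node. :)
    else ($first, local:nodupes($rest[1], subsequence($rest,2)))
};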

Then call it with your ordered sequence, passing the empty sequence as the initial $first:

let $ordered :=
  for $a in $xml/author
  order by $a/@hash
  return $a
return 
  local:nodupes((),$ordered)
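
The recursive function above needs one call per node, which may exhaust the stack on very long sequences in processors that do not optimize the recursion. A non-recursive alternative, which is not the method above but a swapped-in variant, compares each node with its predecessor by position. The sketch assumes every author carries a hash attribute, and the sample data is hypothetical.

```xquery
xquery version "1.0";

(: Hypothetical sample data with precomputed hash attributes. :)
let $xml :=
  <authors>
    <author hash="ShakespearWilliam"><last>Shakespear</last><first>William</first></author>
    <author hash="AustenJane"><last>Austen</last><first>Jane</first></author>
    <author hash="ShakespearWilliam"><last>Shakespear</last><first>William</first></author>
  </authors>
let $ordered :=
  for $a in $xml/author
  order by $a/@hash
  return $a
return
  (: Keep a node when it is the first one or its hash differs from the previous node's. :)
  for $a at $i in $ordered
  where $i eq 1 or $a/@hash ne $ordered[$i - 1]/@hash
  return $a
```

Unlike local:nodupes, which keeps the last node of each run of equal hashes, this variant keeps the first. How cheap the positional lookup $ordered[$i - 1] is depends on the processor.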

Licensed under: CC-BY-SA with attribution
