Question

I have a bunch of documents in a MarkLogic xml database. One document has:

<colors>
  <color>red</color>
  <color>red</color>
</colors>

Having multiple colors is not a problem. Having multiple colors that are both red is a problem. How do I find the documents that have duplicate data?

Was it helpful?

Solution

Everything MarkLogic returns is just a sequence of nodes, so we can count the sequence size of the whole and compare it to the count of the sequence of distinct values. If they're not distinct, they're duplicate, and you have your subset.

for $c in doc()//colors
where fn:count($c/color) != fn:count(fn:distinct-values($c/color))
return $c

OTHER TIPS

Or you could do it completely out of indexes :)

for $c in doc()//colors is likely to create an EXPANDED TREE CACHE error on larger data sets.

Here is a slightly more complicated way to attack this when the data is huge, make sure the URI Lexicon is turned on and then add a element range index on the element color and compute the distinct color values that have duplication somewhere. Then loop over only the documents that have this color one by one and compute the item-frequency counts of the colors of interest in the documents. If you get a frequency over 1, this document needs de-duplication.

let $qn := xs:QName("color")
let $colorsWithItemFreq := cts:element-values($qn, (), ("ascending", "item-order", "item-frequency"))
let $colorsOfInterest := 
    for $color at $i in cts:element-values($qn, (), ("ascending", "item-order", "fragment-frequency"))
    let $fragFrequency := cts:frequency($color)
    let $itemFrequency := cts:frequency($colorsWithItemFreq[$i])
    where $itemFrequency gt $fragFrequency
    return 
        $color

for $uri in cts:uris( (), ("document"), cts:element-value-query($qn, $colorsOfInterest)
let $colorsWithDuplicationInThisDoc :=
    for $color in cts:element-values($qn, (), ("item-frequency"), cts:document-query($uri) )
    where $color = $colorsOfInterest and cts:frequency($color) gt 1
    return
        $color
where fn:count( $colorsWithDuplicationInThisDoc ) gt 1
return
    $uri

Hope that helps.

For this XML:

<?xml version="1.0"?>
<colors>
    <color>Red</color>
    <color>Red</color>
    <color>Blue</color>
</colors>

Using this XSD:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <xsl:output method = "text" />  
    <xsl:strip-space elements="*"/>

    <xsl:template match="colors">

        <xsl:for-each select="color">
            <xsl:variable name="node_color" select="text()"/>
            <xsl:variable name="numEntries" select="count(../color[text()=$node_color])"/>
            <xsl:if test="$numEntries &gt; 1">
                <xsl:text>Color value of </xsl:text><xsl:value-of select="."/><xsl:text> has multiple entries &#xa;</xsl:text>      
            </xsl:if>
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>

I got this output:

Color value of Red has multiple entries 
Color value of Red has multiple entries 

So that will at least find them, but it will report each occurrence of a repeated color, not just every repeated color.

This should do the trick. I am not too familiar with MarkLogic, so the first line to get the set of documents may be wrong. This will return all documents which have 2 or more color elements with the same string value.

for $doc in doc()
let $colors = $doc//color/string(.)
where some $color in $colors
      satisfies count($colors[. = $color] > 1)
return doc()
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top