Question

I need to search "ordered" XML files in which the text to retrieve is spread across several nodes, like this:

<root>
    <div id="1">Hello</div>
    <div id="2">Hel</div>
    <div id="3">lo dude</div>   
    <div id="4">H</div>
    <div id="5">el</div>
    <div id="6">lo</div>
</root>

The search has to be done on the concatenated text:

HelloHello dudeHello

But I need to be able to retrieve the nodes' attributes. For instance, when searching for 'll', I want to get these nodes:

<div id="1">Hello</div>
<div id="2">Hel</div>
<div id="3">lo dude</div>   
<div id="5">el</div>
<div id="6">lo</div>

or at least the ids.

Does anyone have an idea how to do this in XPath, or by any other means?

I think it's a bit challenging, and I have no (simple) idea at the moment. Thanks for your help.

EDIT: the fact that the text must be concatenated before the search is key information and needed to be stated explicitly!


Solution

Your updated requirements make the problem much more complex, as the "element wrap" can occur at arbitrary points inside the search token and can even span multiple elements. I don't think you will be able to write such a query in XPath < 3.0 (if it can be done in XPath alone at all). I used XQuery, which extends XPath. The code runs fine in BaseX and should also run in other XQuery engines (it may require XQuery 3.0; I didn't check).

The code got rather complex, but I think I put enough comments in there to make it comprehensible. It requires the text to continue in the immediately following sibling element, but with minor adjustments it can also be used to traverse arbitrary XML structures (think of HTML with <span/>s and other markup).

(: functx dependencies :)
declare namespace functx = "http://www.functx.com";
declare function functx:is-node-in-sequence 
  ( $node as node()? ,
    $seq as node()* )  as xs:boolean {

   some $nodeInSeq in $seq satisfies $nodeInSeq is $node
 } ;
declare function functx:distinct-nodes 
  ( $nodes as node()* )  as node()* {

    for $seq in (1 to count($nodes))
    return $nodes[$seq][not(functx:is-node-in-sequence(
                                .,$nodes[position() < $seq]))]
 } ;

declare function local:search( $elements as item()*, $pattern as xs:string) as item()* {
  functx:distinct-nodes(
    for $element in $elements
    return ($element[contains(./text(), $pattern)], local:start-search($element, $pattern))
  )
};

declare function local:start-search( $element as item(), $pattern as xs:string) as item()* {
    let $splits := (
      (: all possible prefixes of search token :)
      for $i in 1 to string-length($pattern) - 1
      (: check whether element text ends with prefix :)
      where ends-with($element/text(), substring($pattern, 1, $i))
      return $i
    )
    (: go on for all matching prefixes :)
    for $split in $splits
    return
      (: recursive call to next element :)
      let $continue := local:continue-search($element/following-sibling::*[1], substring($pattern, $split+1))
      where not(empty($continue))
      return ($element, $continue)
};

declare function local:continue-search( $element as item()*, $pattern as xs:string) as item()* {
  if (empty($element)) then () else
  (: case a) text node contains whole remaining token :)
  if (starts-with($element/text(), $pattern))
  then ($element)
  (: case b) text node is part of token :)
  else if (starts-with($pattern, $element/text()))
  then
    (: recursive call to next element :)
    let $continue := local:continue-search($element/following-sibling::*[1], substring($pattern, 1+string-length($element/text())))
    where not(empty($continue))
    return ($element, $continue)
  (: token not found :)
  else ()
};

let $token := 'll'
return local:search(//div, $token)
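
For readers without an XQuery engine at hand, here is a minimal Python sketch of a different, simpler technique: offset mapping instead of the recursive prefix matching used above. It concatenates the text of the child elements, remembers which character range each element covers, and reports the ids of the elements that overlap each match. The names `XML` and `find_spanning_ids` are made up for this illustration, and the sketch assumes a flat structure as in the question (only the text directly inside each child of the root counts).

```python
import xml.etree.ElementTree as ET

# Sample document from the question (assumed flat: text sits directly in the children).
XML = """<root>
    <div id="1">Hello</div>
    <div id="2">Hel</div>
    <div id="3">lo dude</div>
    <div id="4">H</div>
    <div id="5">el</div>
    <div id="6">lo</div>
</root>"""


def find_spanning_ids(xml_text, pattern):
    """Return (concatenated text, one list of element ids per match of pattern)."""
    root = ET.fromstring(xml_text)
    spans = []   # (start offset, end offset, id) for every child element
    parts = []
    pos = 0
    for child in root:
        text = child.text or ""
        spans.append((pos, pos + len(text), child.get("id")))
        parts.append(text)
        pos += len(text)
    full = "".join(parts)

    hits = []
    start = full.find(pattern)
    while start != -1:
        end = start + len(pattern)
        # An element belongs to the match if its range overlaps [start, end).
        hits.append([node_id for (s, e, node_id) in spans if s < end and start < e])
        # Advance by one character so overlapping matches are found as well.
        start = full.find(pattern, start + 1)
    return full, hits


if __name__ == "__main__":
    full, hits = find_spanning_ids(XML, "ll")
    print(full)
    print(hits)
```

Unlike the XQuery version, this returns the ids grouped per match rather than one flat list of distinct nodes; flattening the groups gives the ids asked for in the question.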

Other tips

In XPath 2.0 you can use tokenize to count how often the search text occurs, and then test for each node whether leaving that node out of the text reduces the number of occurrences. If the number is reduced, that node has to be included in the result. This is not very fast, though.

Assuming only the text in the direct child nodes matters, like in the example, it looks like this:

for $searched in "ll" 
return //*/(for $matches in count(tokenize(string-join(*, ""), $searched)) - 1
            return *[$matches > count(tokenize(concat(" ",string-join(preceding-sibling::*, "")), $searched)) +
                                count(tokenize(concat(" ",string-join(following-sibling::*, "")), $searched)) - 2])
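
The counting idea can also be sketched outside XPath. The following Python version is only an illustration under the same assumption (only the text of the direct children matters). The names `XML` and `ids_by_counting` are hypothetical. It uses `str.count`, which matches the search text literally, whereas `tokenize` interprets it as a regular expression.

```python
import xml.etree.ElementTree as ET

# Sample document from the question.
XML = """<root>
    <div id="1">Hello</div>
    <div id="2">Hel</div>
    <div id="3">lo dude</div>
    <div id="4">H</div>
    <div id="5">el</div>
    <div id="6">lo</div>
</root>"""


def ids_by_counting(xml_text, pattern):
    """Return ids of children whose removal lowers the number of matches."""
    root = ET.fromstring(xml_text)
    texts = [(child.get("id"), child.text or "") for child in root]
    # Occurrences in the fully concatenated text (non-overlapping, literal).
    total = "".join(text for _, text in texts).count(pattern)

    result = []
    for i, (node_id, _) in enumerate(texts):
        # Count separately in the text before and after the node, as the
        # XPath expression does with preceding-sibling and following-sibling.
        before = "".join(text for _, text in texts[:i])
        after = "".join(text for _, text in texts[i + 1:])
        if total > before.count(pattern) + after.count(pattern):
            result.append(node_id)
    return result


if __name__ == "__main__":
    print(ids_by_counting(XML, "ll"))
```

Because the concatenation is rebuilt for every node, the cost grows quadratically with the number of siblings, which fits the remark above that this approach is not very fast.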

Licensed under: CC-BY-SA with attribution