Question

I have a huge XML file stored in BaseX. The XML nodes have the following structure:

Datas   (Parent Node)
  - Data  (Child of above)
     - Desc  (Child of above)
        - P    (Child of above) and contains the actual text 

The P tag contains all the text, and I have to count the occurrences of a specific word inside the P tag.

I have created a full-text index. To count the occurrences of a specific word, I am using the following two queries:

ft:count(doc('BHCR')/Datas/Data/Desc[. contains text 'revolution'])

This query returns 2177 and took 25 sec.

Another one

ft:count(doc('BHCR')/Datas/Data/Desc[text() contains text 'revolution'])

This query returns 3684 and took 52 millisec.

Which one is right? Can anybody explain the difference between these two queries?


Solution

In your first query, the context item . will lead to a merge of all text nodes of an element (i.e., to the creation of its string value), which will then be used to find full-text tokens. This merge can result in new keywords. As an example, the following query will return true...

<xml><b>H</b>i</xml>/. contains text 'hi'

...whereas the following query will return false, because the text nodes "H" and "i" are tokenized separately:

<xml><b>H</b>i</xml>//text() contains text 'hi'

As the new keywords cannot be stored in the full-text index, no index access takes place, and the query will take much longer.
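
The merge described above can be made visible by comparing an element's string value with its individual text nodes. This is only a minimal sketch that reuses the small example element; the variable name $e is chosen for illustration:

```xquery
(: Compare the merged string value of an element with its separate text nodes :)
let $e := <xml><b>H</b>i</xml>
return (
  string($e),                               (: string value, text nodes merged: "Hi" :)
  for $t in $e//text() return string($t)    (: individual text nodes: "H", "i" :)
)
```

The first item is the token source for a predicate on the context item (.), while the remaining items are what a predicate on text nodes sees, which is also what the full-text index stores.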

Your second query will return all Desc elements that have revolution in their child text nodes. If you want to parse all descending texts of the Desc elements, the following query will give you the expected results:

ft:count(doc('BHCR')/Datas/Data/Desc[.//text() contains text 'revolution'])
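
To compare the variants on your own data, it may help to look at the number of matching Desc elements as well. Note that ft:count() counts occurrences of the search term, whereas the standard count() function counts nodes. The following is a sketch that assumes the BHCR database and the structure from the question:

```xquery
(: Number of Desc elements matched by each variant of the predicate :)
(
  count(doc('BHCR')/Datas/Data/Desc[. contains text 'revolution']),
  count(doc('BHCR')/Datas/Data/Desc[text() contains text 'revolution']),
  count(doc('BHCR')/Datas/Data/Desc[.//text() contains text 'revolution'])
)
```

The three numbers need not agree, since each predicate tokenizes a different piece of text: the merged string value, the child text nodes only, or all descendant text nodes.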

The Full Text: Mixed Context Section in the BaseX documentation will give you more details.

Hope this helps, Christian

License: CC-BY-SA with attribution
Not affiliated with StackOverflow