XQuery Full Text search with Word1 and NOT Word2

https://stackoverflow.com/questions/14955164

10-03-2022
Question

The XML structure is as follows:

<Docs>
  <Doc>
    <Name>Doc 1</Name>
    <Notes>
        <specialNote>
          This is a special note section. 
           <B>This B Tag is used for highlighting any text and is optional</B>        
           <U>This U Tag will underline any text and is optional</U>        
           <I>This I Tag is used for highlighting any text and is optional</I>        
        </specialNote>      
        <generalNote>
           <P>
            This will store the general notes and might have number of paragraphs. This is para no 1. NO Child Tags here         
           </P>
           <P>
            This is para no 2            
           </P>  
        </generalNote>      
    </Notes>  
    <Desc>
        <P>
          This is used for Description and might have number of paragraphs. Here too, there will be B, U and I Tags for highlighting the description text and are optional
          <B>Bold</B>
          <I>Italic</I>
          <U>Underline</U>
        </P>
        <P>
          This is description para no 2 with I and U Tags
          <I>Italic</I>
          <U>Underline</U>
        </P>      
    </Desc>
  </Doc>
</Docs>

There will be thousands of Doc elements. I want to offer users a search in which they can look for documents that contain WORD1 but not WORD2. This is the query:

for $x in doc('Documents')/Docs/Doc[
  Notes/specialNote/text()   contains text 'Tom' ftand ftnot 'jerry' or
  Notes/specialNote/B/text() contains text 'Tom' ftand ftnot 'jerry' or
  Notes/specialNote/I/text() contains text 'Tom' ftand ftnot 'jerry' or
  Notes/specialNote/U/text() contains text 'Tom' ftand ftnot 'jerry' or
  Notes/generalNote/P/text() contains text 'Tom' ftand ftnot 'jerry' or
  Desc/P/text()              contains text 'Tom' ftand ftnot 'jerry' or
  Desc/P/B/text()            contains text 'Tom' ftand ftnot 'jerry' or
  Desc/P/I/text()            contains text 'Tom' ftand ftnot 'jerry' or
  Desc/P/U/text()            contains text 'Tom' ftand ftnot 'jerry'
]
return $x/Name

The result of this query is wrong: it includes some documents that contain both Tom and jerry. So I changed the query to:

for $x in doc('Documents')/Docs/Doc[. contains text 'Tom' ftand ftnot 'jerry'] 
return $x/Name

This query gives the correct result, i.e. only those documents that contain Tom and not jerry, but it is much slower: approximately 45 seconds, whereas the earlier query took 10 seconds.

I am using BaseX 7.5 XML Database.

Need expert comments on this :)

Solution

The first query tests each text node separately. A Doc that contains both Tom and Jerry therefore still matches as soon as any single text node contains Tom but not Jerry, for example when Jerry only occurs in a different text node.

In the second query, the full-text search is performed on the whole text content of each Doc element, as if all of its text nodes were concatenated into one string. This cannot (currently) be answered by BaseX's full-text index, which indexes each text node separately.
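The difference between the two evaluation modes can be seen on a small in-memory fragment. This is only a sketch: the element content is made up for illustration, and no database or index is involved.

(: Hypothetical sample data: Tom and Jerry sit in different text nodes of one P. :)
let $doc :=
  <Doc><Desc><P>Tom chases <B>Jerry</B> again</P></Desc></Doc>
return (
  (: Per text node: the node "Tom chases " contains Tom and no Jerry,
     so one item matches and the expression is true. :)
  $doc/Desc/P/text() contains text 'Tom' ftand ftnot 'Jerry',

  (: Whole element: the string value is "Tom chases Jerry again",
     which contains Jerry, so the expression is false. :)
  $doc contains text 'Tom' ftand ftnot 'Jerry'
)

The first expression shows why the original query returned documents that mention both words; the second mirrors the slower but correct query.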

A solution is to perform the full-text search for each term separately and merge the results at the end. Each search can then be evaluated per text node, so the index can be used:

for $x in (doc('Documents')/Docs/Doc[.//text() contains text 'Tom']
            except doc('Documents')/Docs/Doc[.//text() contains text 'Jerry'])
return $x/Name

The above query is rewritten by the query optimizer to this equivalent one using two index accesses:

for $x in (db:fulltext("Documents", "Tom")/ancestor::*:Doc
            except db:fulltext("Documents", "Jerry")/ancestor::*:Doc)
return $x/Name
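The semantics of the except approach can be checked without a database. The following sketch uses made-up in-memory data in place of doc('Documents'); since such a fragment has no full-text index, it only demonstrates which documents are selected, not the speed-up.

(: Hypothetical sample data standing in for doc('Documents')/Docs. :)
let $docs :=
  <Docs>
    <Doc><Name>Doc 1</Name><Desc><P>Tom chases <B>Jerry</B></P></Desc></Doc>
    <Doc><Name>Doc 2</Name><Desc><P>Tom sleeps</P></Desc></Doc>
    <Doc><Name>Doc 3</Name><Desc><P>Jerry eats cheese</P></Desc></Doc>
  </Docs>
(: Docs with a text node containing Tom, minus Docs with a text node containing Jerry. :)
for $x in ($docs/Doc[.//text() contains text 'Tom']
            except $docs/Doc[.//text() contains text 'Jerry'])
return $x/Name

Only Doc 2 remains: Doc 1 is removed because one of its text nodes contains Jerry, and Doc 3 never matched Tom in the first place.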

If you want, you can also adjust the order in which the partial results are merged, so that intermediate results stay small.
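One possible reading of this advice, sketched with a second required word: combine the term expected to match the fewest documents first, so that every later set operation works on a small node set. The word 'Spike', the sample data, and the assumption that 'Spike' is the rarer term are all hypothetical, and whether the ordering pays off depends on the actual data and on the optimizer.

(: Hypothetical sample data; with a real database, doc('Documents')/Docs
   would take the place of $docs. :)
let $docs :=
  <Docs>
    <Doc><Name>Doc 1</Name><Desc><P>Tom and Spike chase <B>Jerry</B></P></Desc></Doc>
    <Doc><Name>Doc 2</Name><Desc><P>Tom meets <I>Spike</I></P></Desc></Doc>
    <Doc><Name>Doc 3</Name><Desc><P>Tom sleeps</P></Desc></Doc>
  </Docs>
(: Assumption: 'Spike' is rarer than 'Tom', so its matches are combined first. :)
let $spike := $docs/Doc[.//text() contains text 'Spike']
let $tom   := $docs/Doc[.//text() contains text 'Tom']
let $jerry := $docs/Doc[.//text() contains text 'Jerry']
for $x in (($spike intersect $tom) except $jerry)
return $x/Name

On this data the query selects Doc 2, the only document that contains Tom and Spike but not Jerry.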

License: CC-BY-SA with attribution

Not affiliated with StackOverflow