Problems with hyphen in Jackrabbit XPath query

https://stackoverflow.com/questions/3572258

01-10-2019

Question

Firstly, let me just say that I'm very new to JSR-170 and Jackrabbit/Lucene in general.

I have the following XPath query:

//*[@sling:resourceType="users/user-profile" and jcr:contains(*/*/*,'sophie\-a')] order by @jcr:score descending

I have a user named Sophie-Allen and a user named Sophie-Anne. Searching with the above query returns zero results, whereas searching for 'sophie' alone returns both users. I understand that the hyphen means "exclude" in JSR-170, but I've escaped it (as you can see above).

Why is this query not returning both users?

Another strange thing is when I use asterisks (the hyphens are all escaped when executed):

Searching for 'sophie-allen' returns Sophie-Allen's record.
Searching for 'soph*' returns both Sophie-Allen and Sophie-Anne.
Searching for 'sophie-a*' returns nothing.
Searching for 'sophie-allen*' returns nothing.

I understand that with jcr:contains, technically you don't need to use asterisks, but looking at the above behaviour, it seems to have some sort of effect.

Is there something else that I'm missing with regards to hyphens and asterisks in XPath queries and searching a JCR? I've googled everything I can think of and read through the spec, but can't seem to find anything that answers my question.

Thanks in advance.

Edit: It looks like a 'phrase query' doesn't work with jcr:contains (anymore?) because the default Lucene Analyzer tokenizes on the hyphen, meaning it splits 'sophie-allen' into sophie and allen.

Edit 2: I've tried using a custom analyzer and tokenizer, as suggested by someone on the Jackrabbit Users list, but that hasn't helped either: Lucene still acts on the hyphen and omits the results I want.

Solution 2

While working on this with a colleague, we discovered a JIRA issue for ModeShape, incidentally logged by Randall (who answered here too). It turns out that the problem is caused by Jackrabbit not properly handling a hyphen in a search term that also contains a wildcard.

Randall had written a fix for ModeShape, but my colleagues and project team decided not to fix our problem at this stage, as the use of Jackrabbit was not 100% certain.

I'd like to credit Randall with the answer to this question, but his post isn't the actual answer. I'll mark this post as the answer, unless Randall comes along and posts something.

OTHER TIPS

You are correct that Lucene does split "sophie-allen" into two tokens, but those tokens are adjacent. You said you've tried a phrase expression like this:

... jcr:contains(*/*/*,'"sophie-a*"') ...

This should work by finding the token "sophie" followed by another token containing 'a' as the first character. Because the same analyzer used during indexing should be used to tokenize this phrase expression, the '-' character will still be used as a delimiter [1]. (Note that if you're specifying your XPath expression in Java code, you'd have to escape the double-quote characters with a preceding backslash.)
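The matching described above can be sketched in plain Java. This is only an illustration of the idea (adjacent tokens, with the last one matched as a prefix), not Lucene's or Jackrabbit's actual code: the `tokenize` and `matchesPhrasePrefix` helpers are hypothetical, and the tokenizer is a simplified stand-in that splits on anything that is not a letter or digit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class PhrasePrefixSketch {

    // Simplified stand-in for an analyzer (an assumption, not Lucene's rules):
    // lowercases the text and splits on anything that is not a letter or digit,
    // so '-' acts as a delimiter and a trailing '*' is dropped.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // True if the document contains the phrase tokens at adjacent positions.
    // In this sketch the last phrase token only has to match as a prefix,
    // which models a trailing wildcard such as "sophie-a*".
    public static boolean matchesPhrasePrefix(String document, String phrase) {
        List<String> doc = tokenize(document);
        List<String> query = tokenize(phrase);
        if (query.isEmpty()) {
            return false;
        }
        for (int start = 0; start + query.size() <= doc.size(); start++) {
            boolean ok = true;
            for (int i = 0; i < query.size() && ok; i++) {
                String d = doc.get(start + i);
                String q = query.get(i);
                ok = (i == query.size() - 1) ? d.startsWith(q) : d.equals(q);
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Sophie-Allen"));                         // [sophie, allen]
        System.out.println(matchesPhrasePrefix("Sophie-Allen", "sophie-a*")); // true
        System.out.println(matchesPhrasePrefix("Sophie-Anne", "sophie-a*"));  // true
        System.out.println(matchesPhrasePrefix("Sophie Brown", "sophie-a*")); // false
    }
}
```

Under these assumptions both Sophie-Allen and Sophie-Anne match, which is the behaviour the phrase expression is expected to give when the query text is tokenized the same way as the indexed text.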

However, if this does not work, you might try taking out the hyphen in this expression. Because you're using wildcards, the logic might be incorrectly tokenizing the wildcard expression. In other words, try:

... jcr:contains(*/*/*,'"sophie a*"') ...

Of course, without wildcards, this would probably work (with or without the hyphen):

... jcr:contains(*/*/*,'"sophie-allen"') ...
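If the phrase expressions above are built in Java code, the inner double quotes need a preceding backslash in the string literal. A minimal sketch, assuming the rest of the statement is the query from the question; `buildQuery` is a hypothetical helper, and it assumes the phrase itself contains no quote characters:

```java
public class ContainsQuerySketch {

    // Wraps the given phrase in double quotes inside the jcr:contains() call.
    // The \" sequences are Java escapes; the resulting string holds plain " characters.
    // Assumption: 'phrase' contains no single or double quotes of its own.
    public static String buildQuery(String phrase) {
        return "//*[@sling:resourceType=\"users/user-profile\" and jcr:contains(*/*/*,'\""
                + phrase
                + "\"')] order by @jcr:score descending";
    }

    public static void main(String[] args) {
        // Phrase variant without the hyphen, as suggested above.
        System.out.println(buildQuery("sophie a*"));
    }
}
```

With a JCR session, the resulting string would typically be passed to QueryManager.createQuery(statement, Query.XPATH) and executed from there.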

Good luck!

P.S. I've not verified that this works in Jackrabbit, but it does work in ModeShape (which also uses Lucene).

[1] The exact rules depend on the analyzer and its tokenizer. For example, Lucene's StandardAnalyzer filters out English stop words, and its tokenizer splits on the '-' character except when there's a number in the token (in which case the whole token is interpreted as a product number and is not split).

Licensed under: CC-BY-SA with attribution
