Question

I have a Solr instance where I index web pages, and I want to be able to query on parts of the URL. For example, a query for en.wikipedia.org should also match a page indexed as en.wikipedia.org/wiki/Main_Page.

To do this, I have made a field called url_tokens, which gets copied over from my url field, and which gets analyzed using the PathHierarchyTokenizerFactory at index time.

I thought that the url_tokens field would contain en.wikipedia.org/wiki/Main_Page, en.wikipedia.org/wiki and en.wikipedia.org, but this is the result I get from the Solr admin query interface:

...
"url": "http://en.wikipedia.org/wiki/Main_Page",
"url_tokens": [
  "http://en.wikipedia.org/wiki/Main_Page"
],
...

What am I doing wrong?

These are the relevant parts of my schema.xml:

<field name="url_tokens" type="url_tokens_type" indexed="true" stored="true" multiValued="true"/>

<field name="url" type="url" indexed="true" stored="true"/>

<copyField source="url" dest="url_tokens"/>

<fieldType name="url" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"/>
  </analyzer>
</fieldType>

<fieldType name="url_tokens_type" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory" />
  </analyzer>
</fieldType>

Solution

I found the answer. My setup was working fine; I was just expecting the wrong output.

I expected that since I had tokenized the field using the PathHierarchyTokenizerFactory and the field was multivalued, I would get a result of

"url_tokens": [
  "http://en.wikipedia.org/wiki/Main_Page",
  "http://en.wikipedia.org/wiki",
  "http://en.wikipedia.org"
],

But the reason I got

"url_tokens": [
  "http://en.wikipedia.org/wiki/Main_Page"
],

in the search results is that the field is stored, and a stored field returns the original value exactly as it was submitted. Tokenization does happen, because the field is also indexed, but the resulting tokens never appear in the search results; they are only used to decide which documents match.
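To make the stored/indexed distinction concrete, here is a minimal Python sketch that roughly emulates path-hierarchy tokenization with `/` as the delimiter. It is not the Lucene implementation, and the function names `path_hierarchy_tokens` and `matches` are made up for illustration. It assumes the query side is analyzed keyword-style, as in the schema above, so the whole query is treated as a single token.

```python
def path_hierarchy_tokens(text, delimiter="/"):
    """Rough emulation (an assumption, not Lucene's code): emit every prefix
    of `text` that ends just before a delimiter, then the full text."""
    tokens = [text[:i] for i, ch in enumerate(text) if ch == delimiter and i > 0]
    if text:
        tokens.append(text)
    return tokens


def matches(query, field_value):
    """Keyword-style query analysis: the whole query is one token and must
    equal one of the indexed tokens exactly."""
    return query in path_hierarchy_tokens(field_value)


url = "http://en.wikipedia.org/wiki/Main_Page"

stored = [url]                        # what a search response displays
indexed = path_hierarchy_tokens(url)  # what queries are matched against

print("stored: ", stored)
print("indexed:", indexed)
print(matches("http://en.wikipedia.org", url))
print(matches("en.wikipedia.org", url))
```

Under this emulation the indexed tokens are prefixes of the full value, so they all begin with the `http:` scheme. A query would therefore need to include `http://` to match, unless the scheme is stripped before the value reaches this field.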

I had not previously used the analysis screen of the Solr admin GUI, but I have now used it to confirm that the URLs are tokenized correctly.
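The same check can probably be scripted rather than done through the GUI. The sketch below only builds a request URL for Solr's field analysis handler. It assumes that handler is reachable at `/analysis/field` in your Solr version and configuration; the host, the core name `mycore`, and the helper name `field_analysis_url` are hypothetical.

```python
from urllib.parse import urlencode


def field_analysis_url(base_url, core, field_name, field_value):
    """Build a field-analysis request URL. The handler path is an assumption
    about the Solr setup; adjust it if your instance registers it elsewhere."""
    params = urlencode({
        "analysis.fieldname": field_name,    # field whose analyzer chain to run
        "analysis.fieldvalue": field_value,  # text to analyze as if indexing it
        "wt": "json",
    })
    return f"{base_url}/{core}/analysis/field?{params}"


request_url = field_analysis_url(
    "http://localhost:8983/solr",  # hypothetical host
    "mycore",                      # hypothetical core name
    "url_tokens",
    "http://en.wikipedia.org/wiki/Main_Page",
)
print(request_url)
```

Fetching that URL (for example with `urllib.request.urlopen`) should return the tokens produced at each stage of the index-time analyzer, if the handler is enabled.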

Licensed under: CC-BY-SA with attribution