How to make a word concordance with Solr?

https://stackoverflow.com/questions/14920273

10-03-2022

Question

I would like to create a word concordance hit list with Solr, which gives all occurrences of the given word with context.

An English example:

...bla bla1 <b>dog</b> bla bla 1...
...bla bla2 <b>dog</b> bla bla 2...
...bla bla3 <b>dogs</b> bla bla 3
...bla bla4 <b>dogging</b> bla bla 4...
...bla bla5 <b>dog</b> bla bla 5...

It's important to be able to customize the size of the context. (Sometimes more than 1 sentence.)

My question: how can I do this with Solr?

Lucene 4.1 is able to do this, for example with FastVectorHighlighter:

    //indexing
    FieldType offsetsType = new FieldType(TextField.TYPE_STORED);
    offsetsType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
    offsetsType.setStored(true);
    offsetsType.setIndexed(true);   
    offsetsType.setStoreTermVectors(true);
    offsetsType.setStoreTermVectorOffsets(true);
    offsetsType.setStoreTermVectorPositions(true);
    offsetsType.setStoreTermVectorPayloads(true);

    doc.add(new Field("content", fileContent, offsetsType));


    //searching
    IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(indexPath)));
    IndexSearcher searcher = new IndexSearcher(reader);
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_41);
    QueryParser parser = new QueryParser(Version.LUCENE_41, "content", analyzer);
    Query query = parser.parse("dog");
    TopDocs results = searcher.search(query, 10);

    for (int i = 0; i < results.scoreDocs.length; i++) {
            int id = results.scoreDocs[i].doc;
            Document doc = searcher.doc(id);
            FastVectorHighlighter h = new FastVectorHighlighter();
            String[] hs = h.getBestFragments(h.getFieldQuery(query), reader, id, "content", contextSize, 10000);
            if (hs != null)
                    for(String f : hs)
                        System.out.println(" highlight: " + f);
    }

But how can I ask Solr to do the same?

My attempt was this (solrconfig.xml):

<fragmentsBuilder name="colored" class="org.apache.solr.highlight.ScoreOrderFragmentsBuilder">
 <lst name="defaults">
 <str name="hl.tag.pre"><![CDATA[
      <b style="background:yellow">,<b style="background:lawgreen">,
      <b style="background:aquamarine">,<b style="background:magenta">,
      <b style="background:palegreen">,<b style="background:coral">,
      <b style="background:wheat">,<b style="background:khaki">,
      <b style="background:lime">,<b style="background:deepskyblue">]]></str>
 <str name="hl.tag.post"><![CDATA[</b>]]></str>
 </lst>
</fragmentsBuilder>

<requestHandler name="drupal" class="solr.SearchHandler" default="true">
...
<str name="hl">true</str>
<str name="hl.fl">content</str>
<int name="hl.snippets">5000</int>
<int name="hl.fragsize">300</int>
<str name="hl.simple.pre"><![CDATA[ <b style="background:yellow"><i> ]]></str>
<str name="hl.simple.post"><![CDATA[ </i></b> ]]></str>
<str name="hl.mergeContiguous">true</str>
<str name="hl.fragListBuilder">single</str>
<str name="hl.useFastVectorHighlighter">true</str>

But it always returns one large fragment per document, rather than a separate fragment for each occurrence.

Thanks, Steve

Solution

Can you try with hl.fragsize=100 and hl.mergeContiguous=false and see how many fragments you get?

(Before adding the params directly to your SearchHandler in solrconfig.xml, you can try various options by specifying all your params in the query. Once you find a set of params you are happy with, use those in solrconfig.xml.)
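As a rough illustration of passing the highlighting params in the query, here is a minimal Java sketch that only builds the request URL. The base URL, the class name `ConcordanceQuery`, and the helper `buildUrl` are hypothetical. It also assumes a fragListBuilder named `simple` is defined in solrconfig.xml (as in the stock example config). The question's handler uses `single`, which, as far as I know, returns the whole field as one fragment, so that setting may be worth changing as well.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class ConcordanceQuery {

    // Builds a Solr select URL that carries the highlighting params in the
    // query string, so they can be tuned without editing solrconfig.xml.
    // baseUrl is an assumption, e.g. "http://localhost:8983/solr".
    static String buildUrl(String baseUrl, String term, int fragSize, int maxSnippets) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "content:" + term);
        params.put("hl", "true");
        params.put("hl.fl", "content");
        params.put("hl.useFastVectorHighlighter", "true");
        // Assumption: "simple" is defined in solrconfig.xml; the question used "single".
        params.put("hl.fragListBuilder", "simple");
        params.put("hl.fragsize", String.valueOf(fragSize));
        params.put("hl.snippets", String.valueOf(maxSnippets));
        params.put("hl.mergeContiguous", "false");

        StringBuilder url = new StringBuilder(baseUrl).append("/select?");
        boolean first = true;
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (!first) {
                url.append('&');
            }
            url.append(encode(e.getKey())).append('=').append(encode(e.getValue()));
            first = false;
        }
        return url.toString();
    }

    // URL-encodes a single key or value as UTF-8.
    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Print the URL; paste it into a browser or curl to inspect the fragments.
        System.out.println(buildUrl("http://localhost:8983/solr", "dog", 100, 5000));
    }
}
```

Changing `fragSize` in the call is then a quick way to experiment with the context size before committing values to the handler defaults.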

Other tips

I just contributed a patch http://issues.apache.org/jira/i#browse/LUCENE-5317 that might be of interest. A Solr-wrapper is on its way.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow