Lucene Highlighter class: highlight different words in different colors

Question

I asked this question four years ago... At the time I did manage to implement a solution using javax.swing.text.html.HTMLDocument. (There's also the interface org.w3c.dom.html.HTMLDocument in the standard Java library.) Going that way is hard work.

But for anyone interested there's a far simpler solution, which takes advantage of the fact that Lucene's SimpleHTMLFormatter returns about the simplest imaginable "marked up" piece of text: chosen words are highlighted with the HTML B tag. That's it. It's not even a "proper" HTML fragment, just a String with <B>s and </B>s in it.

A multi-word query generates a BooleanQuery, from which you can extract the individual TermQuerys by iterating over booleanQuery.clauses() and calling getQuery() on each clause.

I'm working in Groovy. The colouring I want to apply is ANSI console escape codes, as used in BASH (or Cygwin). Other types of colouring can be worked out on this model.

So you first set up a map to hold your "markup details" (character position in the original text → markup to insert at that position):

def markupDetails = [:]

Then, for each TermQuery, you call the method below with the same text param each time but a different colour param for each term. NB I'm using Lucene 6.

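import java.util.regex.Matcher
import org.apache.lucene.analysis.TokenStream
import org.apache.lucene.search.TermQuery
import org.apache.lucene.search.highlight.Highlighter
import org.apache.lucene.search.highlight.NullFragmenter
import org.apache.lucene.search.highlight.QueryScorer
import org.apache.lucene.search.highlight.TokenSources

// formatter (a SimpleHTMLFormatter), fieldName, analyser and markupDetails
// are assumed to be defined in the enclosing class
def createHighlightAndAnalyseMarkup( TermQuery tq, String text, String colour ) {
    def termQueryScorer = new QueryScorer( tq )
    def termQueryHighlighter = new Highlighter( formatter, termQueryScorer )
    // the default fragmenter splits the text into short fragments, so a longer text can
    // yield > 1 fragment; NullFragmenter keeps the whole text as a single fragment,
    // so that the offsets computed below refer to the original String
    termQueryHighlighter.setTextFragmenter( new NullFragmenter() )
    TokenStream stream = TokenSources.getTokenStream( fieldName, null, text, analyser, -1 )
    String[] frags = termQueryHighlighter.getBestFragments( stream, text, 999999 )
    assert frags.size() <= 1
    // NB not every term occurs in every returned document, so there may be no fragment at all
    if( frags.size() ) {
        String highlightedFrag = frags[ 0 ]
        Matcher boldTagMatcher = highlightedFrag =~ /<\/?B>/
        def pos = 0          // offset in the original (tag-free) text
        def previousEnd = 0  // end of the previous tag in the highlighted text
        while( boldTagMatcher.find()) {
            // advance by the amount of real text between the previous tag and this one
            pos += boldTagMatcher.start() - previousEnd
            previousEnd = boldTagMatcher.end()
            // an opening tag starts the colour, a closing tag resets it
            markupDetails[ pos ] = boldTagMatcher.group() == '<B>' ? colour : ConsoleColors.RESET
        }
    }
}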
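
To see the offset arithmetic on its own, without Lucene, here is a minimal sketch in plain Java (callable from Groovy). It takes an already highlighted String and records where each tag would sit in the tag-free text. The class name TagOffsets, the method name collect and the placeholder markers "[Y]", "[R]" and "[/]" are made up for illustration; they are not part of Lucene or of my original code.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TagOffsets {
    private static final Pattern BOLD_TAG = Pattern.compile("</?B>");

    // For each <B> or </B> in `highlighted`, records the offset that tag
    // would have in the same text with all tags removed.
    public static void collect(String highlighted, String colour, String reset,
                               Map<Integer, String> markupDetails) {
        Matcher m = BOLD_TAG.matcher(highlighted);
        int pos = 0;          // offset in the tag-free text
        int previousEnd = 0;  // end of the previous tag in the tagged text
        while (m.find()) {
            // only the real text between two tags advances the position
            pos += m.start() - previousEnd;
            previousEnd = m.end();
            markupDetails.put(pos, m.group().equals("<B>") ? colour : reset);
        }
    }

    public static void main(String[] args) {
        Map<Integer, String> details = new TreeMap<>();
        // one call per term, each with its own (placeholder) colour marker
        collect("the <B>quick</B> brown fox", "[Y]", "[/]", details);
        collect("the quick brown <B>fox</B>", "[R]", "[/]", details);
        System.out.println(details); // {4=[Y], 9=[/], 16=[R], 19=[/]}
    }
}
```

Note that if two terms produce markup at exactly the same position, the later entry overwrites the earlier one, just as in the Groovy map.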

As I said, I wanted to colourise console output. The colour parameter of the method above is one of the standard ANSI console colour codes. E.g. yellow is \033[33m. ConsoleColors.RESET is \033[0m and marks the place where each coloured bit of text stops.
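
For anyone who hasn't used these codes before, here is a minimal sketch in plain Java of how a colour code and the reset code wrap a piece of text. The class AnsiDemo and its paint method are hypothetical stand-ins for whatever ConsoleColors class you use; the escape sequences themselves are the standard ANSI ones (33 for yellow foreground, 0 for reset).

```java
final class AnsiDemo {
    // \033 is the ESC character (octal 33, decimal 27)
    public static final String RESET = "\033[0m";
    public static final String YELLOW = "\033[33m";

    // Switches the colour on before the text and back off after it.
    public static String paint(String s, String colour) {
        return colour + s + RESET;
    }

    public static void main(String[] args) {
        // on an ANSI-capable terminal, only "highlighted" appears in yellow
        System.out.println(paint("highlighted", YELLOW) + " plain");
    }
}
```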

... after you've finished doing this with all TermQuerys you will have a nice map telling you where individual colours begin and end. You then work backwards from the end of the text, so that inserting one piece of "markup" does not shift the positions of the pieces still to be inserted. NB here text is your original unmarked-up String:

    markupDetails.sort().reverseEach{ pos, markup ->
        String firstPart = text.substring( 0, pos )
        String secondPart = text.substring( pos )
        text = firstPart + markup + secondPart
    }
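
The same back-to-front insertion can be sketched in plain Java with a TreeMap and a StringBuilder. MarkupApplier and apply are hypothetical names, and the "[Y]"-style markers stand in for real console codes so that the result is readable.

```java
import java.util.Map;
import java.util.TreeMap;

final class MarkupApplier {
    // Inserts each piece of markup at its recorded offset, highest offset first,
    // so that earlier offsets are still valid when their turn comes.
    public static String apply(String text, TreeMap<Integer, String> markupDetails) {
        StringBuilder sb = new StringBuilder(text);
        for (Map.Entry<Integer, String> e : markupDetails.descendingMap().entrySet()) {
            sb.insert(e.getKey(), e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        TreeMap<Integer, String> details = new TreeMap<>();
        details.put(4, "[Y]");
        details.put(9, "[/]");
        details.put(16, "[R]");
        details.put(19, "[/]");
        System.out.println(apply("the quick brown fox", details));
        // the [Y]quick[/] brown [R]fox[/]
    }
}
```

Using a StringBuilder avoids creating two substrings and a new String for every insertion, which only starts to matter for long texts with many highlighted terms.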

... at the end of which text contains your marked-up String: print to console. Lovely.