Question

I'm trying to make search keywords bold in result titles by wrapping each keyword in <b>kw</b> using the replaceAll() method. I also need to ignore any special characters in the keywords when highlighting. The code I'm using (shown below) replaces text inside the <b> tags it has already inserted when it processes the second keyword. I'm looking for an elegant regex solution, since my alternative is growing too big without covering all cases. For example, with this input:

addHighLight("a b", "abacus") 

...I get this result:

<<b>b</b>>a</<b>b</b>><b>b</b><<b>b</b>>a</<b>b</b>>cus

public static String addHighLight(String kw, String text) {
    String highlighted = text;
    if (kw != null && !kw.trim().isEmpty()) {
        List<String> tokens = Arrays.asList(kw.split("[^\\p{L}\\p{N}]+"));
        for(String token: tokens) {
            try {
                highlighted = highlighted.replaceAll("(?i)(" + token + ")", "<b>$1</b>");
            } catch ( Exception e) {
                e.printStackTrace();
            }
        }
    }
    return highlighted;
}

Solution

  1. Don't forget to use Pattern.quote(token), unless the keywords are guaranteed to contain no regex metacharacters.
  2. If you're bound to use replaceAll() (instead of tokenizing the input into tag|text|tag|text|... and applying the replacement to the text parts only, which would be much simpler and faster), the code below should help.

Note that this replaceAll() approach is not efficient: it matches some empty or already-highlighted spots and therefore requires "curing" after substitution. It should, however, treat XML/HTML tags (except CDATA) properly.

Here's a "curing" function (no null checks):

private static Pattern cureDoubleB = Pattern.compile("<b><b>([^<>]*)</b></b>");
private static Pattern cureEmptyB = Pattern.compile("<b></b>");
private static String cure(String input) {
    return cureEmptyB.matcher(cureDoubleB.matcher(input).replaceAll("<b>$1</b>")).replaceAll("");
}

Here's what the replaceAll() line should look like:

String txt = "[^<>" + Pattern.quote(token.substring(0, 1).toLowerCase()) + Pattern.quote(token.substring(0, 1).toUpperCase()) +"]*";
highlighted = cure(highlighted.replaceAll("((<[^>]*>)*"+txt+")(((?i)" + Pattern.quote(token) + ")|("+txt+"))", "$1<b>$4</b>$5"));
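For comparison, the alternative mentioned in point 2 (split the input into tag and text runs, and highlight only the text runs) could be sketched roughly as follows. This is only an illustration under a few assumptions: the class name TagAwareHighlighter and the helper highlight() are made up for the example, all keywords are combined into a single case-insensitive alternation so that each text run is scanned once, and a "tag" is simply anything matching <[^>]*> (so CDATA and comments are not handled).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagAwareHighlighter {
    // Assumption: a "tag" is any run like <b> or </span>; CDATA and comments are not handled.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    public static String addHighLight(String kw, String text) {
        if (kw == null || text == null || kw.trim().isEmpty()) {
            return text;
        }
        // Split keywords on anything that is not a letter or digit, dropping empty tokens.
        List<String> tokens = new ArrayList<>();
        for (String token : kw.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        if (tokens.isEmpty()) {
            return text;
        }
        // Longest tokens first, so "abacus" is tried before "ab" in the alternation.
        tokens.sort((x, y) -> Integer.compare(y.length(), x.length()));
        List<String> quoted = new ArrayList<>();
        for (String token : tokens) {
            quoted.add(Pattern.quote(token));
        }
        Pattern keywords = Pattern.compile("(" + String.join("|", quoted) + ")",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

        // Walk the input as text|tag|text|tag|...; copy tags verbatim, highlight text runs.
        StringBuilder out = new StringBuilder();
        Matcher tags = TAG.matcher(text);
        int pos = 0;
        while (tags.find()) {
            out.append(highlight(keywords, text.substring(pos, tags.start())));
            out.append(tags.group());
            pos = tags.end();
        }
        out.append(highlight(keywords, text.substring(pos)));
        return out.toString();
    }

    // Hypothetical helper: wraps every keyword match in a plain-text run with <b>...</b>.
    private static String highlight(Pattern keywords, String plain) {
        return keywords.matcher(plain).replaceAll("<b>$1</b>");
    }
}
```

With this sketch, addHighLight("a b", "abacus") returns <b>a</b><b>b</b><b>a</b>cus. Because the inserted <b> tags are never rescanned, no "curing" step is needed.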

Other tips

Since you're already excluding special characters from your keywords, the simplest fix might be to add a little more to your search pattern. The following should prevent you from matching a keyword that directly follows the < or / of an HTML tag. It uses a negative lookbehind, so the preceding character is checked without being consumed or dropped from the output:

highlighted = highlighted.replaceAll("(?i)(?<![</])(" + token + ")", "<b>$1</b>");

This worked for me with minimal changes to the original code, using regex lookbehind:

highlighted = highlighted.replaceAll("(?i)((?<!<)(?<!/)" + token + "(?<!>))", "<b>$1</b>");
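To show that line in context, here is a rough, self-contained version of the question's method with the lookbehind pattern dropped in. The class name LookbehindHighlighter, the Pattern.quote() call, and the empty-token check are additions for the sake of the example and are not part of the original answer.

```java
import java.util.regex.Pattern;

public class LookbehindHighlighter {
    public static String addHighLight(String kw, String text) {
        if (kw == null || text == null || kw.trim().isEmpty()) {
            return text;
        }
        String highlighted = text;
        for (String token : kw.split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // a leading separator in kw yields an empty first token
            }
            // Skip matches that directly follow '<' or '/', i.e. the "b" in <b> and </b>.
            // Pattern.quote is an added safety measure (assumption, not in the answer).
            highlighted = highlighted.replaceAll(
                    "(?i)((?<!<)(?<!/)" + Pattern.quote(token) + "(?<!>))",
                    "<b>$1</b>");
        }
        return highlighted;
    }
}
```

For the input from the question, addHighLight("a b", "abacus") returns <b>a</b><b>b</b><b>a</b>cus. The lookbehinds only guard the character immediately after < or /, so this is enough for the <b> tags the method inserts itself, but a keyword could still match inside other markup already present in the text (for example the "a" in <span>).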
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow