Question

I am currently struggling with my "dirty word" filter finding partial matches.

Example: if I pass these two parameters, as in replaceWord("ass", "passing pass passed ass"),

to this method:

private static String replaceWord(String word, String input) {
    Pattern legacyPattern = Pattern.compile(word, Pattern.CASE_INSENSITIVE);
    Matcher matcher = legacyPattern.matcher(input);
    StringBuilder returnString = new StringBuilder();
    int index = 0;
    while(matcher.find()) {
        returnString.append(input.substring(index,matcher.start()));
        for(int i = 0; i < word.length() - 1; i++) {
            returnString.append('*');
        }
        returnString.append(word.substring(word.length()-1));

        index = matcher.end();
    }
    if(index < input.length() - 1){
        returnString.append(input.substring(index));
    }
    return returnString.toString();
}

I get "p**sing p**s p**sed **s"

when I really just want "passing pass passed **s". Does anyone know how to avoid this partial matching with this method? Any help would be great, thanks!


Solution

Oracle's Java tutorial on regular expressions (see the lesson on boundary matchers) should point you in the right direction.

You want to use a word boundary in your pattern:

Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b", Pattern.CASE_INSENSITIVE);
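
Applied to the loop from the question, the word boundary idea might look like the sketch below. The class name WordMasker is made up for illustration, and Matcher.appendReplacement/appendTail are swapped in for the manual index bookkeeping; treat it as one possible shape rather than the definitive fix. Unlike the original, it keeps the last character of the matched text rather than of the search word, so the input's casing is preserved.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named only for this example.
class WordMasker {
    static String replaceWord(String word, String input) {
        if (word.isEmpty()) {
            return input; // nothing to mask
        }
        // \b on both sides restricts matches to whole words;
        // Pattern.quote makes the word match literally.
        Pattern pattern = Pattern.compile(
                "\\b" + Pattern.quote(word) + "\\b", Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(input);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            String hit = matcher.group();
            // One '*' per character except the last, which is kept as typed.
            StringBuilder mask = new StringBuilder();
            for (int i = 0; i < hit.length() - 1; i++) {
                mask.append('*');
            }
            mask.append(hit.charAt(hit.length() - 1));
            // quoteReplacement protects against '$' or '\' in the kept character.
            matcher.appendReplacement(result, Matcher.quoteReplacement(mask.toString()));
        }
        // Copies whatever follows the last match, including the final character.
        matcher.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        // prints: passing pass passed **s
        System.out.println(replaceWord("ass", "passing pass passed ass"));
    }
}
```

appendTail also sidesteps the trailing-text check in the original loop, which can drop the final character when exactly one character follows the last match.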

Note, however, that this is still problematic (as profanity filtering always is). The "non-word character" that defines a boundary is anything not included in [0-9A-Za-z_].

So, for example, _ass would not match, because the underscore counts as a word character.
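
To see the boundary behaviour concretely, here is a small sketch. The class and method names are invented for the example, and the "loose" variant is only one possible workaround (an assumption, not part of the answer): it uses lookarounds so that only letters and digits count as part of a word, which makes an underscore act as a separator.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, named only for this example.
class BoundaryDemo {
    // Whole-word search using \b, as suggested above.
    static boolean containsWord(String word, String input) {
        Pattern p = Pattern.compile(
                "\\b" + Pattern.quote(word) + "\\b", Pattern.CASE_INSENSITIVE);
        return p.matcher(input).find();
    }

    // Variant: lookarounds that treat only letters and digits as word
    // characters, so '_' no longer glues the term to its neighbours.
    static boolean containsWordLoose(String word, String input) {
        Pattern p = Pattern.compile(
                "(?<![A-Za-z0-9])" + Pattern.quote(word) + "(?![A-Za-z0-9])",
                Pattern.CASE_INSENSITIVE);
        return p.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(containsWord("ass", "passing pass passed")); // false
        System.out.println(containsWord("ass", "an ASS here"));         // true
        System.out.println(containsWord("ass", "_ass"));                // false
        System.out.println(containsWordLoose("ass", "_ass"));           // true
    }
}
```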

You also have the problem of profanity-derived terms, where the term is prepended to, say, "hole" or "wipe"; a word boundary pattern will not catch those.

OTHER TIPS

I'm working on a dirty word filter as we speak, and the option I chose to go with was Soundex and some regex.

I first filter out strange characters, keeping only those matched by \w, which is [a-zA-Z_0-9].

Then I use soundex(String) to build a code that can be compared against the Soundex code of the word I want to test.

    // Soundex.soundex(String) stands for whichever Soundex implementation you use.
    String soundExOfDirtyWord = Soundex.soundex(dirtyWord);
    String soundExOfTestWord = Soundex.soundex(testWord);
    if (soundExOfTestWord.equals(soundExOfDirtyWord)) {
        System.out.println("The test word sounds like " + dirtyWord);
    }

I just keep a list of dirty words in the program and compare each test word's Soundex code against theirs. The algorithm is worth looking at.
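
For readers who have not met Soundex, here is a minimal, self-contained sketch of the classic American variant. It is an illustration written for this page, not the Soundex class used above, and the class name SoundexSketch is made up. It keeps the first letter, maps the remaining consonants to digits, collapses adjacent letters with the same digit, and pads the result to four characters.

```java
// Hypothetical class, named only for this example.
class SoundexSketch {
    // Digit for each letter a..z; '0' marks uncoded letters (vowels, h, w, y).
    private static final String CODES = "01230120022455012623010202";

    static String soundex(String s) {
        StringBuilder out = new StringBuilder();
        char prev = '0';
        for (char raw : s.toCharArray()) {
            char c = Character.toLowerCase(raw);
            if (c < 'a' || c > 'z') {
                continue; // ignore anything that is not an ASCII letter
            }
            char code = CODES.charAt(c - 'a');
            if (out.length() == 0) {
                out.append(Character.toUpperCase(c)); // first letter is kept as-is
            } else if (code != '0' && code != prev) {
                out.append(code);
            }
            // 'h' and 'w' do not separate letters that share a digit; vowels do.
            if (c != 'h' && c != 'w') {
                prev = code;
            }
            if (out.length() == 4) {
                break;
            }
        }
        while (out.length() > 0 && out.length() < 4) {
            out.append('0'); // pad short codes with zeros
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(soundex("Robert")); // R163
        System.out.println(soundex("Rupert")); // R163
        System.out.println(soundex("ass"));    // A200
        System.out.println(soundex("ace"));    // A200
    }
}
```

The last two lines show the trade-off: Soundex is coarse, so harmless words can share a code with a listed word, and a filter built on it probably needs a whitelist or a second check.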

You could also use the replaceAll() method of the Matcher class. It replaces every occurrence of the pattern with the replacement string you specify. Something like the code below.

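    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    private static String replaceWord(String word, String input) {
        // Pattern.quote treats the word literally; \b restricts matches to whole words.
        Pattern legacyPattern = Pattern.compile("\\b" + Pattern.quote(word) + "\\b", Pattern.CASE_INSENSITIVE);
        Matcher matcher = legacyPattern.matcher(input);
        // Build the mask: one '*' per character except the last, which is kept.
        StringBuilder replacement = new StringBuilder();
        for (int i = 0; i < word.length() - 1; i++) {
            replacement.append('*');
        }
        replacement.append(word.charAt(word.length() - 1));
        // quoteReplacement guards against '$' or '\' in the kept character.
        return matcher.replaceAll(Matcher.quoteReplacement(replacement.toString()));
    }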
Licensed under: CC-BY-SA with attribution