Question

I'm going insane over this. It's so simple, yet I can't figure out the right regex. I need a regex that will match blacklisted words, e.g. "ass".

For example, in this string:

<span class="bob">Blacklisted word was here</span>bass

I tried that regex:

((?!class)ass)

It should match the "ass" in the word "bass" but NOT the one in "class". Instead, this regex flags "ass" in both occurrences. I looked up multiple negative lookahead examples on Google and none of them works.
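
A minimal sketch of why the attempt flags both occurrences, written in Python's re module as an assumption (the thread itself only shows bare patterns). The lookahead is evaluated at the position where "ass" would begin, and the text starting there can never be "class", so it filters nothing:

```python
import re

text = '<span class="bob">Blacklisted word was here</span>bass'

# (?!class) is checked at the same position where "ass" must start.
# The upcoming text there is "ass...", never "class", so the lookahead
# always succeeds and the pattern behaves exactly like plain "ass".
attempt = re.compile(r"((?!class)ass)")

attempt_hits = [m.start() for m in attempt.finditer(text)]
plain_hits = [m.start() for m in re.finditer(r"ass", text)]

print(attempt_hits == plain_hits)  # True
print(len(attempt_hits))           # 2: one inside "class", one inside "bass"
```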

NOTE: This is for a CMS, for moderators to easily find potentially bad words, I know you cannot rely on a computer to do the filtering.

Solution

It seems to me that you're actually trying to use two lists here: one for words that should be excluded (even if one is a part of some other word), and another for words that should not be changed at all - even though they have the words from the first list as substrings.

The trick here is to know where to use the lookbehind:

/ass(?<!class)/

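In other words, the negative lookbehind for the good word should follow the bad-word pattern, not precede it. Once "ass" has been consumed, the lookbehind inspects the characters ending at that point and rejects the match if they spell "class".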
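
As a rough illustration (Python's re is assumed here; the PHP syntax for this pattern is the same apart from the delimiters), running the pattern over the string from the question flags only the occurrence in "bass":

```python
import re

text = '<span class="bob">Blacklisted word was here</span>bass'

# Consume "ass" first, then look back over the five characters that end
# at the current position and reject the match if they spell "class".
pattern = re.compile(r"ass(?<!class)")

# Show each hit with one character of leading context.
matches = [text[max(0, m.start() - 1):m.end()] for m in pattern.finditer(text)]
print(matches)  # ['bass']
```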

You can even chain several of them:

/ass(?<!class)(?<!pass)(?<!bass)/

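This, though, will let through both passhole and pass, because the lookbehind only checks that "pass" ends at the current position, not that the word ends there. To make it even more bullet-proof, we can add word-boundary checks: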

/ass(?<!\bclass\b)(?<!\bpass\b)(?<!\bbass\b)/
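
A small sketch comparing the chained pattern with and without word boundaries. The word list and the helper name flagged are made up for illustration, and Python's re is assumed:

```python
import re

# Whitelist lookbehinds without and with word boundaries.
loose = re.compile(r"ass(?<!class)(?<!pass)(?<!bass)")
strict = re.compile(r"ass(?<!\bclass\b)(?<!\bpass\b)(?<!\bbass\b)")

def flagged(pattern, words):
    """Return the words in which the pattern finds a blacklisted hit."""
    return [w for w in words if pattern.search(w)]

words = ["class", "pass", "bass", "passhole", "ass"]

loose_hits = flagged(loose, words)
strict_hits = flagged(strict, words)

print(loose_hits)   # ['ass']
print(strict_hits)  # ['passhole', 'ass']
```

Only the bounded version treats "pass" as a whole whitelisted word, so "passhole" is still reported.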

UPDATE: of course, it's more efficient to check only the part of the word that precedes the match, with (?<!cl)(?<!b) etc. But my point was that you can still use the whole words from the whitelist in the regex.

Then again, perhaps it'd be wise to prepare the whitelists accordingly (so that only the shorter patterns have to be checked).

Other tips

If you have lookbehind available (you tagged this PHP, so you probably do; older JavaScript engines do not), this is trivial:

(?<!cl)(ass)

Without lookbehind, you probably need to do something like this:

(?:(?!cl)..|^.?)(ass)

That matches ass preceded by any two characters other than cl, or ass starting at most one character after the beginning of the line.
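
To check that the two forms agree, here is a short sketch in Python (an assumption, as are the sample words and the helper name flags). Capture group 1 holds the blacklisted word in both patterns:

```python
import re

with_lookbehind = re.compile(r"(?<!cl)(ass)")
# Lookbehind-free form: consume two characters that are not "cl",
# or at most one character right after the start, then capture "ass".
without_lookbehind = re.compile(r"(?:(?!cl)..|^.?)(ass)")

def flags(pattern, word):
    """True if the pattern captures "ass" in group 1 somewhere in word."""
    m = pattern.search(word)
    return m is not None and m.group(1) == "ass"

words = ["class", "bass", "ass", "grass"]
results = {w: (flags(with_lookbehind, w), flags(without_lookbehind, w)) for w in words}

for word, pair in results.items():
    print(word, pair)
```

Note that the lookbehind-free form consumes the preceding characters, so the full match is wider than "ass"; only group 1 is comparable between the two.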

Note that this is probably not the best way to implement a blacklist, though. You probably want this:

\bass\b

This matches the word ass on its own but not any word that merely contains it (like association or bass).
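
A quick sketch of the whole-word approach, again assuming Python's re; the second sample sentence is invented for the example:

```python
import re

whole_word = re.compile(r"\bass\b")

question_text = '<span class="bob">Blacklisted word was here</span>bass'
sample_text = "what an ass, said the association"

# Neither "class" nor "bass" has a word boundary directly before "ass".
in_question = whole_word.findall(question_text)
# The standalone word is found; the "ass" inside "association" is not.
in_sample = whole_word.findall(sample_text)

print(in_question)  # []
print(in_sample)    # ['ass']
```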

Is this what you want? (?<!class)(\w+ass)

Licensed under: CC-BY-SA with attribution