Question

I'm writing my own anti-spam/bad-words filter, and I need, if possible,

to match (detect) only words formed from mixed characters, like fr1&nd$ but not friends.

Is this possible with regex?

Best regards!

Solution

Of course it's possible with regex! You're not asking to match nested parentheses! :P

But yes, this is the kind of thing regular expressions were built for. An example:

/\S*[^\w\s]+\S*/

This will match all of the following:

@ss
as$
a$s
@$s
a$$
@s$
@$$

It will not match this:

ass

Which I believe is what you want. How it works:

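\S* matches 0 or more non-space characters. [^\w\s]+ matches only the symbols (anything that is neither a word character nor whitespace) and needs 1 or more of them, so at least one symbol character is required. Then the \S* again matches 0 or more non-space characters (symbols and letters). Keep in mind that \w covers digits and the underscore as well as letters, so a word disguised only with digits, such as fr1end, is not matched.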
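For what it's worth, here is a minimal PHP sketch of that pattern in use. The sample text and the variable names are made up for illustration, and the input is assumed to be plain whitespace-separated text. It collects every token that contains at least one symbol:

```php
<?php
// Hypothetical sample input; not from the original post.
$text = 'hello fr1&nd$ my @ss friends a$$ fr1end';

// \S*      zero or more non-space characters
// [^\w\s]+ one or more characters that are neither word characters nor whitespace
// \S*      the rest of the token
preg_match_all('/\S*[^\w\s]+\S*/', $text, $matches);

// $matches[0] holds the full matches: fr1&nd$, @ss and a$$
$flagged = $matches[0];
print_r($flagged);
```

The digit-only token fr1end is left out on purpose to show what this pattern does not flag.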

If I may be allowed to suggest a better strategy, in Perl you can store a regex in a variable. I don't know if you can do this in PHP, but if you can, you can construct a list of variables like such:

$a = qr/[aA@]/;  # compiled regex that matches all a-like symbols
$b = qr/[bB]/;
$c = qr/[cC(]/;
# etc...

Or, in PHP, where a pattern is just a string:

$regex = array( 'a' => '[aA@]', 'b' => '[bB]', 'c' => '[cC(]', /* ... */ );

So that way, you can match "friend" in all its variations with:

/$f$r$i$e$n$d/

Or:

"/{$regex['f']}{$regex['r']}{$regex['i']}{$regex['e']}{$regex['n']}{$regex['d']}/"

Granted, the second one looks unnecessarily verbose, but that's PHP for you. I think the second one is probably the best solution, since it stores them all in a hash, rather than all as separate variables, but I admit that the regex it produces is a bit ugly.
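To cut down on the verbosity, the pattern can be assembled from the hash in a loop. This is only a sketch under my own assumptions: the substitution table is a made-up starting point, and build_pattern is a hypothetical helper name, not something PHP provides.

```php
<?php
// Hypothetical substitution table: each letter maps to a character class
// of look-alike characters. Extend it with whatever symbols you see in the wild.
$classes = array(
    'd' => '[dD]',
    'e' => '[eE3&]',
    'f' => '[fF]',
    'i' => '[iI1!]',
    'n' => '[nN]',
    'r' => '[rR]',
    's' => '[sS$5]',
);

// Build a regex for one word by swapping each letter for its class.
// Characters missing from the table are matched literally.
function build_pattern(string $word, array $classes): string
{
    $parts = array();
    foreach (str_split(strtolower($word)) as $char) {
        $parts[] = isset($classes[$char]) ? $classes[$char] : preg_quote($char, '/');
    }
    return '/' . implode('', $parts) . '/';
}

$pattern = build_pattern('friends', $classes);
var_dump(preg_match($pattern, 'fr1&nd$')); // int(1)
var_dump(preg_match($pattern, 'friends')); // int(1)
var_dump(preg_match($pattern, 'fronds'));  // int(0)
```

Note that this matches the plain word as well as the disguised ones, which is usually what a bad-words filter wants.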

OTHER TIPS

It is possible. You will not have very pretty regex rules, but you can match basically any pattern that you can describe using regex. The tricky part is describing it.

I would guess that you would have a bunch of regex rules, one per bad word.

For example, to detect fr1&nd$, friends, and fr**nd*, you can use a regex like:

/fr[1iI*][&eE*]nd[s$Sz*]/

Doing something like this for each rule will find all the variations built from the characters in the brackets. Pick up a regex guide for more info.

(I'm assuming that for a bad-words filter you would want to catch friend as well as frie**; you may want to mask the plain bad word as well as all of its possible variations.)
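A rough sketch of the masking idea in PHP, assuming one hand-written pattern per word. The rule list and the mask_bad_words name are hypothetical, and the rule here is a variant of the pattern above with the last character made optional so that friend is caught too:

```php
<?php
// Hypothetical rule list: one hand-written pattern per filtered word.
$rules = array(
    '/fr[1iI*][&eE*]nd[s$Sz*]?/',
);

// Replace every match of any rule with asterisks of the same length.
function mask_bad_words(string $text, array $rules): string
{
    return preg_replace_callback(
        $rules,
        function (array $m): string {
            return str_repeat('*', strlen($m[0]));
        },
        $text
    );
}

echo mask_bad_words('my fr1&nd$ and my friend', $rules), "\n";
// prints: my ******* and my ******
```

strlen counts bytes, so for multibyte input you would probably want mb_strlen and the u modifier on each rule.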

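I didn't test this thoroughly, but this should do it (note that \w does not cover symbols such as & or $, so on fr1&nd$ it matches only the leading fr1):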

(\w+)*(?<=[^A-Za-z ])

You could build some regular expressions like the following:

\p{L}+[\d\p{S}]+\S*

This will match any sequence of one or more letters (\p{L}+, see Unicode character properties), followed by one or more digits or symbols ([\d\p{S}]+) and any following non-whitespace characters (\S*).

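$str = 'fr1&nd$ and not friends';
// The u modifier treats pattern and subject as UTF-8, so \p{L} also covers non-ASCII letters.
preg_match('/\p{L}+[\d\p{S}]+\S*/u', $str, $match);
var_dump($match); // $match[0] is "fr1&nd$"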
Licensed under: CC-BY-SA with attribution