Question

I am trying to build a swearing-prevention system. So far I ignore whitespace (with "\s*") and case (with "(?i)"). How would I ignore repeated characters, e.g. heeeello?

Solution

There is no flag that you can turn on to simply ignore duplicate characters. However, you can use the 'one or more' quantifier (+) to match one or more occurrences of any character, character class, or group. For example, the pattern he+l+o will match all of the following:

  • helo
  • heelo
  • hello
  • heeeello
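
As a quick check, here is a minimal sketch in Python (chosen only for illustration, since the question does not name a language). The word "hllo" is an extra counterexample that is not in the list above.

```python
import re

# Pattern from the answer: e and l may each repeat one or more times.
pattern = re.compile(r"he+l+o", re.IGNORECASE)

candidates = ["helo", "heelo", "hello", "heeeello", "hllo"]

# fullmatch requires the whole word to fit the pattern.
results = {word: bool(pattern.fullmatch(word)) for word in candidates}

for word, matched in results.items():
    print(f"{word}: {'match' if matched else 'no match'}")
```

Only "hllo" fails, because + requires at least one e.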

OTHER TIPS

Assuming you want a general solution to remove repeated characters, replace (.)\1 with \1 repeatedly for as long as the replacement succeeds. Alternatively, replacing (.)\1+ with \1 globally does the same in a single pass.
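
A minimal sketch of that replacement in Python, assuming the single-pass variant (.)\1+ rather than looping over (.)\1. The function name collapse_repeats is made up for this example.

```python
import re

def collapse_repeats(text: str) -> str:
    """Replace every run of a repeated character with a single copy."""
    # (.)\1+ matches a character followed by one or more copies of itself,
    # so one global substitution collapses every run.
    return re.sub(r"(.)\1+", r"\1", text)

print(collapse_repeats("heeeello"))  # prints: helo
```

Note that this also collapses legitimate double letters, so "hello" becomes "helo" as well.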

Use + to match as many repetitions as there are of the preceding character, or of a sequence grouped in (). For example, e+ will match all the e's in heeeeello.

Check out rubular.com, a very simple way to learn, practice, and test regex.

You need to capture a single character, then check for any repetition of it using a backreference to the captured group:

(.)\1+

If the pattern matches, the string contains a repeated character.
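
A small Python sketch of that check. The names REPEAT and has_repeat are hypothetical, introduced only for this example.

```python
import re

# A captured character followed by one or more copies of itself.
REPEAT = re.compile(r"(.)\1+")

def has_repeat(text: str) -> bool:
    """Return True if any character occurs two or more times in a row."""
    return REPEAT.search(text) is not None

first_run = REPEAT.search("heeeello").group(0)  # the first repeated run

print(has_repeat("heeeello"))
print(has_repeat("helo"))
print(first_run)
```

This only detects repetition. Ordinary words such as "hello" also contain a doubled letter, so the check cannot identify a banned word on its own.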

The problem is harder than you think. Let's assume that you want to match "no fewer than this number of characters" for each word in your dictionary. Then you would have to create a dictionary of regexes with a + after each character…

Initial dictionary:
boom
smurf
tree
cannibals

Process the dictionary with a stream editor such as sed:

sed -e 's/\(.\)/\1+/g' dictionary.txt > regex.txt

This puts a + after every character:

b+o+o+m+
s+m+u+r+f+
t+r+e+e+
c+a+n+n+i+b+a+l+s+

And now you can match your "repeated" words (each pattern is tested against the whole word):

bom           : no match
smuuurf       : match
trees         : no match
canibals      : no match
cannnibalssss : match
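
The same steps can be sketched in Python instead of sed. This is an illustration under assumptions, not the answer's own code: the helpers to_pattern and is_banned are hypothetical names, and whole-word matching via fullmatch is assumed because it reproduces the results listed above.

```python
import re

dictionary = ["boom", "smurf", "tree", "cannibals"]

def to_pattern(word: str) -> str:
    # Append + to every character, as the sed command does. re.escape keeps
    # any regex metacharacters in a dictionary word literal.
    return "".join(re.escape(ch) + "+" for ch in word)

patterns = [re.compile(to_pattern(w), re.IGNORECASE) for w in dictionary]

def is_banned(candidate: str) -> bool:
    # fullmatch requires the whole candidate to fit one of the patterns.
    return any(p.fullmatch(candidate) for p in patterns)

tests = ["bom", "smuuurf", "trees", "canibals", "cannnibalssss"]
for candidate in tests:
    print(f"{candidate}: {'match' if is_banned(candidate) else 'no match'}")
```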

You might want to add word boundaries so that smurfette doesn't get caught by smurf. This means adding \b before and after each expression.
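
A short Python sketch of the difference when searching inside longer text. The sample sentence is made up for the example.

```python
import re

unbounded = re.compile(r"s+m+u+r+f+", re.IGNORECASE)
bounded = re.compile(r"\bs+m+u+r+f+\b", re.IGNORECASE)

# Without \b the pattern finds "smurf" inside "smurfette".
unbounded_hit = unbounded.search("smurfette") is not None

# With \b there is no word boundary between "f" and "e", so nothing matches.
bounded_hit = bounded.search("smurfette") is not None

# A stretched spelling standing as its own word is still caught.
sentence_hit = bounded.search("a smuuurf appears") is not None

print(unbounded_hit, bounded_hit, sentence_hit)
```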

Note: it's not enough to remove all duplicate letters from both the dictionary and the words to be matched. Otherwise you risk banning "pop" because you had "poop" on your list (and when collapsing "pooop", how would you know to stop at exactly two o's?). This is why I prefer this solution over the others that recommend stripping repeats.
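
To make the "pop" versus "poop" point concrete, here is a hedged Python sketch comparing the two approaches. The helper collapse and the variable names are hypothetical.

```python
import re

def collapse(text: str) -> str:
    # Strip repeats: every run of one character becomes a single copy.
    return re.sub(r"(.)\1+", r"\1", text)

banned = "poop"

# Stripping repeats on both sides makes the harmless word collide.
collision = collapse("pop") == collapse(banned)

# The + approach keeps the minimum letter count of the dictionary word.
plus_pattern = re.compile("".join(re.escape(c) + "+" for c in banned))
pop_flagged = plus_pattern.fullmatch("pop") is not None
pooop_flagged = plus_pattern.fullmatch("pooop") is not None

print(collision, pop_flagged, pooop_flagged)
```

Collapsing flags "pop", while p+o+o+p+ leaves it alone and still catches "pooop".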

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow