The problem is harder than you think. Let's assume that, for each word in your dictionary, you want to match every character repeated at least as many times as it appears in that word. Then you would have to create a dictionary of regexes with a + after each character…
Initial dictionary:
boom
smurf
tree
cannibals
Process the dictionary with a stream editor such as sed:
sed -e 's/\(.\)/\1\+/g' dictionary.txt > regex.txt
This puts a + after every character:
b+o+o+m+
s+m+u+r+f+
t+r+e+e+
c+a+n+n+i+b+a+l+s+
And now you can match your "repeated" words (each candidate is tested as a whole word against the patterns):
bom : no match
smuuurf : match
trees : no match
canibals : no match
cannnibalssss : match
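The results above can be reproduced with a short script. This is only a sketch, assuming a POSIX shell with sed and grep -E; the helper names to_regex and is_repeated are made up for illustration, and the ^ and $ anchors are my addition so that each candidate is compared as a whole word. It also assumes the dictionary contains only letters, since regex metacharacters would need escaping first.

```shell
# Hypothetical helper: read dictionary words (one per line) on stdin and
# emit an anchored pattern for each, e.g. "boom" -> "^b+o+o+m+$".
to_regex() {
    sed -e 's/./&+/g' -e 's/^/^/' -e 's/$/$/'
}

# Hypothetical helper: succeed if word $1 matches any of the
# newline-separated patterns in $2.
is_repeated() {
    printf '%s\n' "$1" | grep -Eq -e "$2"
}

patterns=$(printf '%s\n' boom smurf tree cannibals | to_regex)

for word in bom smuuurf trees canibals cannnibalssss; do
    if is_repeated "$word" "$patterns"; then
        echo "$word : match"
    else
        echo "$word : no match"
    fi
done
```

Without the anchors, trees would match because it contains tree; that is the case the word boundaries are meant to handle when searching inside longer text.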
You might want to add "word boundaries" so that smurfette doesn't get caught by smurf. This would mean adding \b before and after each expression.
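A possible way to add the boundaries while generating the patterns, sketched under the assumption that your grep understands \b (GNU grep does; other implementations may not). The helper name to_bounded_regex is hypothetical.

```shell
# Hypothetical helper: "smurf" -> "\bs+m+u+r+f+\b" (one word per line on stdin).
# In the sed replacement, "\\" produces a literal backslash.
to_bounded_regex() {
    sed -e 's/./&+/g' -e 's/^/\\b/' -e 's/$/\\b/'
}

pattern=$(printf '%s\n' smurf | to_bounded_regex)

# -o prints only the matched text, one match per line.
printf '%s\n' 'smurfette met a smuuurf' | grep -Eo -e "$pattern"
```

Here smurfette is skipped because there is no boundary between the f and the e, while the stretched smuuurf is still caught.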
Note: it's not enough to remove all duplicate letters from both the dictionary and the words to be matched. Otherwise you risk banning "pop" because you had "poop" on your list: once "pooop" has been stripped of its repeats, how would you know it originally had at least two o's? This is why I prefer this solution over some of the others that recommend stripping repeats.
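To make the collision concrete, here is a small sketch that uses tr -s as a stand-in for the "strip repeats" approach (the helper name squeeze is made up, and the other answers may strip repeats differently). It assumes lowercase ASCII words.

```shell
# Hypothetical helper: collapse every run of a repeated lowercase letter
# into a single letter.
squeeze() {
    printf '%s\n' "$1" | tr -s 'a-z'
}

# After squeezing, the banned word and the harmless one are identical.
for word in poop pooop pop; do
    echo "$word -> $(squeeze "$word")"
done

# The + pattern keeps the distinction: "pop" has too few o's.
if printf '%s\n' pop | grep -Eq '^p+o+o+p+$'; then
    echo "pop : match"
else
    echo "pop : no match"
fi
```

All three inputs squeeze to the same string, so a squeezed dictionary cannot tell them apart, whereas the pattern built from poop still requires at least two o's.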