سؤال

This is a crossword problem. Example:

  • the solution is a 6-letter word which starts with "r" and ends with "r"
  • thus the pattern is "r....r"
  • the unknown 4 letters must be drawn from the pool of letters "a", "e", "i" and "p"
  • each letter must be used exactly once
  • we have a large list of candidate 6-letter words

Solutions: "rapier" or "repair".

Filtering for the pattern "r....r" is trivial, but finding words which also have [aeip] in the "unknown" slots is beyond me.

Is this problem amenable to a regex, or must it be done by exhaustive methods?

هل كانت مفيدة؟

المحلول

Try this:

r(?:(?!\1)a()|(?!\2)e()|(?!\3)i()|(?!\4)p()){4}r

...or more readably:

r
(?:
  (?!\1) a () |
  (?!\2) e () |
  (?!\3) i () |
  (?!\4) p ()
){4}
r

The empty groups serve as check marks, ticking off each letter as it's consumed. For example, if the word to be matched is repair, the e will be the first letter matched by this construct. If the regex tries to match another e later on, that alternative won't match it. The negative lookahead (?!\2) will fail because group #2 has participated in the match, and never mind that it didn't consume anything.

What's really cool is that it works just as well on strings that contain duplicate letters. Take your redeem example:

r
(?:
  (?!\1) e () |
  (?!\2) e () |
  (?!\3) e () |
  (?!\4) d ()
){4}
m

After the first e is consumed, the first alternative is effectively disabled, so the second alternative takes it instead. And so on...

Unfortunately, this technique doesn't work in all regex flavors. For one thing, they don't all treat empty/failed group captures the same. The ECMAScript spec explicitly states that references to non-participating groups should always succeed.

The regex flavor also has to support forward references--that is, backreferences that appear before the groups they refer to in the regex. (ref) It should work in .NET, Java, Perl, PCRE and Ruby, that I know of.

نصائح أخرى

Assuming that you meant that the unknown letters must be among [aeip], then a suitable regex could be:

/r[aeip]{4,4}r/

What's the front end language being used to compare strings, is it java, .net ...

here is an example/psuedo code using java

    String mandateLetters = "aeio"
    String regPattern = "\\br["+mandateLetters+"]*r$";  // or if for specific length \\br[+mandateLetters+]{4}r$

    Pattern pattern = Pattern.compile(regPattern);
    Matcher matcher = pattern.matcher("is this repair ");

    matcher.find();

Why not replace each '.' in your original pattern with '[aeip]'?

You'd wind up with a regex string r[aeip][aeip][aeip][aeip]r.

This could of course be shortened to r[aeip]{4,4}r, but that would be a pain to implement in the general case and probably wouldn't improve the code any.

This doesn't address the issue of duplicate letter use. If I were coding it, I'd handle that in code outside the regexp - mostly because the regexp would get more complicated than I'd care to handle.

So the "only once" part is the critical thing. Listing all permutations is obviously not feasible. If your language/environment supports lookaheads and backreferences you can make it a bit easier for yourself:

r(?=[aeip]{4,4})(.)(?!\1)(.)(?!\1|\2)(.)(?!\1|\2|\3).r

Still quite ugly, but here is how it works:

r     # match an r
(?=   # positive lookahead (doesn't advance position of "cursor" in input string)
  [aeip]{4,4}
)     # make sure that there are the four desired character ahead
(.)   # match any character and capture it in group 1
(?!\1)# make sure that the next character is NOT the same as the previous one
(.)   # match any character and capture it in group 2
(?!\1|\2)
      # make sure that the next character is neither the first nor the second
(.)   # match any character and capture it in group 3
(?!\1|\2|\3)
      # same thing again for all three characters
.     # match another arbitrary character
r     # match an r

Working demo.

This is neither really elegant nor scalable. So you might just want to use r([aiep]{4,4})r (capturing the four critical letters) and ensure the additional condition without regex.

EDIT: In fact, the above pattern is only really useful and necessary if you just want to ensure that you have 4 non-identical characters. For your specific case, again using lookaheads, there is simpler (despite longer) solution:

r(?=[^a]*a[^a]*r)(?=[^e]*e[^e]*r)(?=[^i]*i[^i]*r)(?=[^p]*p[^p]*r)[aeip]{4,4}r

Explained:

r       # match an r
(?=     # lookahead: ensure that there is exactly one a until the next r
  [^a]* # match an arbitrary amount of non-a characters
  a     # match one a
  [^a]* # match an arbitrary amount of non-a characters
  r     # match the final r
)       # end of lookahead
(?=[^e]*e[^e]*r)  # ensure that there is exactly one e until the next r
(?=[^i]*i[^i]*r)  # ensure that there is exactly one i until the next r
(?=[^p]*p[^p]*r)  # ensure that there is exactly one p until the next r
[aeip]{4,4}r      # actually match the rest to include it in the result

Working demo.

For r....m with a pool of deee, this could be adjusted as:

r(?=[^d]*d[^d]*m)(?=[^e]*(?:e[^e])*{3,3}m)[de]{4,4}m

This ensures that there is exactly one d and exactly 3 es.

Working demo.

not fully regex due to sed multi regex action

sed -n -e '/^r[aiep]\{4,\}r$/{/\([aiep]\).*\1/!p;}' YourFile

take pattern 4 letter in group aeipsourround by r, keep only line where no letter in the sub group is found twice.

A more scalable solution (no need to write \1, \2, \3 and so on for each letter or position) is to use negative lookahead to assert that each character is not occurring later:

^r(?:([aeip])(?!.*\1)){4}r$

more readable as:

^r
(?:
  ([aeip])
  (?!.*\1)
){4}
r$

Improvements

This was a quick solution which works in the situation you gave us, but here are some additional constraints to have a robuster version:

  • If the "pool of letters" may share some letters with the end of string, include the end of pattern in the lookahead:

    ^r(?:([aeip])(?!.*\1.*\2)){4}(r$)
    

    (may not work as intended in all regex flavors, in which case copy-paste the end of pattern instead of using \2)

  • If some letters must be present not only once but a different fixed number of times, add a separate lookahead for all letters sharing this number of times. For instance, "r....r" with one "a" and one "p" but two "e" would be matched by this regex (but "rapper" and "repeer" wouldn't):

    ^r(?:([ap])(?!.*\1.*\3)|([e])(?!.*\2.*\2.*\3)){4}(r$)
    

    The non-capturing groups now has 2 alternatives: ([ap])(?!.*\1.*\3) which matches "a" or "p" not followed anywhere until ending by another one, and ([e])(?!.*\2.*\2.*\3) which matches "e" not followed anywhere until ending by 2 other ones (so it fails on the first one if there are 3 in total). BTW this solution includes the above one, but the end of pattern is here shifted to \3 (also, cf. note about flavors).

مرخصة بموجب: CC-BY-SA مع الإسناد
لا تنتمي إلى StackOverflow
scroll top