Finding the shortest sub-segment containing a list of words in any order using a regular expression
16-06-2021
Question
I have a list of words and need to find the shortest sub-segment of a text that contains all of them, ignoring any special characters and digits. The words may appear in any order within the sub-segment, the search must be case-insensitive, and the solution has to be written in Java.
For example, given the text
aaaa aaaa cccc cccc bbbb bbbb bbbb bbbb Bbbb Aaaa Cccc
and the words
aaaa
bbbb
cccc
the output should be
Bbbb Aaaa Cccc
I am aware of regular expressions (regex in Java) but am new to them, so any help would be appreciated.
Solution
What you could do is to construct a regex like this:
(?i)\b(aaaa|bbbb|cccc)(?=\W+(\w+)\W+(\w+)\b)
\__/  \______________/   \______/        \__ makes sure it's a complete word
 |           |               \____ repeat N-1 times (N = number of words)
 |           \___ all words alternated to match the first word
 \__ case insensitive matching
Then, in Java, check that the capturing groups together contain all the words. If they do, you have found a match; if not, search for the next match and repeat.
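The match-then-check loop described above could be sketched roughly as follows. This is only an illustration, not a definitive implementation: the class name WindowSearch and the methods buildPattern and findSegment are made up for this example, and it assumes the words in the list are distinct and consist of word characters only.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
class WindowSearch {

    // Builds (?i)\b(w1|w2|...)(?=\W+(\w+)...\b) with N-1 groups in the lookahead.
    static Pattern buildPattern(List<String> words) {
        StringBuilder sb = new StringBuilder("(?i)\\b(");
        for (int i = 0; i < words.size(); i++) {
            if (i > 0) sb.append('|');
            sb.append(Pattern.quote(words.get(i)));
        }
        sb.append(")(?=");
        for (int i = 1; i < words.size(); i++) {
            sb.append("\\W+(\\w+)");
        }
        sb.append("\\b)");
        return Pattern.compile(sb.toString());
    }

    // Returns the first window of N consecutive words that contains every
    // listed word (in any order, ignoring case), or null if there is none.
    // Assumption: the entries of 'words' are distinct.
    static String findSegment(String text, List<String> words) {
        Set<String> wanted = new HashSet<>();
        for (String w : words) {
            wanted.add(w.toLowerCase(Locale.ROOT));
        }
        Matcher m = buildPattern(words).matcher(text);
        while (m.find()) {
            // Collect the first word plus the words captured inside the lookahead.
            Set<String> seen = new HashSet<>();
            for (int g = 1; g <= m.groupCount(); g++) {
                seen.add(m.group(g).toLowerCase(Locale.ROOT));
            }
            if (seen.equals(wanted)) {
                // The window spans from the first group to the end of the last group.
                return text.substring(m.start(1), m.end(m.groupCount()));
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String text = "aaaa aaaa cccc cccc bbbb bbbb bbbb bbbb Bbbb Aaaa Cccc";
        System.out.println(findSegment(text, List.of("aaaa", "bbbb", "cccc")));
    }
}
```

Because only the first word is consumed and the rest sit inside the lookahead, the next call to find() resumes right after that first word, so overlapping windows are still examined.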
You could also take this all the way and solve it with regex only, but you have to construct the proper expression:
(?i)\b(words)\W+(?!\1\b)(words)\W+(?!(?:\1|\2)\b)(words)\b
       \___/ \________________/   \_____________/
         |           |                   |
  list of all the    |       lookahead has to include
  words alternated   |       all previous capturing groups
                     |
         repeat N-1 times but you have to
         change the lookahead each time
This would be a pretty big expression for many words, although (words) can be any expression that matches all allowed words (it doesn't have to be an alternation).
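Since the lookahead changes with every repetition, the expression is easier to generate than to write by hand. The following is a rough sketch of one way to do that in Java; the class name PureRegexSearch and its methods are hypothetical, and it assumes (words) is a plain alternation of distinct, literal words.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
class PureRegexSearch {

    // Builds (?i)\b(words)\W+(?!(?:\1)\b)(words)\W+(?!(?:\1|\2)\b)(words)...\b
    // where (words) is the alternation of all listed words.
    static Pattern buildPattern(List<String> words) {
        StringBuilder alt = new StringBuilder("(");
        for (int i = 0; i < words.size(); i++) {
            if (i > 0) alt.append('|');
            alt.append(Pattern.quote(words.get(i)));
        }
        alt.append(')');

        StringBuilder sb = new StringBuilder("(?i)\\b").append(alt);
        for (int k = 2; k <= words.size(); k++) {
            // The k-th word must differ from every previously captured group.
            sb.append("\\W+(?!(?:");
            for (int g = 1; g < k; g++) {
                if (g > 1) sb.append('|');
                sb.append('\\').append(g);
            }
            sb.append(")\\b)").append(alt);
        }
        sb.append("\\b");
        return Pattern.compile(sb.toString());
    }

    // Returns the first matching sub-segment, or null if there is none.
    static String findSegment(String text, List<String> words) {
        Matcher m = buildPattern(words).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "aaaa aaaa cccc cccc bbbb bbbb bbbb bbbb Bbbb Aaaa Cccc";
        System.out.println(buildPattern(List.of("aaaa", "bbbb", "cccc")).pattern());
        System.out.println(findSegment(text, List.of("aaaa", "bbbb", "cccc")));
    }
}
```

With (?i) in effect, Java also compares backreferences case-insensitively, so a captured bbbb rules out a following Bbbb.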