Question

I currently have:

 regex = r'\bon the\b'

but I need my regex to match only if this keyword (currently "on the") is not between parentheses in the text:

should match:

john is on the beach
let me put this on the fridge
he (my son) is on the beach
arnold is on the road (to home)

should not match:

(my son is )on the beach
john is at the beach
bob is at the pool (berkeley)
the spon (is on the table)

Solution

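I don't think a regex will help you here in the general case. For examples like yours, the regex below comes close: it rejects "the spon (is on the table)", but because it only inspects the four characters on each side of the keyword, it still matches "(my son is )on the beach":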

(?<=[^\(\)].{3})\bon the\b(?=.{3}[^\(\)])

description:

(?<=[^\(\)].{3}) Positive Lookbehind - Assert that the regex below 
                 can be matched
    [^\(\)] match a single character not present in the list below
        \( matches the character ( literally
        \) matches the character ) literally
    .{3} matches any character (except newline)
        Quantifier: Exactly 3 times
\b assert position at a word boundary (^\w|\w$|\W\w|\w\W)
on the matches the characters on the literally (case sensitive)
\b assert position at a word boundary (^\w|\w$|\W\w|\w\W)
(?=.{3}[^\(\)]) Positive Lookahead - Assert that the regex below 
                can be matched
    .{3} matches any character (except newline)
        Quantifier: Exactly 3 times
    [^\(\)] match a single character not present in the list below
        \( matches the character ( literally
        \) matches the character ) literally
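
To see how the pattern behaves, it can be run against the sample lines from the question. This is only a sketch: the names pattern, samples and results are made up for illustration, and it assumes the form of the regex with balanced parentheses.

```python
import re

# Fixed-width lookbehind/lookahead version of the pattern described above.
pattern = re.compile(r'(?<=[^\(\)].{3})\bon the\b(?=.{3}[^\(\)])')

# Sample lines taken from the question.
samples = [
    'john is on the beach',
    'let me put this on the fridge',
    'he (my son) is on the beach',
    'arnold is on the road (to home)',
    '(my son is )on the beach',
    'john is at the beach',
    'bob is at the pool (berkeley)',
    'the spon (is on the table)',
]

# Map each line to True/False depending on whether the pattern is found.
results = {line: bool(pattern.search(line)) for line in samples}

for line, found in results.items():
    print('%-35s %s' % (line, 'match' if found else 'no match'))
```

Since only four characters on each side are inspected, "(my son is )on the beach" still comes out as a match, and "the spon (is on the table)" is rejected only because its opening parenthesis sits exactly four characters before the keyword.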

If you want to generalize the problem to text of any length between the parenthesis and the string you are searching for, this regex will not work. The issue is the length of the text between the parenthesis and your string: Python's re module, like most regex engines, requires a lookbehind to have a fixed width, so unbounded quantifiers such as * or + are not allowed inside it.

In my regex I used positive Lookahead and positive Lookbehind, the same result could be achieved as well with negative ones, but the issue remains.
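
The fixed-width restriction is easy to confirm in Python. The helper name compiles below is hypothetical, chosen only for this sketch; it reports whether the standard re module accepts a pattern.

```python
import re

def compiles(regex):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(regex)
        return True
    except re.error:
        return False

# Fixed-width lookbehind (exactly 4 characters): accepted.
fixed_ok = compiles(r'(?<=[^()].{3})\bon the\b')

# Variable-width lookbehind ("(" followed by any number of characters): rejected.
variable_ok = compiles(r'(?<=\(.*)\bon the\b')

print(fixed_ok, variable_ok)
```

The second pattern is the one you would need in order to say "there is an opening parenthesis somewhere before the keyword", which is why the lookbehind approach does not generalize.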

Suggestion: write a small piece of Python code that checks each whole line for your text outside parentheses, as a regex alone can't do the job.

example:

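import re

mystr = 'on the'
data = '\n'.join(mylist)  # all lines joined into one string to search
# here you put the unwanted string patterns, which are easy to define with a regex
unwanted = re.findall(r'\(.*' + re.escape(mystr) + r'.*\)|\)' + re.escape(mystr), data)
# drop lines containing an unwanted string (build a new list rather than
# removing items from mylist while iterating over it)
mylist = [line for line in mylist if not any(item in line for item in unwanted)]
# look for what you want
for line in mylist:
    if mystr in line:
        print(line)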

where:

mylist: a list containing all the lines you want to search through.
mystr: the string you want to find.
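
A different way to check a whole line is to scan it once and keep track of how deeply nested in parentheses the current position is. The sketch below is not the method used above; the function names find_outside_parens and is_word_char are made up for illustration, and it assumes the keyword itself contains no parentheses and starts and ends with a word character.

```python
def is_word_char(ch):
    """Rough equivalent of the regex class \\w for a single character."""
    return ch.isalnum() or ch == '_'

def find_outside_parens(line, keyword):
    """Return the start indexes of keyword occurrences at parenthesis depth 0."""
    hits = []
    depth = 0
    for i, ch in enumerate(line):
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth = max(depth - 1, 0)  # ignore an unmatched ")"
        elif depth == 0 and line.startswith(keyword, i):
            end = i + len(keyword)
            # emulate \b on both sides of the keyword
            before_ok = i == 0 or not is_word_char(line[i - 1])
            after_ok = end == len(line) or not is_word_char(line[end])
            if before_ok and after_ok:
                hits.append(i)
    return hits

for sample in ['john is on the beach', 'the spon (is on the table)']:
    print(sample, '->', find_outside_parens(sample, 'on the'))
```

This handles nested parentheses and text of any length, but it treats "(my son is )on the beach" as a match, because the keyword there is outside the parentheses. If that line must be rejected, an extra check for a ")" directly before the keyword would be needed.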

Hope this helped.

OTHER TIPS

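On UNIX, piping grep into grep -v with the following regular expressions is sufficient. The parentheses are left unescaped because, in grep's default basic regular expressions, \( and \) delimit a group, while a plain ( or ) matches the literal character:

grep " on the " input_file_name | grep -v "(.* on the .*)"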

How about something like this: ^(.*)(?:\(.*\))(.*)$ see it in action.

As you requested, it "matches only words that are not between parentheses in the text"

So, from:

some text (more text in parentheses) and some not in parentheses

Matches: some text + and some not in parentheses

More examples at the link above.
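
As a quick check in Python, the pattern can be applied to the sample sentence. This is a sketch; the variable names outside, text, before, after and no_parens are chosen here for illustration only.

```python
import re

# Pattern from the answer: text before "(...)" and text after it.
outside = re.compile(r'^(.*)(?:\(.*\))(.*)$')

text = 'some text (more text in parentheses) and some not in parentheses'
m = outside.match(text)
before, after = m.groups()
print(repr(before), repr(after))

# A line without any parentheses does not match at all.
no_parens = outside.match('john is on the beach')
print(no_parens)
```

Two limits are worth keeping in mind: the pattern requires a parenthesized part to be present, and when a line has several parenthesized parts the greedy first group keeps the earlier ones, so only the last one is skipped.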


EDIT: changing answer since the question was changed.

To capture all mentions not within parentheses I'd use some code instead of a huge regex.

Something like this will get you close:

import re

pattern = r"(on the)"

test_text = '''john is on the beach
let me put this on the fridge
he (my son) is on the beach
arnold is on the road (to home)
(my son is )on the beach
john is at the beach
bob is at the pool (berkeley)
the spon (is on the table)'''

bracket_pattern = r"\([^()]*\)"  # everything between ( and ), one pair at a time

for line in test_text.split('\n'):
    # remove the parenthesized parts, then search what is left
    stripped = re.sub(bracket_pattern, "", line)
    matches = re.findall(pattern, stripped)
    print("%s -> %s" % (line, " ".join(matches)))

Output:

john is on the beach -> on the
let me put this on the fridge -> on the
he (my son) is on the beach -> on the
arnold is on the road (to home) -> on the
(my son is )on the beach -> on the (this is the only one that doesn't work)
john is at the beach -> 
bob is at the pool (berkeley) -> 
the spon (is on the table) -> 
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow