Question

I have a file like

HEADER foo bar
garbage
SUBHEADER foo foo bar
other garbage
SUBHEADER foo foo bar bar
HEADER foo baz
SUBHEADER foo bar foo foo
SUBHEADER foo foo foo foo
SOMETHING bar bar bar
HEADER baz baz
SUBHEADER baz bar baz foo

where the capitalized words occur literally. I want to find SOMETHING together with its corresponding HEADER and SUBHEADER, i.e.,

HEADER foo baz
SUBHEADER foo foo foo foo
SOMETHING bar bar bar

It's rather trivial in a program, but can a regex do it? I could imagine a solution using negative assertions, but that gets very unreadable.

Solution

If you're looking for the nearest prior HEADER and SUBHEADER before the SOMETHING, then I think you just want non-greedy matching in your regex. This assumes you have a regex processor that can match across multiple lines at once, which generally rules out line-oriented tools such as grep and sed in their default modes.

For example, something like this:

(^HEADER.*?$).*?(^SUBHEADER.*?$).*?(^SOMETHING.*?$)

I'm also assuming that '.' does match newlines (as in PCRE_DOTALL mode), and that '^'/'$' will match beginning/end-of-line in the middle of the string (as in PCRE_MULTILINE mode). These are configurable options in many regex implementations.
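
To make those two options concrete, here is a minimal sketch using Python's re module, where re.DOTALL and re.MULTILINE play the roles of PCRE_DOTALL and PCRE_MULTILINE. The names SAMPLE and LAZY are made up for illustration, and the input is the sample from the question.

```python
import re

# Sample input from the question (hypothetical variable name).
SAMPLE = """\
HEADER foo bar
garbage
SUBHEADER foo foo bar
other garbage
SUBHEADER foo foo bar bar
HEADER foo baz
SUBHEADER foo bar foo foo
SUBHEADER foo foo foo foo
SOMETHING bar bar bar
HEADER baz baz
SUBHEADER baz bar baz foo
"""

# re.DOTALL: '.' also matches newlines.
# re.MULTILINE: '^' and '$' also match at line boundaries inside the string.
LAZY = re.compile(
    r"(^HEADER.*?$).*?(^SUBHEADER.*?$).*?(^SOMETHING.*?$)",
    re.DOTALL | re.MULTILINE,
)

lazy_groups = LAZY.search(SAMPLE).groups()
print("\n".join(lazy_groups))
```

On this input the three groups come out as "HEADER foo bar", "SUBHEADER foo foo bar" and "SOMETHING bar bar bar": the search starts at the leftmost position where the whole pattern can succeed, so the non-greedy operators alone pick the first HEADER and SUBHEADER rather than the nearest ones.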


edit: I've modified the command you laid out in your comment and gotten it to work.

perl -0777 -ne '/.*(^HEADER.*?\n).*(^SUBHEADER.*?\n).*?(^SOMETHING.*?\n)/ms
  and print "$1$2$3*\n"'

(I added the 'm' flag and re-added beginning-of-line anchors for paranoia's sake; you can take them back out if you want.)

The key idea turned out to be placing a greedy match-all pattern at the beginning, which lets the regular expression matcher find HEADER as late as possible. I'd have expected an un-anchored match like this to act as if it had an implicit greedy match at the beginning, but it doesn't: an un-anchored search reports the leftmost position where the pattern can match, so without the explicit greedy prefix the match starts at the first HEADER.
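
The greedy-prefix idea can also be checked outside Perl. This is a rough sketch in Python, assuming re.MULTILINE and re.DOTALL behave like Perl's /m and /s for this pattern; SAMPLE and GREEDY_PREFIX are hypothetical names, and the input is again the sample from the question.

```python
import re

# Sample input from the question; every line ends with a newline.
SAMPLE = """\
HEADER foo bar
garbage
SUBHEADER foo foo bar
other garbage
SUBHEADER foo foo bar bar
HEADER foo baz
SUBHEADER foo bar foo foo
SUBHEADER foo foo foo foo
SOMETHING bar bar bar
HEADER baz baz
SUBHEADER baz bar baz foo
"""

# The leading greedy '.*' swallows as much as possible, so backtracking
# settles on the latest HEADER that still has a SUBHEADER and a SOMETHING
# after it. The second greedy '.*' does the same for SUBHEADER.
GREEDY_PREFIX = re.compile(
    r".*(^HEADER.*?\n).*(^SUBHEADER.*?\n).*?(^SOMETHING.*?\n)",
    re.MULTILINE | re.DOTALL,
)

match = GREEDY_PREFIX.search(SAMPLE)
nearest = "".join(match.groups()) if match else ""
print(nearest, end="")
```

For the sample this prints the three lines asked for in the question: HEADER foo baz, SUBHEADER foo foo foo foo, SOMETHING bar bar bar. Like the Perl one-liner, it reports only one SOMETHING per input, and the heavy backtracking may get slow on large files.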

Licensed under: CC-BY-SA with attribution