Extract the part enclosed by a predefined multiline character sequence

https://stackoverflow.com/questions/11156743

16-06-2021

Question

I hope the AWK gurus can provide a solution to my problem.

I have a file that looks like this:

cat cat cat cat cat cat dog rat ate dog tit 
dog cat dog dog dog rat dog pat ate cat dog

I have to use AWK to extract the text between the first occurring c and a d. Starting from the first c, a count should be kept of the number of c's and d's, such that when the counts match, the part between the first c and the matching d is output to a file, together with the number of the line on which the matching d occurred.

In this particular example the match occurs on the seventh dog, so the output would have to be:

cat cat cat cat cat cat dog rat ate dog tit 
dog cat dog dog dog rat d

The match can span more than two lines! The output may or may not include the c and the d. The text contains all kinds of characters, including special ones. Nothing should be printed unless the counts match.

Thanks in advance for the replies. Suggestions are always welcome.

EDIT: Capturing the text between c and d can be compromised on, as long as the condition is met and the line number of the closing d is obtained :)

Solution

A few tips, without giving the full solution:

By default, awk considers each line as a record. The default record separator is RS="\n".

Depending on your version of awk (GNU awk supports this), you may be able to set RS, the record separator, to a regex which matches either c or d, for example RS="[cd]". Then, for each record, you can check the RT variable (a gawk extension), which will contain either c or d, depending on what was actually matched. From there, with a variable that is incremented on c and decremented on d, you can find the end of the match: it is the point where the variable returns to 0.

You can then use a variable that holds your match so far, and keep appending each new record followed by its RT to it, until you're done.

If you need to know the line number of the end of the match, you can extend the RS regex so that it matches c or d, as before, and also \n. By maintaining another counter variable that is incremented every time RT tells you that \n has been matched, you'll have your line number.

OTHER TIPS

Here's a sed solution just for fun (it needs GNU sed, since it relies on -r and the T command):

sed -rne ':r;$!{N;br};s/^[^c]*(.*d)[^d]*$/\1/;:a;h;s/[^cd]//g;' \
-e ':s;s/d(.*)c/c\1d/;ts;s/cd/c\nd/;T;y/c/d/;/^(d+)\n\1$/{g;i -------' \
-e 'p};g;s/d[^d]*d$/d/;ta'

This prints all satisfying sequences from longest to shortest.

Licensed under: CC-BY-SA with attribution
