This should fit your needs, I think:
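import re
# Raw strings keep the regex escapes (\t, \r, \n) readable and unambiguous.
pat = (r'(?:^(?![\t ]*\r?\n).+\n)*?'
       r'.*nasdaq.*\n'
       r'(?:^(?![\t ]*\r?\n).+\n?)*')
filename = 'to_define.txt'
# Text mode so the str pattern applies (Python 3); newline='' leaves \r\n untouched.
with open(filename, 'r+', newline='') as f:
    content = f.read()
    f.seek(0, 0)
    f.write(re.sub(pat, '', content, flags=re.M))
    f.truncate()  # the new content is shorter, so cut off what is left of the old one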
It works only if the sections really are separated by at least one blank line (it may be a bare '\n' or a line such as ' \t \n' made of spaces and tabs; it doesn't matter).
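To try the pattern without touching a file, here is a minimal sketch that applies it to an in-memory string. The sample text and the helper name drop_nasdaq_sections are made up for the illustration; they are not from the question.

```python
import re

PAT = (r'(?:^(?![\t ]*\r?\n).+\n)*?'
       r'.*nasdaq.*\n'
       r'(?:^(?![\t ]*\r?\n).+\n?)*')

def drop_nasdaq_sections(text):
    """Remove every blank-line-delimited section that has a line containing 'nasdaq'."""
    return re.sub(PAT, '', text, flags=re.M)

# Hypothetical sample: three sections separated by blank lines.
sample = (
    "section one\n"
    "listed on nyse\n"
    "\n"
    "section two\n"
    "listed on nasdaq\n"
    "more details\n"
    "\n"
    "section three\n"
    "listed on lse\n"
)

result = drop_nasdaq_sections(sample)
print(repr(result))
# 'section one\nlisted on nyse\n\n\nsection three\nlisted on lse\n'
```

Note that the two blank lines that surrounded the deleted section both stay in the result.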
'(?:^(?![ \t]*\r?\n).+\n)*?'\
'.*nasdaq.*\n'\
'(?:^(?![ \t]*\r?\n).+\n?)*'
[\t ]
means a single character that is either a tab or a space
[\t ]*
means such a character (tab or space) repeated 0 or more times
(?!
begins a negative lookahead assertion
(?=
begins a positive lookahead assertion
(?![\t ]*\r?\n)
means that the following sequence must not occur right after this position: zero or more spaces or tabs, then an optional \r, then the newline character \n
When I use the word 'position', I mean the location between two characters.
An assertion is evaluated from the position where it is placed.
In the above RE, the negative lookahead assertion is placed after the symbol ^
which denotes the position before the first character of a line.
So the assertion, placed where it is, means: starting from the beginning of a line, there must not be a run of 0 or more tabs/spaces, then an optional \r, then a \n. In other words, the line must not be blank.
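A small sketch to check the assertion on its own, assuming a few made-up sample lines. The name not_blank is only for the illustration.

```python
import re

# Succeeds at the start of a line only when that line is NOT blank.
not_blank = re.compile(r'^(?![\t ]*\r?\n)', re.M)

lines = ["abc\n", "\n", " \t \n", "\r\n", "  x\n"]
flags = [bool(not_blank.match(line)) for line in lines]
print(flags)
# [True, False, False, False, True]
```

The assertion consumes no characters: it only accepts or refuses the position.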
Note that the symbol ^
means "beginning of a line" only if the flag re.MULTILINE
is set; without it, ^ matches only at the very beginning of the string.
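The effect of the flag can be seen with a quick sketch on a made-up three-line string:

```python
import re

text = "first\nsecond\nthird\n"

without_flag = re.findall(r'^\w+', text)
with_flag = re.findall(r'^\w+', text, re.M)

print(without_flag)  # ['first']
print(with_flag)     # ['first', 'second', 'third']
```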
Now the partial RE (?![\t ]*\r?\n)
sits inside the following RE (shown here without the assertion, to keep it readable):
(?:^.+\n)*?
Normally, (...)
defines a capturing group.
The consequence of putting ?:
at the beginning, just inside the parentheses, is that they no longer define a capturing group. But (?:......)
is still useful to group a sub-pattern, for example so that a quantifier applies to all of it.
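To see the difference between the two kinds of parentheses, here is a short sketch on a made-up string; both patterns match the same text, only the captured groups differ.

```python
import re

line = "ab ab ab"

capturing = re.match(r'(ab ?)+', line)
non_capturing = re.match(r'(?:ab ?)+', line)

# Same overall match in both cases.
print(capturing.group(0))      # ab ab ab
print(non_capturing.group(0))  # ab ab ab

# Only the capturing version remembers a group (the last repetition).
print(capturing.groups())      # ('ab',)
print(non_capturing.groups())  # ()
```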
Here .+\n
means one or more characters of any kind (except \n
) followed by a \n
.
And ^.+\n
(with flag re.M
activated) means: starting at the beginning of a line, one or more characters other than a newline, followed by a newline.
Note that, as a dot .
matches any character except \n
, we can be sure that .+
can't match a sequence going beyond the end of the line, which is marked by \n
.
So ^.+\n
in fact describes one whole line!
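A quick sketch of that sub-pattern alone, on a made-up string whose third line is empty and whose last line has no final newline:

```python
import re

text = "one\ntwo words\n\nlast"

whole_lines = re.findall(r'^.+\n', text, flags=re.M)
print(whole_lines)
# ['one\n', 'two words\n']
```

The empty line is skipped because .+ needs at least one character, and "last" is skipped because no \n follows it.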
Now, what do we have?
There's a *
after the non-capturing group. It means that the substrings matching (?:^.+\n)
are repeated 0 or more times: that is to say, we match a succession of lines.
But not just any lines, since there's the negative lookahead assertion, whose meaning you now know.
So, what is matched by the RE (?:^(?![\t ]*\r?\n).+\n)*
is: a succession of lines, none of which is blank. A blank line is either \n
or \t\t\n
or \t \t \n
etc. (I can't show a line containing only spaces on Stack Overflow, but that is a blank line too.)
In the full pattern this RE is followed by a question mark (*?), which makes the repetition lazy: the regex engine, which matches such non-blank lines one after the other, must STOP as soon as what comes next can be matched by the following RE.
And the following RE is .*nasdaq.*\n
which means a line that contains the string 'nasdaq'
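Here is a sketch comparing *? with a plain * in front of the nasdaq line. The capturing parentheses and the sample text are added only so that the result can be inspected; they are not part of the pattern above.

```python
import re

NON_BLANK_LINE = r'(?:^(?![\t ]*\r?\n).+\n)'
NASDAQ_LINE = r'(.*nasdaq.*\n)'

# Hypothetical section with two lines containing 'nasdaq'.
text = "a\nnasdaq 1\nb\nnasdaq 2\n"

lazy = re.search('(' + NON_BLANK_LINE + '*?)' + NASDAQ_LINE, text, re.M)
greedy = re.search('(' + NON_BLANK_LINE + '*)' + NASDAQ_LINE, text, re.M)

# Lazy: stops at the first line that contains 'nasdaq'.
print(repr(lazy.group(1)), repr(lazy.group(2)))      # 'a\n' 'nasdaq 1\n'
# Greedy: swallows as many lines as it can, then backs up to the last one.
print(repr(greedy.group(1)), repr(greedy.group(2)))  # 'a\nnasdaq 1\nb\n' 'nasdaq 2\n'
```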
There are some more subtleties, but I will stop here.
I think the rest will now be easier for you to understand.
EDIT
If a section is the last one in the file and its last line contains nasdaq with no newline after it, that section is not caught and deleted by the above regex.
To correct this, the part .*nasdaq.*\n
must be replaced with .*nasdaq.*(\n|\Z)
in which \Z
means the very end of the string.
I also added a part to the regex to catch the blank lines after each section, so that the file is cleaned of these lines.
pat = (r'(?:^(?![\t ]*\r?\n).+\n)*?'
       r'.*?nasdaq.*(\n|\Z)'
       r'(?:^(?![\t ]*\r?\n).+\n?)*'
       r'(?:[\t ]*\r?\n)*')
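To compare the two versions, here is a sketch on a made-up text whose last section ends with a nasdaq line and no final newline. The names OLD_PAT, NEW_PAT and sample are only for the illustration.

```python
import re

OLD_PAT = (r'(?:^(?![\t ]*\r?\n).+\n)*?'
           r'.*nasdaq.*\n'
           r'(?:^(?![\t ]*\r?\n).+\n?)*')

NEW_PAT = (r'(?:^(?![\t ]*\r?\n).+\n)*?'
           r'.*?nasdaq.*(\n|\Z)'
           r'(?:^(?![\t ]*\r?\n).+\n?)*'
           r'(?:[\t ]*\r?\n)*')

# Hypothetical sample: the last line has 'nasdaq' and no trailing newline.
sample = (
    "alpha\n"
    "listed on nasdaq\n"
    "\n"
    "\n"
    "beta\n"
    "listed on nyse\n"
    "\n"
    "gamma\n"
    "listed on nasdaq"
)

old_result = re.sub(OLD_PAT, '', sample, flags=re.M)
new_result = re.sub(NEW_PAT, '', sample, flags=re.M)

print(repr(old_result))
# '\n\nbeta\nlisted on nyse\n\ngamma\nlisted on nasdaq'
print(repr(new_result))
# 'beta\nlisted on nyse\n\n'
```

The first pattern leaves the final section and the blank lines in place; the second removes both.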