Question

I have a fairly specific problem. I am trying to delete certain lines from a server configuration file based on a keyword search. In the configuration shown below, I want to delete the block whose directory line contains the keyword "nasdaq". That means everything from that block's "database" line down to its last line, "index termName pres,eq".

What is the best way to go about this? String.find()? What should I use to delete the lines above and below the keyword line?

Alternatively, instead of deleting lines in place, I could write to a new file and skip the last block. Any guidance is appreciated!

include         /home/tuatara/TuataraServer-2.0/etc/openldap/schema/core.schema
include         /home/tuatara/TuataraServer-2.0/etc/openldap/schema/cosine.schema
include         /home/tuatara/TuataraServer-2.0/etc/openldap/schema/inetorgperson.schema
include         /home/tuatara/TuataraServer-2.0/etc/openldap/schema/tuatara.schema
pidfile         /home/tuatara/TuataraServer-2.0/var/slapd.pid
argsfile        /home/tuatara/TuataraServer-2.0/var/slapd.args

database        ldbm
loglevel        0
directory       /home/tuatara/TuataraServer/var/openldap-ldbm-CMDB-spellchecker-20130106-06_20_31_PM
suffix          "o=CMDB-spellchecker"
suffix          "dc=CMDB-spellchecker,dc=com"
rootdn          "cn=admin,o=CMDB-spellchecker"
rootpw          tuatara
schemacheck     on
lastmod         off
sizelimit       100000
defaultaccess   read
dbnolocking
dbnosync
cachesize       100000
dbcachesize     1000000
dbcacheNoWsync
index           objectclass pres,eq
index           default pres,eq
index           termName pres,eq

database        ldbm
loglevel        0
directory       /home/tuatara/TuataraServer/var/openldap-ldbm-CMDB-spellchecker.medicinenet-20130106-06_20_31_PM
suffix          "o=CMDB-spellchecker.medicinenet"
suffix          "dc=CMDB-spellchecker.medicinenet,dc=com"
rootdn          "cn=admin,o=CMDB-spellchecker.medicinenet"
rootpw          tuatara
schemacheck     on
lastmod         off
sizelimit       100000
defaultaccess   read
dbnolocking
dbnosync
cachesize       100000
dbcachesize     1000000
dbcacheNoWsync
index           objectclass pres,eq
index           default pres,eq
index           termName pres,eq

database        ldbm
loglevel        0
directory       /home/tuatara/TuataraServer/var/openldap-ldbm-CMDB-nasdaq-20131127-12_37_43_PM
suffix          "o=CMDB-nasdaq"
suffix          "dc=CMDB-nasdaq,dc=com"
rootdn          "cn=admin,o=CMDB-nasdaq"
rootpw          tuatara
schemacheck     on
lastmod         off
sizelimit       100000
defaultaccess   read
dbnolocking
dbnosync
cachesize       100000
dbcachesize     100000000
dbcacheNoWsync
index           objectclass pres,eq
index           default pres,eq
index           termName pres,eq

Solution 2

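I think this should fit your needs:

import re

# Raw strings keep the backslashes intact for the regex engine.
pat = (r'(?:^(?![\t ]*\r?\n).+\n)*?'   # non-blank lines before the keyword line
       r'.*nasdaq.*\n'                   # the line containing the keyword
       r'(?:^(?![\t ]*\r?\n).+\n?)*')    # non-blank lines after it

filename = 'to_define.txt'

# Text mode, so the str pattern works on Python 3 (bytes read with 'rb+' would
# raise a TypeError); newline='' leaves the file's line endings untouched.
with open(filename, 'r+', newline='') as f:
    content = f.read()
    f.seek(0)
    f.write(re.sub(pat, '', content, flags=re.M))
    f.truncate()  # cut off what is left of the old, longer content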

It works only if the sections are separated by at least one blank line (either an empty line '\n' or a line containing only spaces and tabs, such as ' \t \n'; it doesn't matter which).
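
To see the blank-line requirement in action, here is a small sketch. The miniature config in sample is made up for illustration (it is not the real file), and the pattern is the one from this answer written with raw strings.

```python
import re

# Same pattern as in the answer, written with raw strings.
PAT = (r'(?:^(?![\t ]*\r?\n).+\n)*?'
       r'.*nasdaq.*\n'
       r'(?:^(?![\t ]*\r?\n).+\n?)*')

# Hypothetical miniature config: three sections separated by blank lines.
sample = (
    "include core.schema\n"
    "\n"
    "database ldbm\n"
    "directory /var/spellchecker\n"
    "index default pres,eq\n"
    "\n"
    "database ldbm\n"
    "directory /var/nasdaq\n"
    "index default pres,eq\n"
)

# With blank separators, only the section mentioning nasdaq is removed.
cleaned = re.sub(PAT, '', sample, flags=re.M)
print(repr(cleaned))

# Without blank separators the whole text counts as one section,
# so everything is removed.
joined = sample.replace("\n\n", "\n")
wiped = re.sub(PAT, '', joined, flags=re.M)
print(repr(wiped))
```

The second call shows why the separator matters: nothing stops the leading part of the pattern from reaching back to the first line of the file.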

'(?:^(?![ \t]*\r?\n).+\n)*?'\
'.*nasdaq.*\n'\
'(?:^(?![ \t]*\r?\n).+\n?)*'

[\t ] matches a single character that is either a tab or a space.
[\t ]* matches zero or more such characters in a row.
(?! begins a negative lookahead assertion.
(?= begins a positive lookahead assertion.
(?![\t ]*\r?\n) means that the following sequence must not appear right after this position: zero or more spaces or tabs, an optional \r, and a newline \n.
By 'position' I mean the location between two characters.
An assertion consumes no characters; it only tests what follows the position where it is placed.
In the RE above, the negative lookahead assertion is placed after the symbol ^, which denotes the position before the first character of a line.
So the assertion, as placed, means: starting from the beginning of a line, there must not be a sequence of zero or more tabs or spaces, an optional \r, and then \n. In other words, the line must not be blank.
Note that the symbol ^ means "beginning of a line" only if the flag re.MULTILINE is set; otherwise it matches only at the beginning of the string.
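
As a quick illustration of the lookahead and of the re.MULTILINE remark, here is a minimal sketch. The text string is invented for the example.

```python
import re

# Made-up text: two non-blank lines, a whitespace-only line,
# an empty line, and a final non-blank line.
text = "first\n \t \nsecond\n\nthird\n"

line_re = r'^(?![\t ]*\r?\n).+\n'

# With re.M, ^ matches at the start of every line, and the negative
# lookahead rejects lines made only of spaces/tabs before the newline.
non_blank = re.findall(line_re, text, flags=re.M)
print(non_blank)

# Without re.M, ^ matches only at the very start of the string,
# so at most the first line can be found.
first_only = re.findall(line_re, text)
print(first_only)
```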

Now the partial RE (?![\t ]*\r?\n) sits inside the following RE:
(?:^.+\n)*?
Normally, (...) defines a capturing group.
Putting ?: right after the opening parenthesis means the parentheses no longer define a capturing group. But (?:...) is still useful for grouping part of an RE, for example so that a quantifier applies to the whole group.
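
The difference between the two kinds of parentheses is easy to see with re.findall, which returns the group contents when the pattern has a capturing group and the whole match otherwise. A small sketch with a made-up string:

```python
import re

line = "ab ab ab"

# Capturing group: findall returns what the group captured,
# i.e. only its last repetition.
capturing = re.findall(r'(ab ?)+', line)

# Non-capturing group: the parentheses only group the sub-pattern for
# the + quantifier, so findall returns the whole match.
non_capturing = re.findall(r'(?:ab ?)+', line)

print(capturing, non_capturing)
```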

Here .+\n means a run of one or more characters (any character except \n) followed by a \n.

And ^.+\n (with the flag re.M set) means: starting at the beginning of a line, one or more characters other than a newline, then a newline.
Note that, since a dot . matches any character except \n, we can be sure that .+ cannot match beyond the end of the line, which is marked by \n.
So ^.+\n in fact matches one whole line!

So what do we have now?
There's a * after the non-capturing group. It means that the substrings matching (?:^.+\n) are repeated 0 or more times; that is, we match a succession of lines.

But not just any lines, since there's the negative lookahead assertion, whose meaning you now know.
So what the RE (?:^(?![\t ]*\r?\n).+\n)* matches is a succession of lines none of which is blank. A blank line is either \n, or \t\t\n, or \t \t \n, etc. (I can't show a line containing only spaces on Stack Overflow, but that is a blank line too.)
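
Here is a small sketch of that "succession of non-blank lines" idea. It uses + instead of * (my change, only so that findall does not also report empty matches), and the text is invented for the example.

```python
import re

# Made-up text with three runs of non-blank lines, separated by an
# empty line and by a line holding only a space and a tab.
text = "a1\na2\n\nb1\n \t\nc1\nc2\nc3\n"

# One or more consecutive non-blank lines form one block.
blocks = re.findall(r'(?:^(?![\t ]*\r?\n).+\n)+', text, flags=re.M)

for block in blocks:
    print(repr(block))
```

Each element of blocks is one section, which is exactly the unit the full pattern removes when it contains the keyword.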

The question mark at the end of this RE makes the * quantifier lazy (non-greedy): the regex engine, which matches such non-blank lines one after the other, must STOP as soon as the RE that follows can match.
And the RE that follows is .*nasdaq.*\n, which means a line containing the word 'nasdaq'.
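
The effect of the lazy quantifier can be checked with a simplified pattern (no lookahead, and a capturing group added around the keyword line purely so we can inspect it). The text is made up and contains two keyword lines.

```python
import re

text = "a\nnasdaq 1\nb\nnasdaq 2\nc\n"

# Lazy *?: take as few leading lines as possible, so the keyword part
# matches the FIRST line containing nasdaq.
lazy = re.match(r'(?:^.+\n)*?(.*nasdaq.*\n)', text, flags=re.M)

# Greedy *: take as many leading lines as possible and backtrack only
# as far as needed, so the keyword part matches the LAST such line.
greedy = re.match(r'(?:^.+\n)*(.*nasdaq.*\n)', text, flags=re.M)

print(repr(lazy.group(1)), repr(greedy.group(1)))
```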

There are some more subtleties, but I will stop here.
I think the rest will now be easier for you to understand.

EDIT

If the section is the last one and its last line contains nasdaq, the above regex will not catch and delete it when that line has no trailing newline.
To correct this, the part .*nasdaq.*\n must be replaced with .*nasdaq.*(\n|\Z), in which \Z means the very end of the string.

I also added a part to the regex to catch the blank lines after each deleted section, so the file is cleaned of these lines.

# Raw strings: '\Z' is not a valid Python string escape.
pat = (r'(?:^(?![\t ]*\r?\n).+\n)*?'
       r'.*?nasdaq.*(\n|\Z)'
       r'(?:^(?![\t ]*\r?\n).+\n?)*'
       r'(?:[\t ]*\r?\n)*')
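
A short sketch comparing the two versions of the pattern. The sample text is invented: its last section ends with a keyword line that has no trailing newline, which is the case this EDIT addresses.

```python
import re

OLD = (r'(?:^(?![\t ]*\r?\n).+\n)*?'
       r'.*nasdaq.*\n'
       r'(?:^(?![\t ]*\r?\n).+\n?)*')

NEW = (r'(?:^(?![\t ]*\r?\n).+\n)*?'
       r'.*?nasdaq.*(\n|\Z)'
       r'(?:^(?![\t ]*\r?\n).+\n?)*'
       r'(?:[\t ]*\r?\n)*')

# Hypothetical input: note there is no newline after the final line.
sample = ("database ldbm\n"
          "directory /var/nasdaq-a\n"
          "index default pres,eq\n"
          "\n"
          "\n"
          "database ldbm\n"
          "directory /var/other\n"
          "\n"
          "database ldbm\n"
          "suffix o=nasdaq")

old_result = re.sub(OLD, '', sample, flags=re.M)
new_result = re.sub(NEW, '', sample, flags=re.M)

print(repr(old_result))  # the last section survives
print(repr(new_result))  # both nasdaq sections are gone
```

Blank lines are swallowed only after a removed section, so the separator that follows the surviving section stays in place.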

OTHER TIPS

As was already mentioned, sed is built for this kind of stuff, but you could do it in Python with something like this:

with open('nasdaq.txt') as fin, open('nonasdaq.txt', 'w') as fout:
    for line in fin:
        if 'nasdaq' not in line:
            fout.write(line)

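All it does is loop over the lines of the input file and copy them to the output file if they don't contain the string 'nasdaq'. Note that this drops only the individual lines containing 'nasdaq'; the other lines of that block are still copied.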

with open('nasdaq.txt', 'r') as f:
    text = f.read().splitlines()

# skip the header: 6 include/pidfile/argsfile lines plus 1 blank line
# in the sample as shown above (adjust if your file differs)
text = text[7:]
n = 20  # each section is this long: 19 lines plus the blank separator

# split the remaining lines into fixed-size chunks (a list of lists)
groups = [text[i:i + n] for i in range(0, len(text), n)]

# keep only the chunks that do not mention the keyword
# (popping from a list while enumerating it would skip elements)
groups = [g for g in groups if not any('nasdaq' in x for x in g)]
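
The fixed chunk size above depends on every section having the same number of lines. As an alternative, here is a rough sketch that splits on blank lines instead, so sections may have any length. The function name drop_sections and the sample lines are hypothetical, chosen only for illustration.

```python
def drop_sections(lines, keyword):
    """Return the lines, minus every blank-line-separated block
    in which at least one line contains keyword."""
    kept, block = [], []
    for line in lines:
        if line.strip():
            # Non-blank line: it belongs to the current block.
            block.append(line)
            continue
        # Blank line: the current block is complete.
        if not any(keyword in item for item in block):
            kept.extend(block)
        kept.append(line)  # blank separators are always kept
        block = []
    # Handle the last block (the input may not end with a blank line).
    if not any(keyword in item for item in block):
        kept.extend(block)
    return kept


# Made-up input standing in for the lines of the config file.
lines = [
    "include x\n",
    "\n",
    "database ldbm\n",
    "directory /a\n",
    "\n",
    "database ldbm\n",
    "directory /nasdaq\n",
    "index default pres,eq\n",
]
result = drop_sections(lines, "nasdaq")
print("".join(result))
```

With a real file, lines could come from f.readlines() and the result could be written to a new file with writelines(), which matches the "write to a new file" option mentioned in the question.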
Licensed under: CC-BY-SA with attribution