Question

I have a very large file that contains incorrect information.

  • this one
  • is the
  • xxx 123gt few 1121
  • 12345 fre 233fre
  • problematic file.
  • It contains
  • xxx hy 456 efe
  • rtg 1215687 fwe
  • many errors
  • That I'd like
  • toget rid of

I wrote a script. Whenever xxx is encountered:

  1. The line is replaced with a custom string (something).
  2. The very next line is replaced with another custom string (stg).

Here is the script:

subject='problematic.txt'
pattern='xxx'
subject2='resolved.txt'
output = open(subject2, 'w')
line1='something'
line2='stg'


with open(subject) as myFile:
    for num, line in enumerate(myFile, 1): #to get the line number
        if pattern in line:
            print 'found at line:', num
            line = line1 #replace the line containing xxx with 'something'
            output.write(line)
            line = next(myFile, "") # move to the next line
            line = line2 #replace the next line with 'stg'
            output.write(line)
        else:
            output.write(line) # save as is
output.close()
myFile.close()

It works well for the first xxx occurrence, but not for the subsequent ones. The reason is that next() advances the iteration, so my script makes changes in the wrong places.

Here is the output:

found at line: 3

found at line: 6

instead of:

found at line: 3

found at line: 7

Consequently, the changes are not made in the right place. Ideally, undoing next() after I have replaced the line with line2 would solve my problem, but I couldn't find a previous() function. Anyone? Thanks!

Solution 2

When you think you need to look ahead, it is almost always simpler to restate the problem in terms of looking back. In this case, just keep track of the previous line and look at that to see if it matches your target string.

infilename  = "problematic.txt"
outfilename = "resolved.txt"

pattern  = "xxx"
replace1 = "something"
replace2 = "stg"

with open(infilename) as infile:
    with open(outfilename, "w") as outfile:

        previous = ""

        for linenum, current in enumerate(infile):
            if pattern in previous:
                print "found at line", linenum
                previous, current = replace1, replace2
            if linenum:           # skip the first (blank) previous line
                outfile.write(previous)
            previous = current

        if pattern in previous:    # the last line matched and nothing follows it
            previous = replace1
        outfile.write(previous)    # write the final line
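
For experimenting without touching any files, the same look-back idea can be sketched as a generator over any iterable of lines. This is a Python 3 sketch rather than the code above: the name replace_pairs and its parameters are made up for illustration, and it assumes the replacement strings carry their own trailing newline.

```python
def replace_pairs(lines, pattern, first, second):
    """Yield lines, replacing each matching line with `first`
    and the line right after it with `second` (hypothetical helper)."""
    previous = None                      # original text of the line before `current`
    for current in lines:
        if previous is not None and pattern in previous:
            yield second                 # `current` follows a match: replace it
            previous = None              # a replaced line never triggers another replacement
        elif pattern in current:
            yield first                  # `current` itself matches
            previous = current
        else:
            yield current                # keep the line as is
            previous = current


sample = ["this one\n", "xxx 123gt few 1121\n", "12345 fre 233fre\n", "problematic file.\n"]
print("".join(replace_pairs(sample, "xxx", "something\n", "stg\n")), end="")
```

Because the decision is made by looking back, a match on the very last line is still replaced, and nothing is emitted for the missing line after it.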

OTHER TIPS

Your current code almost works. I believe it correctly identifies and replaces the right lines of your input file, but it reports incorrect line numbers for the matches, because the enumerate generator never sees the lines skipped by next().
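
The drift in the reported numbers can be reproduced on the sample from the question without any file. The following is a small Python 3 sketch; the two function names are hypothetical and exist only to contrast calling next() on the underlying iterator with calling it on the enumerate object itself.

```python
def numbers_with_plain_next(lines, pattern):
    """Skip via the underlying iterator: enumerate's counter falls behind."""
    it = iter(lines)
    found = []
    for num, line in enumerate(it, 1):
        if pattern in line:
            found.append(num)
            next(it, "")        # enumerate does not count this skipped line
    return found


def numbers_with_shared_enumerate(lines, pattern):
    """Skip via the enumerate object: the counter stays in step."""
    gen = enumerate(lines, 1)
    found = []
    for num, line in gen:
        if pattern in line:
            found.append(num)
            next(gen, None)     # the skipped line is counted too
    return found


sample = [
    "this one", "is the", "xxx 123gt few 1121", "12345 fre 233fre",
    "problematic file.", "It contains", "xxx hy 456 efe", "rtg 1215687 fwe",
    "many errors", "That I'd like", "toget rid of",
]
print(numbers_with_plain_next(sample, "xxx"))        # [3, 6]
print(numbers_with_shared_enumerate(sample, "xxx"))  # [3, 7]
```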

Though you could rewrite it in various ways as the other answers suggest, you don't need to make major changes (unless you want to, for other design reasons). Here's the code with the minimal changes needed pointed out by new comments:

with open(subject) as myFile:
    gen = enumerate(myFile, 1)  # save the enumerate generator to a variable
    for num, line in gen:       # iterate over it, as before
        if pattern in line:
            print 'found at line:', num
            line = line1
            output.write(line)
            next(gen, None)     # advance the generator and throw away the results
            line = line2
            output.write(line)
        else:
            output.write(line)

This seems to work with the string to be replaced appearing both at odd and even line numbers:

with open('test.txt', 'r') as f:
    for line in f:
        line = line.strip()
        if line == 'apples':          # line to be replaced
            print('manzanas')         # replacement 1
            print('y más manzanas')   # replacement 2
            next(f, None)             # skip the following line; None avoids StopIteration at end of file
            continue
        print(line)

Sample input:

apples
pears
apples
pears
pears
apples
pears
pears

Sample output:

manzanas
y más manzanas
manzanas
y más manzanas
pears
manzanas
y más manzanas
pears

There is no previous function because that's not how the iterator protocol works. Especially with generators, the concept of a "previous" element may not even exist.

Instead you want to iterate over your file with two cursors, zipping them together:

from itertools import tee

with open(subject) as f:
    its = tee(f)
    next(its[1], None)  # advance the second iterator past the first line
    for first, second in zip(*its):  # in Python 2, use itertools.izip
        pass  # do something with first and/or second, comparing them appropriately

The above is just like doing for line in f:, except you now have your first line in first and the line immediately after it in second.
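
To see which pairs this produces, here is a minimal Python 3 sketch on an in-memory list. The helper name line_pairs is an assumption made for illustration; it is not part of itertools.

```python
from itertools import tee


def line_pairs(lines):
    """Return an iterator of (line, following_line) tuples (hypothetical helper)."""
    first, second = tee(lines)
    next(second, None)          # move the second cursor one line ahead
    return zip(first, second)   # stops when the second cursor runs out


for current, following in line_pairs(["a\n", "xxx\n", "b\n"]):
    print(repr(current), repr(following))
```

Note that the last line never shows up as the first element of a pair, so it needs separate handling if it has to be written out. On Python 3.10 and later, itertools.pairwise gives the same pairing.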

I would just set a flag to indicate that you want to skip the next line, and check for that in the loop instead of using next:

with open(foo) as myFile: 
  skip = False
  for line in myFile:
    if skip:
      skip = False
      continue
    if pattern in line:
      output.write("something")
      output.write("stg")
      skip = True
    else:
      output.write(line)        

You need to buffer the lines in some way. This is easy to do for a single line:

class Lines(object):

    def __init__(self, f):
        self.f = f        # file object
        self.prev = None  # previous line

    def next(self):
        if not self.prev:
            try:
                self.prev = next(self.f)
            except StopIteration:
                return
        return self.prev

    def consume(self):
        if self.prev is not None:
            self.prev = next(self.f, None)  # None once the file is exhausted

Now you need to call Lines.next() to fetch the next line, and Lines.consume() to consume it. A line is kept buffered until it is consumed:

>>> f = open("table.py")
>>> lines = Lines(f)
>>> lines.next()
'import itertools\n'
>>> lines.next()      # same line
'import itertools\n'
>>> lines.consume()   # remove the current buffered line
>>> lines.next()
'\n'                  # next line
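
The same buffering idea can be tried on an in-memory list. The following is a Python 3 sketch under assumptions: the class name PeekableLines and the method name peek are made up here, and a sentinel object is used so that "nothing buffered" and "end of input" stay distinguishable.

```python
class PeekableLines:
    """One-line look-ahead buffer over any iterator of lines (hypothetical helper)."""

    _EMPTY = object()                      # sentinel: nothing is buffered yet

    def __init__(self, lines):
        self._lines = iter(lines)
        self._buffered = self._EMPTY

    def peek(self):
        """Return the next line without consuming it, or None at end of input."""
        if self._buffered is self._EMPTY:
            self._buffered = next(self._lines, None)
        return self._buffered

    def consume(self):
        """Return the buffered line and drop it (None at end of input)."""
        line = self.peek()
        if line is not None:
            self._buffered = self._EMPTY   # the next peek() fetches a fresh line
        return line


lines = PeekableLines(["import itertools\n", "\n"])
print(repr(lines.peek()))     # 'import itertools\n'
print(repr(lines.peek()))     # same line again
lines.consume()
print(repr(lines.peek()))     # '\n'
```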

You can zip the lines this way to get both pointers at once:

with open(subject) as myFile:
    lines = myFile.readlines()
    for current, following in zip(lines, lines[1:]):  # avoid shadowing the built-in next
        ...

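Edit: this is just to demonstrate the idea of zipping the lines. For big files, avoid readlines() and iterate lazily. Note that it1 = myFile would only create a second name for the same iterator, so zipping the two would yield non-overlapping pairs; use itertools.tee to get two independent cursors:

from itertools import tee

with open(subject) as myFile:
    current_lines, following_lines = tee(myFile)
    next(following_lines, None)  # start the second cursor one line ahead
    for current, following in zip(current_lines, following_lines):  # itertools.izip in Python 2
        ...

Note that a file is already iterable, so there is no need to wrap it before passing it to tee.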

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow