Question

I need some help with looping, or a better way to go about this. The answer may be obvious, but I'm new here and have a mental block right now. I have a log file (sample below) and I am trying to match all lines with the same ID so I can later compare the values of the matched IDs. I am able to match the first lines, but then my loop seems to terminate. I am not sure what I'm doing wrong, or if there is a better approach altogether. Any help is much appreciated!

Some notes:

  • when I split the lines, the XYZ ID column is indexed at line[2], where len(line) == 11.
  • I am trying to loop through the file and for each line, create an inner loop which scans the remaining lines of the file to find a 'match'.
  • If a match is found, I want to return this so I can compare values
  • The trouble is my code seems to break after the first match is found, thus returning only the first match found

Below is my code and a sample of the log file that I'm working with (it includes some edited strings just to keep some business data private). The actual log file includes commas, which were removed before I pasted it into this forum:

f = open('t.log','r')
for line in f:
    aline = line.replace(',','').split()
    if len(aline)==11:
        for line in f:
            bline = line.replace(',','').split()
            if len(bline)==11 and aline[2]==bline[2]:
                print 'a: ', aline
                print 'b: ', bline

#t.log

[13:40:19.xxx009] status    -------             
[13:40:19.xxx013] status    XYZ -4  -5675.36     quote  449.70/- 449.78 avg 1418.84 -7474.48       0.134     -55.630    -395.148    
[13:40:19.xxx021] status    XYZ  ID:22P00935xxx -4  3.92     quote:    0.98/   1.02  avg:   -0.98   -0.16
[13:40:19.xxx024] status    XYZ  ID:22C0099xxx0 -2  26.4     quote:   11.60/  11.85  avg:  -13.20    2.70
[13:40:19.xxx027] status    XYZ  ID:22P0099xxx0 10  -17.18   quote:    1.86/   1.90  avg:   -1.72    1.42
[13:40:19.xxx029] status    XYZ  ID:22C00995xxx 4   -42.5    quote:    8.20/   8.30  avg:  -10.62   -9.70
[13:40:19.xxx031] status    XYZ  ID:22P00995xxx 2   9.66     quote:    3.30/   3.40  avg:    4.83   16.26
[13:40:19.xxx535] status    total xx5.52                

[13:41:20.xxx688] status    -------             
[13:41:20.xxx691] status    XYZ -4  -5675.36     quote  449.83/- 449.99 avg 1418.84 -7475.32      -0.374    -213.006     -39.391    
[13:41:20.xxx701] status    XYZ  ID:22P00935xxx -4  3.92     quote:    0.96/   1.00  avg:   -0.98   -0.08
[13:41:20.xxx704] status    XYZ  ID:22C0099xxx0 -2  26.4     quote:   11.65/  11.90  avg:  -13.20    2.60
[13:41:20.xxx708] status    XYZ  ID:22P0099xxx0 10  -17.18   quote:    1.83/   1.87  avg:   -1.72    1.12
[13:41:20.xxx712] status    XYZ  ID:22C00995xxx 4   -42.5    quote:    8.20/   8.30  avg:  -10.62   -9.70
[13:41:20.xxx716] status    XYZ  ID:22P00995xxx 2   9.66     quote:    3.30/   3.35  avg:    4.83   16.26
[13:41:20.xxx718] status    XYZ  ID:22C0095xxx0 -10 35.6     quote:    5.40/   5.50  avg:   -3.56  -19.40
[13:41:20.001362] status    total xx6.68    

Result:    
$ python pnlcomp.py
    a:  ['[13:40:19.000021]', 'statusAAPL', '130322P00435000', '-4', '3.92', 'quote:', '0.98/', '1.02', 'avg:', '-0.98', '-0.16']
    b:  ['[13:41:20.000701]', 'statusAAPL', '130322P00435000', '-4', '3.92', 'quote:', '0.96/', '1.00', 'avg:', '-0.98', '-0.08']

Solution

You should probably use regular expressions (also called regex) for that. Python has the re module, which implements regex for Python.

See this as an example for the direction to look at: stackoverflow question finding multiple matches in a string.

Excerpt from the above: Logfile looks like:

[1242248375] SERVICE ALERT: myhostname.com;DNS: Recursive;CRITICAL

regex looks like:

regexp = re.compile(r'\[(\d+)\] SERVICE NOTIFICATION: (.+)')

which goes like this:

  • r => raw string (always recommended for regexes)
  • \[ => matches a literal opening square bracket (which would otherwise be a special character)
  • (\d+) => matches one or more digits: \d is a digit and + means 1 or more
  • \] => followed by a literal closing square bracket
  • SERVICE NOTIFICATION: => matches exactly these characters in sequence.
  • (.+) => the . (dot) matches any character, and again the + means 1 or more

Parentheses group the results.
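
As a minimal sketch of how the pieces listed above behave, the snippet below applies the excerpt's pattern to a line. The SERVICE NOTIFICATION line is invented for this illustration (the sample line in the excerpt says SERVICE ALERT, which this pattern does not match). It uses the print() function, so it is written for Python 3:

```python
import re

# The pattern from the excerpt above.
regexp = re.compile(r'\[(\d+)\] SERVICE NOTIFICATION: (.+)')

# Hypothetical line, made up for this illustration.
line = '[1242248375] SERVICE NOTIFICATION: myhostname.com;DNS: Recursive;CRITICAL'

m = regexp.match(line)
# groups() returns one string per pair of parentheses in the pattern.
timestamp, message = m.groups()
print(timestamp)
print(message)

# The excerpt's sample line says SERVICE ALERT, so match() returns None for it.
alert = '[1242248375] SERVICE ALERT: myhostname.com;DNS: Recursive;CRITICAL'
no_match = regexp.match(alert)
print(no_match)
```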

I made a short regex to start with your logfile format. Assuming your log from above is saved as log.txt.

import re
regexp = re.compile(r'\[(\d{2}:\d{2}:\d{2}\.xxx\d{3})\][\s]+status[\s]+XYZ[\s]+ID:([0-9A-Zx]+)(.+)')

f = open("log.txt", "r")
for line in f.readlines():
    print line
    m = re.match(regexp, line)
    #print m
    if m:
        print m.groups()

Regexes are not that easy looking or straightforward at first glance but if you search for regex or re AND python you will find helpful examples.

This outputs the following for me:

[13:40:19.xxx021] status    XYZ  ID:22P00935xxx -4  3.92     quote:    0.98/   1.02  avg:   -0.98   -0.16

('13:40:19.xxx021', '22P00935xxx', ' -4  3.92     quote:    0.98/   1.02  avg:   -0.98   -0.16')
[13:40:19.xxx024] status    XYZ  ID:22C0099xxx0 -2  26.4     quote:   11.60/  11.85  avg:  -13.20    2.70

('13:40:19.xxx024', '22C0099xxx0', ' -2  26.4     quote:   11.60/  11.85  avg:  -13.20    2.70')
[13:40:19.xxx027] status    XYZ  ID:22P0099xxx0 10  -17.18   quote:    1.86/   1.90  avg:   -1.72    1.42

('13:40:19.xxx027', '22P0099xxx0', ' 10  -17.18   quote:    1.86/   1.90  avg:   -1.72    1.42')
[13:40:19.xxx029] status    XYZ  ID:22C00995xxx 4   -42.5    quote:    8.20/   8.30  avg:  -10.62   -9.70

('13:40:19.xxx029', '22C00995xxx', ' 4   -42.5    quote:    8.20/   8.30  avg:  -10.62   -9.70')
[13:40:19.xxx031] status    XYZ  ID:22P00995xxx 2   9.66     quote:    3.30/   3.40  avg:    4.83   16.26
('13:40:19.xxx031', '22P00995xxx', ' 2   9.66     quote:    3.30/   3.40  avg:    4.83   16.26')

Every second line is the output of m.groups(), which is a tuple containing the matched groups.

If you add this inside the if m: block of the program above:

print "ID is : ", m.groups()[1]

the output is:

[13:40:19.xxx021] status    XYZ  ID:22P00935xxx -4  3.92     quote:    0.98/   1.02  avg:   -0.98   -0.16

ID is :  22P00935xxx

[13:40:19.xxx024] status    XYZ  ID:22C0099xxx0 -2  26.4     quote:   11.60/  11.85  avg:  -13.20    2.70

ID is :  22C0099xxx0

This gives you the IDs you want to compare. Just play with it a little to get the result you really want.

The final example captures the ID, tests whether it is already there, and adds the matched lines to a dictionary which has the IDs as its keys:

import re
regexp = re.compile(r'\[(\d{2}:\d{2}:\d{2}\.xxx\d{3})\][\s]+status[\s]+XYZ[\s]+ID:([0-9A-Zx]+)(.+)')

res = {}

f = open("log.txt", "r")
for line in f.readlines():
    print line
    m = re.match(regexp, line)  
    if m:
        print m.groups()
        id = m.groups()[1]
        if id in res:
            #print "added to existing ID"
            res[id].append([m.groups()[0], m.groups()[2]])
        else:
            #print "new ID"
            # start a list of entries, so append() above keeps the same shape
            res[id] = [[m.groups()[0], m.groups()[2]]]

for id in res:
    print "ID: ", id
    print res[id]

Now you can play around and fine tune it to adapt it to your needs.
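
As one possible way to take this further, here is a sketch that groups the entries per ID and then lists which fields changed between consecutive entries. It is only an illustration under some assumptions: the helper name group_by_id is made up, two lines from the sample log are used in place of reading log.txt, and it uses Python 3 syntax (print() as a function), unlike the snippets above.

```python
import re
from collections import defaultdict

# Same pattern as above; the "xxx" pieces mirror the redacted sample log.
regexp = re.compile(
    r'\[(\d{2}:\d{2}:\d{2}\.xxx\d{3})\]\s+status\s+XYZ\s+ID:([0-9A-Zx]+)(.+)')

# Two lines copied from the sample log, used here instead of reading log.txt.
sample = [
    '[13:40:19.xxx021] status    XYZ  ID:22P00935xxx -4  3.92     quote:    0.98/   1.02  avg:   -0.98   -0.16',
    '[13:41:20.xxx701] status    XYZ  ID:22P00935xxx -4  3.92     quote:    0.96/   1.00  avg:   -0.98   -0.08',
]

def group_by_id(lines):
    """Return {id: [(timestamp, fields), ...]} for the lines that match."""
    res = defaultdict(list)  # missing keys start out as an empty list
    for line in lines:
        m = regexp.match(line)
        if m:
            timestamp, id_, rest = m.groups()
            res[id_].append((timestamp, rest.split()))
    return res

grouped = group_by_id(sample)
for id_, entries in grouped.items():
    # Compare each entry with the previous entry for the same ID.
    for (t1, f1), (t2, f2) in zip(entries, entries[1:]):
        changed = [(a, b) for a, b in zip(f1, f2) if a != b]
        print(id_, t1, '->', t2, changed)
```

Using defaultdict(list) removes the need for the if/else on whether the ID is already a key.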

Other tips

You could use the filter function to get any line with "ID" in it.

file = open('t.log', 'r')
result = filter(lambda s: "ID" in s, file)

You could also use a list comprehension:

file = open('t.log', 'r')
result = [s for s in file if 'ID' in s]
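
A small sketch of both variants on three lines taken from the sample log (held in a list here rather than read from t.log), followed by one way to pull the ID token out of each kept line. Note that in Python 3 filter returns a lazy iterator, so it is wrapped in list() here; the index 3 assumes the whitespace-split layout of the sample as posted.

```python
# Three lines from the sample log; only the second one carries an ID.
lines = [
    '[13:40:19.xxx013] status    XYZ -4  -5675.36     quote  449.70/- 449.78 avg 1418.84 -7474.48',
    '[13:40:19.xxx021] status    XYZ  ID:22P00935xxx -4  3.92     quote:    0.98/   1.02  avg:   -0.98   -0.16',
    '[13:40:19.xxx535] status    total xx5.52',
]

# filter() version; list() forces the lazy iterator in Python 3.
filtered = list(filter(lambda s: 'ID:' in s, lines))

# List comprehension version; keeps the same lines.
with_id = [s for s in lines if 'ID:' in s]

# After split(), the ID token is the fourth field of the sample lines.
ids = [s.split()[3] for s in with_id]
print(ids)
```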

This probably isn't the best way to solve your problem, but if you want to know how to make it work:

The problem here is that your inner for line in f: loop consumes the whole rest of the file—so when you get back to the outer loop, there's nothing left to read. (There's a second problem: When I run your code on your data, len(aline) is always 12, not 11. But that's a trivial fix.)
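
To see the effect in isolation, here is a minimal sketch using io.StringIO as a stand-in for the open file (an assumption made so the example needs no file on disk; a real file object iterates the same way). It is written for Python 3.

```python
import io

# Stand-in for the open log file; iterating it yields one line at a time.
f = io.StringIO("line 1\nline 2\nline 3\nline 4\n")

pairs = []
for a in f:
    # The inner loop pulls from the same iterator as the outer loop,
    # so it consumes every remaining line.
    for b in f:
        pairs.append((a.strip(), b.strip()))

# Nothing is left, so the outer loop ended after its first pass.
leftover = f.read()
print(pairs)
print(repr(leftover))
```

Only "line 1" is ever paired with anything, which mirrors the single match in the question's output.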

This isn't specific to files; it's how all iterators work in Python. There are two general ways to deal with this for any iterator, plus one file-specific solution.

First, there's itertools.tee. This takes an iterator, and returns two iterators, each of which can be advanced independently. Under the covers, it obviously has to use some storage to handle things if they get out of sync, which is why the documentation says this:

In general, if one iterator uses most or all of the data before another iterator starts, it is faster to use list() instead of tee().
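
To make the behavior of tee concrete, here is a minimal sketch on a short list of made-up strings (Python 3 syntax). It only shows that the two returned iterators advance independently; it is not a full solution to the nested-loop problem.

```python
import itertools

lines = ['line 1', 'line 2', 'line 3']

# tee() wraps one iterator and hands back two independent ones.
outer, inner = itertools.tee(iter(lines))

first = next(outer)         # advance only the first iterator
seen_by_inner = list(inner) # the second one still sees every item
rest_of_outer = list(outer) # the first one continues where it left off

print(first, seen_by_inner, rest_of_outer)
```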

And that's the other option: Read the whole iterator into a list, so you can loop over slices.

This is clearly one of those cases where one iterator uses most of the data while the other one is sitting around waiting. For example, the first time through the inner loop, you're reading lines 2-20000 before the outer loop reads line 2. So a list is the better option here:

f = open('t.log','r')
contents = list(f)
f.close()
for idx, line in enumerate(contents):
    aline = line.replace(',','').split()
    if len(aline)==11:
        for line in contents[idx+1:]:
            bline = line.replace(',','').split()
            if len(bline)==11 and aline[2]==bline[2]:
                print 'a: ', aline
                print 'b: ', bline

Finally, if you have a fancy iterator that can be checkpointed and resumed in some way, you can checkpoint it right before the inner loop, then resume it right after. And fortunately, files happen to have such a thing: tell returns the current file position, and seek jumps to a specified position. (There's a big warning saying that "If the file is opened in text mode (without 'b'), only offsets returned by tell() are legal." But that's fine; you're only using offsets returned by tell here.)

So:

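f = open('t.log','r')
# Read with readline() instead of "for line in f": file iteration reads ahead
# in chunks, so tell() would not give the position just after the current line.
for line in iter(f.readline, ''):
    aline = line.replace(',','').split()
    if len(aline)==11:
        pos = f.tell()
        for line in iter(f.readline, ''):
            bline = line.replace(',','').split()
            if len(bline)==11 and aline[2]==bline[2]:
                print 'a: ', aline
                print 'b: ', bline
        f.seek(pos)
f.close()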
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow