(python) find matching loglines

Question 1

A nested loop makes the algorithm O(N^2), even if the inner loop starts from a later index. Here is an approach that is O(N) on average and does not use a nested loop; it keeps the currently logged-on users in a dict instead.

It also tries to handle some cases of unmatched transactions, assuming that a user's log-on must be followed by that user's log-off before the same user logs on again.

log_lines =[('2014-01-28 16:54:58', 'LOGON', 'jane', 'machinename'),
('2014-01-28 17:50:18', 'LOGOFF', 'jane', 'machinename'),
('2014-01-28 19:53:02', 'LOGON', 'skip', 'machinename'),
('2014-01-28 19:54:12', 'LOGOFF', 'skip', 'machinename'),
('2014-01-29 09:41:52', 'LOGON', 'jim', 'machinename'),
('2014-01-29 09:42:45', 'LOGOFF', 'jim', 'machinename'),
('2014-01-29 11:59:20', 'LOGON', 'skip', 'machinename'),
('2014-01-29 12:00:52', 'LOGOFF', 'skip', 'machinename'),
# Following are made up, weird logs
('2014-01-29 12:00:52', 'LOGOFF', 'dooz', 'machinename'),
('2014-01-29 12:00:52', 'LOGOFF', 'booz', 'machinename'),
('2014-01-29 12:00:52', 'LOGON', 'fooz', 'machinename'),]

from pprint import pprint

logged_in = {}
transactions_matched = []
transactions_weird = []
for line in log_lines:
    action = line[1]
    user = line[2]
    if action == 'LOGON':
        if user not in logged_in:
            logged_in[user] = line
        else: # Abnormal case 1: LOGON again while the user is already logged on
            transactions_weird.append(logged_in.pop(user))
            logged_in[user] = line
    elif action == 'LOGOFF':
        if user in logged_in:
            transactions_matched.append((logged_in.pop(user), line))
        else: # Abnormal case 2: LOGOFF for a user who never logged on
            transactions_weird.append(line)

# Dangling log-on actions (no LOGOFF seen) are also considered abnormal
transactions_weird.extend(logged_in.values())

print('Matched:')
pprint(transactions_matched)
print('Weird:')
pprint(transactions_weird)

Output:

Matched:
[(('2014-01-28 16:54:58', 'LOGON', 'jane', 'machinename'),
  ('2014-01-28 17:50:18', 'LOGOFF', 'jane', 'machinename')),
 (('2014-01-28 19:53:02', 'LOGON', 'skip', 'machinename'),
  ('2014-01-28 19:54:12', 'LOGOFF', 'skip', 'machinename')),
 (('2014-01-29 09:41:52', 'LOGON', 'jim', 'machinename'),
  ('2014-01-29 09:42:45', 'LOGOFF', 'jim', 'machinename')),
 (('2014-01-29 11:59:20', 'LOGON', 'skip', 'machinename'),
  ('2014-01-29 12:00:52', 'LOGOFF', 'skip', 'machinename'))]
Weird:
[('2014-01-29 12:00:52', 'LOGOFF', 'dooz', 'machinename'),
 ('2014-01-29 12:00:52', 'LOGOFF', 'booz', 'machinename'),
 ('2014-01-29 12:00:52', 'LOGON', 'fooz', 'machinename')]

Question 2

First, logon[0] returns the timestamp. Use logon[1] to retrieve the action (LOGON or LOGOFF). Then, in your condition, use logon[2] to retrieve the user name; logon[3] is the machine name.
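As a small illustration of those positions, here is a sketch that assumes each log line is a 4-tuple of (timestamp, action, user, machine), like the sample data in this thread. The variable names are only illustrative. Unpacking the tuple into names is one way to avoid index mix-ups.

```python
# Assumed layout: (timestamp, action, user, machine)
logon = ('2014-01-28 16:54:58', 'LOGON', 'jane', 'machinename')

# Access by index
timestamp = logon[0]
action = logon[1]
user = logon[2]
machine = logon[3]

# Equivalent: unpack all four fields at once
timestamp2, action2, user2, machine2 = logon

print('{} {}'.format(action, user))
```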

Question 3

Try using next and a slice of your log_lines starting at the next line:

for i, line in enumerate(log_lines):
    if line[1] == 'LOGON':
        found = next(j for j,search in enumerate(log_lines[i+1:],i+1) 
            if search[1] == 'LOGOFF' and line[2] == search[2])
        print('found {} logoff match at index {}'.format(line[2],found))

output:

found jane logoff match at index 1
found skip logoff match at index 3
found jim logoff match at index 5
found skip logoff match at index 7

This starts the search at the next line instead of iterating over the whole list looking for 'LOGOFF', and it stops as soon as a match is found. next also gives you some flexibility, since you can pass it a default value to return if the generator expression is exhausted without finding a match.

i.e.

found = next((j for j,search in enumerate(log_lines[i+1:],i+1) 
            if search[1] == 'LOGOFF' and line[2] == search[2]), None)

If we reach the end of the list and the user hasn't logged off yet, we get None back instead of a StopIteration error.
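Putting the default into the full loop might look like the sketch below. The helper name find_logoff_indices and the shortened sample list are made up for illustration; the tuple layout (timestamp, action, user, machine) is assumed from the data in this thread.

```python
def find_logoff_indices(log_lines):
    """Return (user, index of the next matching LOGOFF or None) per LOGON line."""
    results = []
    for i, line in enumerate(log_lines):
        if line[1] != 'LOGON':
            continue
        # Search only the lines after the LOGON; None means no LOGOFF was found
        found = next(
            (j for j, search in enumerate(log_lines[i + 1:], i + 1)
             if search[1] == 'LOGOFF' and search[2] == line[2]),
            None)
        results.append((line[2], found))
    return results


# Shortened sample: the last LOGON has no matching LOGOFF
sample = [
    ('2014-01-28 16:54:58', 'LOGON', 'jane', 'machinename'),
    ('2014-01-28 17:50:18', 'LOGOFF', 'jane', 'machinename'),
    ('2014-01-28 19:53:02', 'LOGON', 'skip', 'machinename'),
    ('2014-01-28 19:54:12', 'LOGOFF', 'skip', 'machinename'),
    ('2014-01-29 12:00:52', 'LOGON', 'fooz', 'machinename'),
]

for user, index in find_logoff_indices(sample):
    if index is None:
        print('{} has no matching logoff'.format(user))
    else:
        print('found {} logoff match at index {}'.format(user, index))
```

Each slice copies the remainder of the list, so this is still quadratic in the worst case; it is meant to show the default value, not to be the fastest option.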

Note that this approach handles the same user logging on/off multiple times. Your algorithm doesn't handle that so well!

Question 4

Using slice:

for i, l in enumerate(log_lines):
    if l[1] == 'LOGON':
        # enumerate gives the position directly; log_lines.index(l) would rescan the list
        for item in log_lines[i+1:]:
            if (l[2]==item[2]) and (item[1]=='LOGOFF'):
                print('{} found log on and log off'.format(l[2]))
                break  # stop at the first matching LOGOFF so each LOGON is reported once

output:

jane found log on and log off
skip found log on and log off
jim found log on and log off
skip found log on and log off

Question 5

Your algorithm isn't terrible. You can tighten it a little by using indices, for example:

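for i in range(len(log_lines)):
    if log_lines[i][1] == 'LOGON':  # index 1 holds the action
        name = log_lines[i][2]  # index 2 holds the user name
        # start at the line after the LOGON, not at the beginning
        for j in range(i + 1, len(log_lines)):
            if log_lines[j][1] == 'LOGOFF' and log_lines[j][2] == name:
                print(log_lines[j])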

Doing it this way roughly halves the number of comparisons on average, although the algorithm is still O(N^2). Note that the inner loop starts at the next line, not at the beginning again.
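To make the "half" claim concrete, here is a rough sketch that counts inner-loop iterations for both starting points. The helper count_inner_checks is hypothetical, 'ts' is a placeholder timestamp, and the count assumes the inner loop has no early exit.

```python
def count_inner_checks(log_lines, start_at_next):
    """Count inner-loop iterations of the nested search (no early exit)."""
    checks = 0
    n = len(log_lines)
    for i in range(n):
        if log_lines[i][1] == 'LOGON':
            # Either resume after the LOGON line or rescan from the top
            start = i + 1 if start_at_next else 0
            for j in range(start, n):
                checks += 1
    return checks


# Four well-formed sessions: LOGON lines sit at indices 0, 2, 4 and 6
log_lines = []
for user in ['jane', 'skip', 'jim', 'skip']:
    log_lines.append(('ts', 'LOGON', user, 'machinename'))
    log_lines.append(('ts', 'LOGOFF', user, 'machinename'))

print(count_inner_checks(log_lines, start_at_next=False))  # 4 * 8 = 32
print(count_inner_checks(log_lines, start_at_next=True))   # 7 + 5 + 3 + 1 = 16
```

For this sample the count drops from 32 to 16. The growth is still quadratic, so the dict-based approach is likely the better choice for long logs.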