Question

I have a script that reads a file, but not all the lines in the file have the same format, and I'd like to extract information only from the lines that contain I Doc O:.

I tried an if condition, but the script still fails on lines that the regex does not match. Here's my code:

#!/usr/bin/env python 

# -*- coding: utf-8 -*-

import re 

def extraire(data):
    ms = re.match(r'(\S+).*?(O:\S+).*(R:\S+).*mid:(\d+)', data) # heure & mid 
    return {'Heure':ms.group(1), 'mid':ms.group(2),"Origine":ms.group(3),"Destination":ms.group(4)}

tableau = []  

fichier = open("/home/TEST/file.log")
f = fichier.readlines() 
for line in f: 
    if (re.findall(".*I Doc O:.*",line)):
        tableau = [extraire(line) for line in f ]

print tableau
fichier.close()

And here are some sample lines from my file. I want only the first and fourth lines:

01:09:25.258 mta         Messages       I Doc O:NVS:SMTP/alarm@yyy.xx R:NVS:SMS/+654811 mid:6261
01:09:41.965 mta         Messages       I Rep O:NVS:SMTP/alarmes.techniques@xxx.de R:NVS:SMS/+455451 mid:6261
01:09:41.965 mta         Messages       I Rep 6261 OK, Accepted (ID: 26)
08:14:14.469 mta         Messages       I Doc O:NVS:SMTP/alarm@xxxx.en R:NVS:SMS/+654646 mid:6262
08:14:30.630 mta         Messages       I Rep O:NVS:SMTP/alarm@azea.er R:NVS:SMS/+33688704859 mid:6262
08:14:30.630 mta         Messages       I Rep 6262 OK, Accepted (ID: 28)

Solution

From: http://docs.python.org/2/library/re.html

*?, +?, ?? The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn’t desired; if the RE <.*> is matched against ...
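
To make the greedy versus non-greedy difference concrete, here is a minimal sketch. The sample string is made up for illustration; it is not the example used in the documentation.

```python
import re

# Illustrative sample string (an assumption chosen for this sketch).
sample = "<a><b>"

# Greedy: .* takes as much as it can, so the match runs to the last '>'.
greedy = re.match(r"<.*>", sample).group()

# Non-greedy: .*? stops at the first '>' that lets the pattern succeed.
lazy = re.match(r"<.*?>", sample).group()
```

Here greedy covers the whole string, while lazy stops after the first tag.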

Also, findall works best on the entire buffer and returns a list of matches, so looping over that list saves you from testing each line of the file with a conditional.

buff = fichier.read()
matches = re.findall(".*?I Doc O:.*", buff)
for match in matches:
    tableau = ...
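
Running findall on the whole buffer still gives you one match per line because, by default, '.' does not match a newline. A small sketch of that behaviour, assuming a made-up two-line buffer shortened from the log format in the question:

```python
import re

# Hypothetical two-line buffer, shortened from the log format in the question.
buff = "01:09:25.258 I Doc O:a R:b mid:1\n01:09:41.965 I Rep O:c R:d mid:1\n"

# '.' stops at '\n' by default, so only the line containing "I Doc O:" comes back.
per_line = re.findall(r".*?I Doc O:.*", buff)

# With re.DOTALL, '.' also matches '\n', so the trailing .* runs to the end of the buffer.
with_dotall = re.findall(r".*?I Doc O:.*", buff, re.DOTALL)
```

So per_line holds just the first line, whereas with_dotall holds a single match spanning the whole buffer, which is why the default flags are the ones you want here.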

Here is my test code. Could you tell me what it does that you didn't want?

>>> import re
>>> a = """
... 01:09:25.258 mta         Messages       I Doc O:NVS:SMTP/alarm@yyy.xx R:NVS:SMS/+654811 mid:6261
... 01:09:41.965 mta         Messages       I Rep O:NVS:SMTP/alarmes.techniques@xxx.de R:NVS:SMS/+455451 mid:6261
... 01:09:41.965 mta         Messages       I Rep 6261 OK, Accepted (ID: 26)
... 08:14:14.469 mta         Messages       I Doc O:NVS:SMTP/alarm@xxxx.en R:NVS:SMS/+654646 mid:6262
... 08:14:30.630 mta         Messages       I Rep O:NVS:SMTP/alarm@azea.er R:NVS:SMS/+33688704859 mid:6262
... 08:14:30.630 mta         Messages       I Rep 6262 OK, Accepted (ID: 28)"""
>>> m = re.findall(".*?I Doc O:.*",a)
>>> m
['01:09:25.258 mta         Messages       I Doc O:NVS:SMTP/alarm@yyy.xx R:NVS:SMS/+654811 mid:6261', '08:14:14.469 mta         Messages       I Doc O:NVS:SMTP/alarm@xxxx.en R:NVS:SMS/+654646 mid:6262']

>>> tableau = []
>>> for line in m:
...     tableau.append( extraire(line) )
... 
>>> tableau
[{'Origine': 'R:NVS:SMS/+654811', 'Destination': '6261', 'Heure': '01:09:25.258', 'mid': 'O:NVS:SMTP/alarm@yyy.xx'}, {'Origine': 'R:NVS:SMS/+654646', 'Destination': '6262', 'Heure': '08:14:14.469', 'mid': 'O:NVS:SMTP/alarm@xxxx.en'}]

You could also do this in a single line:

>>> tableau = [ extraire(line) for line in re.findall( ".*?I Doc O:.*", fichier.read() ) ]
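
If you would rather keep the line-by-line loop from the question, another option is to have the extraction function return None for lines that do not match and skip those. This is only a sketch: the names extraire_doc, extraire_tout and DOC_RE, the named groups, and the lowercase dictionary keys are my own assumptions, not part of the original code.

```python
import re

# Anchoring the pattern on "I Doc" means other lines simply fail to match.
# The group names are illustrative assumptions.
DOC_RE = re.compile(
    r"(?P<heure>\S+).*?I Doc (?P<origine>O:\S+)\s+"
    r"(?P<destination>R:\S+)\s+mid:(?P<mid>\d+)"
)

def extraire_doc(line):
    """Return a dict for an 'I Doc' line, or None for any other line."""
    ms = DOC_RE.match(line)
    if ms is None:
        return None
    return ms.groupdict()

def extraire_tout(lines):
    """Extract every line and keep only the ones that matched."""
    resultats = (extraire_doc(line) for line in lines)
    return [r for r in resultats if r is not None]

# Two sample lines taken from the question: one "I Doc", one "I Rep".
lignes = [
    "01:09:25.258 mta         Messages       I Doc O:NVS:SMTP/alarm@yyy.xx R:NVS:SMS/+654811 mid:6261",
    "01:09:41.965 mta         Messages       I Rep 6261 OK, Accepted (ID: 26)",
]
tableau = extraire_tout(lignes)
```

Checking for None before calling group is what avoids the failure on non-matching lines. With a real file you would pass the open file object (or the result of readlines) to extraire_tout instead of the sample list.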
Licensed under: CC-BY-SA with attribution