I was able to extract the data you wanted with the regex below, using Python's re.S
flag (it lets . match newlines, so the multi-line To field is captured whole).
r'(From:.*).*(To:.*).*(CC:.*).*(Subject:.*).*(Date:.*)'
You could do something like this:
In [1]: data = '''
...: From: "John Smith" <jsmith@jsmith.com>
...: To: <john.doe.1@gmail.com>, <john.doe.2@gmail.com>,
...: <john.doe.3@gmail.com>, <john.doe.4@gmail.com>,
...: <john.doe.6@yahoo.com>, <john.doe.5@gmail.com>, <jdoe@live.com>,
...: <j.doe.5@live.com>
...: CC:
...: Subject: Test Email Extraction
...: Date: Sun, 6 Apr 2014 19:30:55 -0400
...: -----------------
...: Testing Email extraction.
...: '''
In [2]: import re
In [3]: results = re.findall(r'(From:.*).*(To:.*).*(CC:.*).*(Subject:.*).*(Date:.*)', data, re.S)
In [4]: headers = ['From', 'To', 'CC', 'Subject', 'Date']
In [6]: data = [item.strip() for item in results[0]]
In [7]: data
Out[7]:
['From: "John Smith" <jsmith@jsmith.com>',
'To: <john.doe.1@gmail.com>, <john.doe.2@gmail.com>,\n<john.doe.3@gmail.com>, <john.doe.4@gmail.com>,\n<john.doe.6@yahoo.com>, <john.doe.5@gmail.com>, <jdoe@live.com>,\n<j.doe.5@live.com>',
'CC:',
'Subject: Test Email Extraction',
'Date: Sun, 6 Apr 2014 19:30:55 -0400\n-----------------\nTesting Email extraction.']
You now have the results in the data list. Use the csv module with \t as the delimiter to write out the headers and the data in the format you want. Some items still contain \n characters (and the Date item also carries the message body, because the last group runs to the end of the string), but you can strip those out by iterating over the list before writing to the file.
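The csv step described above could look roughly like the sketch below. It is only an illustration under a few assumptions: the sample message is shortened, the names raw, clean and buffer are made up for this example, and removing the "Header:" prefix from each value is my own choice because the header row already names the columns.

```python
import csv
import io
import re

# Shortened copy of the sample message (assumption: one message per string).
raw = '''
From: "John Smith" <jsmith@jsmith.com>
To: <john.doe.1@gmail.com>, <john.doe.2@gmail.com>,
<john.doe.3@gmail.com>
CC:
Subject: Test Email Extraction
Date: Sun, 6 Apr 2014 19:30:55 -0400
-----------------
Testing Email extraction.
'''

pattern = r'(From:.*).*(To:.*).*(CC:.*).*(Subject:.*).*(Date:.*)'
headers = ['From', 'To', 'CC', 'Subject', 'Date']


def clean(item, header):
    """Drop the 'Header:' prefix and collapse newlines/extra spaces."""
    value = item.strip()[len(header) + 1:]
    return ' '.join(value.split())


results = re.findall(pattern, raw, re.S)
row = [clean(item, header) for item, header in zip(results[0], headers)]

# Write to an in-memory buffer here; a real file opened with
# open(path, 'w', newline='') can be passed to csv.writer the same way.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter='\t', lineterminator='\n')
writer.writerow(headers)
writer.writerow(row)
output = buffer.getvalue()
print(output)
```

Note that the csv module quotes the From value because it contains double quotes, and the Date column still includes the body text collapsed onto one line. If you only want the date itself, keep just the first line of that item before cleaning it.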
Hope this helps.