Question

I have text files of emails from which I need to extract the "From", "To", "CC", "Subject" and "Date" fields and write them to a CSV in the following format:

Date    Subject    From    To    CC

The files are similar to this:

From: "John Smith" <jsmith@jsmith.com>
To: <john.doe.1@gmail.com>, <john.doe.2@gmail.com>,
<john.doe.3@gmail.com>, <john.doe.4@gmail.com>,
<john.doe.6@yahoo.com>, <john.doe.5@gmail.com>, <jdoe@live.com>,
<j.doe.5@live.com>
CC: 
Subject: Test Email Extraction
Date: Sun, 6 Apr 2014 19:30:55 -0400
-----------------
Testing Email extraction.

The problem that I run into is that the "To" and "CC" lines almost always have many entries, taking up multiple lines.

I thought the solution to extracting this info to put into the CSV would be to use a REGEX but I have had no luck at all...

Not even getting close.

Any suggestions?


Solution

I was able to get the data you want with the regex below, using Python's re.S flag so that . also matches newlines and a field can span several lines.

r'(From:.*).*(To:.*).*(CC:.*).*(Subject:.*).*(Date:.*)'
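To see why re.S matters here, a small sketch may help. The string and variable names below are made up for illustration; the point is only that . stops at a newline unless the flag is given.

```python
import re

# Hypothetical two-line "To" field followed by the next field name.
text = 'To: a@x.com,\nb@x.com\nCC:'

# Without re.S, "." cannot cross the line break, so the pattern never reaches "CC:".
without_flag = re.search(r'To:(.*)CC:', text)

# With re.S, "." matches newlines too, so the whole multi-line field is captured.
with_flag = re.search(r'To:(.*)CC:', text, re.S)

print(without_flag)
print(repr(with_flag.group(1)))
```

The first search finds nothing, while the second captures both address lines, newline included.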

You could do something like this:

In [1]: data = '''
   ...: From: "John Smith" <jsmith@jsmith.com>
   ...: To: <john.doe.1@gmail.com>, <john.doe.2@gmail.com>,
   ...: <john.doe.3@gmail.com>, <john.doe.4@gmail.com>,
   ...: <john.doe.6@yahoo.com>, <john.doe.5@gmail.com>, <jdoe@live.com>,
   ...: <j.doe.5@live.com>
   ...: CC:
   ...: Subject: Test Email Extraction
   ...: Date: Sun, 6 Apr 2014 19:30:55 -0400
   ...: -----------------
   ...: Testing Email extraction.
   ...: '''
In [2]: import re
In [3]: results = re.findall(r'(From:.*).*(To:.*).*(CC:.*).*(Subject:.*).*(Date:.*)', data, re.S)
In [4]: headers = ['From', 'To', 'CC', 'Subject', 'Date']
In [6]: data = [item.strip() for item in results[0]]
In [7]: data
Out[7]:
['From: "John Smith" <jsmith@jsmith.com>',
 'To: <john.doe.1@gmail.com>, <john.doe.2@gmail.com>,\n<john.doe.3@gmail.com>, <john.doe.4@gmail.com>,\n<john.doe.6@yahoo.com>, <john.doe.5@gmail.com>, <jdoe@live.com>,\n<j.doe.5@live.com>',
 'CC:',
 'Subject: Test Email Extraction',
 'Date: Sun, 6 Apr 2014 19:30:55 -0400\n-----------------\nTesting Email extraction.']

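You now have the results in the data list. Use the csv module with \t as the delimiter and write out the headers and the data in the format you want. Each item still carries its field-name prefix, and there are \ns in there, but you can strip those out by traversing through the items in the list before writing to the file. Also note that because .* is greedy and re.S lets it cross line breaks, the Date item picks up the separator line and the message body as well, so keep only its first line.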
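Putting the cleanup and the csv step together, one possible sketch looks like this. It is a variation on the regex above, not the same pattern: it uses non-greedy named groups anchored at line starts and stops Date at the end of its line. It assumes every file has the five fields in the order From, To, CC, Subject, Date, each starting a line, as in the sample. PATTERN, COLUMNS, extract_fields and write_rows are names invented for this example.

```python
import csv
import io
import re

SAMPLE = '''From: "John Smith" <jsmith@jsmith.com>
To: <john.doe.1@gmail.com>, <john.doe.2@gmail.com>,
<john.doe.3@gmail.com>, <john.doe.4@gmail.com>,
<john.doe.6@yahoo.com>, <john.doe.5@gmail.com>, <jdoe@live.com>,
<j.doe.5@live.com>
CC: 
Subject: Test Email Extraction
Date: Sun, 6 Apr 2014 19:30:55 -0400
-----------------
Testing Email extraction.
'''

# Each group is non-greedy and ends where the next field name starts a line.
# Date stops at the end of its own line so the body is not captured.
# Assumption: fields always appear in this order.
PATTERN = re.compile(
    r'^From:(?P<From>.*?)^To:(?P<To>.*?)^CC:(?P<CC>.*?)'
    r'^Subject:(?P<Subject>.*?)^Date:(?P<Date>[^\n]*)',
    re.S | re.M,
)

# Column order requested in the question.
COLUMNS = ['Date', 'Subject', 'From', 'To', 'CC']


def extract_fields(text):
    """Return a dict of the five header fields, or None if they are not found."""
    match = PATTERN.search(text)
    if match is None:
        return None
    # split() + join collapses newlines and repeated spaces into single spaces.
    return {name: ' '.join(value.split())
            for name, value in match.groupdict().items()}


def write_rows(rows, out):
    """Write the header line and one tab-delimited line per email to out."""
    writer = csv.DictWriter(out, fieldnames=COLUMNS, delimiter='\t')
    writer.writeheader()
    writer.writerows(rows)


if __name__ == '__main__':
    buffer = io.StringIO()
    write_rows([extract_fields(SAMPLE)], buffer)
    print(buffer.getvalue())
```

To process real files, read each one into a string, call extract_fields on it, skip any None results, and pass the collected dicts to write_rows with a file opened using newline=''. The csv module will quote the From value because it contains double quotes.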

Hope this helps.

Licensed under: CC-BY-SA with attribution