Question

I need to do two things at this point but I need your help:

  1. A best practice to clean up data - programmatically deleting superfluous tags & the '>>>>>>>' quote markers, plus other non-meaningful communication flotsam and jetsam
  2. Once it's cleaned - how do I pack it up to work nicely with Django & SQLite?
    • Do I make it into a CSV based on date, person, subject, and words, then load those into my data classes within my database?

Well, before I get into the database, I'd like to be able to sort and display the data cleanly. I have very little experience putting things into databases; the closest I come is working from XML, CSV and JSON.

I need to have the ngrams ranked, for example how many times a certain word shows up in a series of emails from a person. I'm trying to get closer to knowing the streams of how people are talking to me about subjects, etc. - a very elementary version of Jon Kleinberg's work analyzing his own emails.
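For the ranking step itself, the standard library's `collections.Counter` covers this directly; a minimal sketch, with made-up sample text (the `rank_words` helper name is my own):

```python
from collections import Counter

def rank_words(text, top=5):
    """Return the `top` most frequent whitespace-separated tokens."""
    counts = Counter(text.split())
    return counts.most_common(top)

sample = "we talk about email and email talks about us"
print(rank_words(sample, top=3))
```

`most_common(n)` returns `(token, count)` pairs sorted by descending count, which is exactly the "word frequency per sender" ranking once you feed it the cleaned body of each person's messages.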

be gentle, be rough but please be helpful :)!

> My output currently looks like this: : 1, 'each': 1, 'Me': 1, 'IN!\r\n\r\n2012/1/31': 1, 'calculator.\r\n>>>>>>\r\n>>>>>>': 1, 'people': 1, '=97MB\r\n>\r\n>': 1, 'we': 2, 'wrote:\r\n>>>>>>\r\n>>>>>>': 1, '=\r\nwrote:\r\n>>>>>\r\n>>>>>>': 1, '2012/1/31': 2, 'are': 1, '31,': 5, '=97MB\r\n>>>>\r\n>>>>': 1, '1:45': 1, 'be\r\n>>>>>': 1, 'Sent':

import getpass, imaplib, email

# NGramCounter builds a dictionary relating ngrams (as tuples) to the number
# of times that ngram occurs in a text (as integers)
class NGramCounter(object):

  # text is the string to analyze; ngrams maps each token to its count
  def __init__(self, text):
    self.text = text
    self.ngrams = dict()

  # tokenize breaks the stored text up into whitespace-separated units
  def tokenize(self):
    return self.text.split(" ")

  # parse tokenizes the text and visits every token in turn, adding the
  # token to self.ngrams or incrementing its count if already present
  def parse(self):

    tokens = self.tokenize()
    #Moves through every individual word in the text, increments counter if already found
    #else sets count to 1
    for word in tokens:
        if word in self.ngrams:
            self.ngrams[word] += 1
        else:
            self.ngrams[word] = 1

  def get_ngrams(self):
    return self.ngrams

#loading profile for login
M = imaplib.IMAP4_SSL('imap.gmail.com')
M.login("EMAIL", "PASS")
M.select()
new = open('liamartinez.txt', 'w')
typ, data = M.search(None, 'FROM', 'SEARCHGOES_HERE') #Gets the ids of all messages from the given sender

def get_first_text_part(msg): #where should this be nested? 
    maintype = msg.get_content_maintype()
    if maintype == 'multipart':
        for part in msg.get_payload():
            if part.get_content_maintype() == 'text':
                return part.get_payload()
    elif maintype == 'text':
        return msg.get_payload()

for num in data[0].split(): #Loops through all messages
    typ, data = M.fetch(num, '(RFC822)') #Pulls Message
    msg = email.message_from_string(data[0][1]) #Parses the raw message (second item of the fetch tuple) into an easy-to-use Python object
    _from =  msg['from'] #pull from
    _to = msg['to'] #pull to
    _subject = msg['subject'] #pull subject
    _body = get_first_text_part(msg) #pull body
    if _body:
        ngrams = NGramCounter(_body)
        ngrams.parse()
        _feed = ngrams.get_ngrams()
        # print "\n".join("\t".join(str(_feed) for col in row) for row in tab)
        print _feed
    # print 'Content-Type:',msg.get_content_type()
    #     print _from
    #     print _to
    #     print _subject
    #     print _body
    #    

    new.write(_from + '\n') #one sender address per line

    print '---------------------------------'

M.close()
M.logout()

Solution

There is nothing wrong with your main loop. The process, though, is somewhat slow because you need to retrieve all your emails from an external server. What I'd suggest is to download all the messages to the client once, save them into a database (SQLite, ZODB, MongoDB... whichever you prefer), and then perform all the analysis you want on the database objects afterwards. The two processes (downloading and analyzing) are better kept apart from each other; otherwise tuning them up would become complicated and code complexity would increase.
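A minimal sketch of that download-then-analyze split using the standard library's `sqlite3` (the table layout, column names, and sample data are my own assumptions, not part of the original code):

```python
import sqlite3

# One-time setup: a table to hold raw messages for later analysis.
conn = sqlite3.connect(":memory:")  # use a file path for a persistent database
conn.execute("""CREATE TABLE IF NOT EXISTS messages (
                    id INTEGER PRIMARY KEY,
                    sender TEXT, subject TEXT, body TEXT)""")

def save_message(sender, subject, body):
    """Store one raw message; called from inside the IMAP fetch loop."""
    conn.execute("INSERT INTO messages (sender, subject, body) VALUES (?, ?, ?)",
                 (sender, subject, body))
    conn.commit()

# The downloading phase fills the table; the analysis phase then reads from it:
save_message("liam@example.com", "hello", "some body text")
rows = conn.execute("SELECT sender, body FROM messages").fetchall()
print(rows)
```

With the raw messages stored once, you can rerun the cleanup and ngram counting as often as you like without touching the IMAP server again.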

OTHER TIPS

replace

if _body:
    ngrams = NGramCounter(_body)
    ngrams.parse()
    _feed = ngrams.get_ngrams()
    # print "\n".join("\t".join(str(_feed) for col in row) for row in tab)
    print _feed

with

if _body:
    ngrams = NGramCounter(" ".join(_body.strip(">").split()))
    ngrams.parse()
    _feed = ngrams.get_ngrams()
    print _feed
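Note that `strip(">")` only removes `>` characters at the very start and end of the whole string, so `'>>>>>>'` tokens in the middle of the body survive. If that isn't enough, a regex pass along these lines may be more thorough; the patterns are my own assumptions about what counts as quoting noise:

```python
import re

def clean_body(body):
    """Drop quoted-reply markers and collapse whitespace (heuristic)."""
    # remove chains of '>' at the start of each line
    body = re.sub(r"^\s*>+\s?", "", body, flags=re.MULTILINE)
    # rejoin quoted-printable soft line breaks ('=' at end of line)
    body = re.sub(r"=\r?\n", "", body)
    return " ".join(body.split())

raw = ">>>>> we talked\r\n>>>>> about the calc=\r\nulator today"
print(clean_body(raw))
```

Running the body through `clean_body` before handing it to `NGramCounter` keeps tokens like `'calculator.\r\n>>>>>>'` out of the counts.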
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow