Question

The goal is to remove a bunch of email messages using imaplib. The folder receives approximately 300k new messages a month, and only messages older than one month should be deleted. The script below does delete old messages, but deletion takes several hours, and the simple for loop does not look efficient. Trying to increase the speed with multiprocessing raises an error.

What can you advise to improve the speed of deleting big amount of messages?

import sys
import datetime
from imaplib import IMAP4

# get the date one month before the current date
monthbefore = (datetime.date.today() - datetime.timedelta(days=30)).strftime("%d-%b-%Y")

m = IMAP4('mail.domain.com')
m.login('user@domain.com', 'password')

# select the folder; select() also reports how many messages it contains
typ, data = m.select('Folder')
print typ, data

# find old messages
typ, data = m.search(None, '(BEFORE %s)' % (monthbefore))

# mark them for deletion
print "Will be removed:\t", len(data[0].split()), "messages"
for num in data[0].split():
  m.store(num, '+FLAGS', '\\Deleted')
  sys.stderr.write('\rRemoving message:\t %s' % num)

# now expunge marked for deletion messages, close connection and exit
print "\nGet ready for expunge"
m.expunge()
print "Expunged! Quiting."
m.close()
m.logout()

Update: I rewrote part of the code; here is a variant that works about 1000 times faster (my server supports STORE commands addressing more than 1000 messages at a time):

    def chunks(l, n):
        # yields successive n-sized chunks from l.
        for i in xrange(0, len(l), n):
            yield l[i:i+n]

    msg_ids = data[0].split()
    mcount = len(msg_ids)
    print "Will be removed:", mcount, "messages"
    done = 0
    for chunk in chunks(msg_ids, 1000):
        m.store(",".join(chunk), '+FLAGS', '\\Deleted')
        done += len(chunk)
        # use float division so the percentage is not truncated to 0
        sys.stderr.write('\rdone {0:.2f}%'.format(100.0 * done / mcount))

Solution

I think the main problem here is that you're calling STORE for each message. Each one of those round trips to the server takes time and when you're doing lots of deletions this really adds up.

To avoid all those calls to STORE, try calling it with multiple message ids. You can pass a comma-separated list (e.g. "1,2,3,4"), ranges of message ids (e.g. "1:10"), or a combination of both (e.g. "1,2,5,1:10"). Note that most servers seem to have a limit on the number of message ids allowed per call, so you'll probably still need to chunk the ids into blocks (of, say, 200 messages) and call STORE multiple times. This will still be much, much faster than calling STORE per message.

For further reference, see the STORE Command section of RFC 3501. It shows an example of a STORE command taking a range of message ids.
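As a sketch of the range option, runs of consecutive ids can be collapsed into the "start:end" set syntax before being handed to STORE, which keeps each command short. The function name here is my own invention:

```python
def compress_ids(ids):
    """Collapse a sorted list of message ids into IMAP sequence-set
    syntax, e.g. [1, 2, 3, 7, 9, 10] -> "1:3,7,9:10"."""
    parts = []
    start = prev = ids[0]
    for n in ids[1:]:
        if n == prev + 1:      # still inside a consecutive run
            prev = n
            continue
        # close the current run and start a new one
        parts.append(str(start) if start == prev else "%d:%d" % (start, prev))
        start = prev = n
    parts.append(str(start) if start == prev else "%d:%d" % (start, prev))
    return ",".join(parts)

print(compress_ids([1, 2, 3, 7, 9, 10]))  # -> 1:3,7,9:10
```

The compressed string (or chunks of it) can then be passed straight to m.store() in place of a single message id.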

OTHER TIPS

It takes a certain amount of time to do the deletion and if you do them one at a time, it's going to take a long, long time. The overhead of the for loop is minuscule compared to the amount of time you're spending waiting for the server to do its thing. Several hours is not out of line, nor does it strike me as particularly problematic; you have several hours, I am sure. If you don't, just start sooner.

Still, if it's a problem, you're on the right track with threading or multiprocessing. I don't know what you mean by "gives error"; a little more specificity might be nice before giving up on that approach. If you mean your server won't allow multiple simultaneous logins, that can probably be configured on your IMAP server. (I use CommuniGate Pro to handle e-mail for my domain and it allows this.)

Another approach would be to run the deletion script once a day or even once an hour so the time cost is spread out over the month. You might also try POP3 instead of IMAP to see if it's any faster for this application.
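If you do try POP3, note that it has no server-side date search, so you would have to fetch just the headers of each message (the TOP command) and parse the Date header yourself. A minimal sketch, assuming a plain POP3 server (host and credentials are placeholders, and the function is not called here):

```python
import poplib
import time
from email.utils import parsedate


def parse_date_header(value):
    """Return a Unix timestamp for an RFC 2822 Date header value, or None."""
    stamp = parsedate(value)
    return time.mktime(stamp) if stamp else None


def delete_old_pop3(host, user, password, max_age_days=30):
    """Mark messages older than max_age_days for deletion over POP3."""
    cutoff = time.time() - max_age_days * 86400
    conn = poplib.POP3(host)
    conn.user(user)
    conn.pass_(password)
    count = len(conn.list()[1])
    for num in range(1, count + 1):
        # TOP with 0 body lines fetches only the headers
        for line in conn.top(num, 0)[1]:
            if line.lower().startswith(b'date:'):
                stamp = parse_date_header(
                    line[5:].strip().decode('ascii', 'ignore'))
                if stamp is not None and stamp < cutoff:
                    conn.dele(num)  # actually removed at QUIT
                break
    conn.quit()
```

Whether this beats IMAP depends entirely on the server; the per-message TOP round trips may well make it slower.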

I'm afraid there's little you can do. In IMAP, flagging a message as deleted is pretty quick; it's the expunge that's the killer.

And you can't do that in multiprocessing, because only one thread is allowed to lock the mailbox to do the physical expunging.

If you tried to run a multiprocessing delete - on most servers I believe you actually could - you'd speed up dramatically a process that's already very quick. But then the single thread running expunge would need to lock for a long time; depending on the server, you might even be unable to login during this "dead time". Some other servers (I think Icewarp's Merak) will allow normal operations during an expunge (you're simply not allowed to run a second expunge until the first has finished).

UPDATE

I have done a bit of experimenting. I found that to have different connections through imaplib, the login itself has to be moved into a Thread.

So I set up the app like this:

  • the main app logs in and retrieves the list of messages to be deleted
  • divides the messages in N chunks
  • starts N threads that run the login
  • all threads wait for a small time to let every thread complete the login (so that message indexes are the same through all connections); I really should have employed synchronization here
  • each thread starts deleting its assigned portion of the messages, then logs out
  • when all the threads have finished, the main app continues and purges the mailbox
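The steps above can be sketched roughly as follows (host, credentials, and the thread count are placeholders, and the synchronization is simplified; nothing here is called at import time):

```python
import imaplib
import threading

HOST, USER, PASSWORD = 'mail.domain.com', 'user@domain.com', 'password'
FOLDER = 'Folder'
N = 3  # number of worker connections


def split_evenly(ids, n):
    """Divide the id list into n roughly equal chunks, one per thread."""
    k, r = divmod(len(ids), n)
    out, start = [], 0
    for i in range(n):
        end = start + k + (1 if i < r else 0)
        out.append(ids[start:end])
        start = end
    return out


def delete_worker(ids):
    # each thread performs its own login, as required by imaplib
    conn = imaplib.IMAP4(HOST)
    conn.login(USER, PASSWORD)
    conn.select(FOLDER)
    conn.store(','.join(ids), '+FLAGS', '\\Deleted')
    conn.logout()


def threaded_delete(ids):
    threads = [threading.Thread(target=delete_worker, args=(chunk,))
               for chunk in split_evenly(ids, N) if chunk]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # the main connection then runs the single EXPUNGE
```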

I noticed a performance increase with N=2, which increased further at N=3. Then no increase for N=4, i.e., each thread in a set of four took the same time to delete 25 messages as a thread in a set of three took to delete 33 messages. Again no increase for N from 5 up to 7. At N=8 performance started decreasing; at ten the server stopped accepting connections.

In my best scenario I estimate the deletion time to have been around 40% of nominal with three threads running; I'm not sure whether this justifies the trouble.

But these values are in all probability strongly dependent on the server architecture and hardware (how many processors and cores, how much memory), as well as on the maximum number of concurrent connections allowed. So you might get more out of a multithreaded approach than I did.

I also ran some tests server side. Since most IMAP servers ( http://en.wikipedia.org/wiki/Comparison_of_mail_servers ) store their data in a variant of the Maildir format, one file per message, with the message timestamp embedded in the file name, I experimented with a program that deletes any file carrying an older timestamp. This method has the disadvantage of requiring the user to be logged off, but it is very fast.

It would also be possible, and I think it would not really interfere with user operations, to mark files as "to be deleted" (adding 'T' to the info suffix of the file name), so that all that remains is to issue an expunge command to make the server physically remove the files and immediately recalculate the quotas, if any.

Running such a program periodically would accomplish message expiration more efficiently, if that kind of access to the server can be obtained.
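A minimal sketch of such a server-side expiry pass, assuming a standard Maildir layout where filenames begin with the delivery timestamp and carry flags after the ':2,' info separator (the directory path and age threshold are placeholders):

```python
import os
import time


def flag_old_maildir_messages(cur_dir, max_age_days=30):
    """Append the 'T' (trashed) flag to Maildir files older than
    max_age_days, leaving the actual removal to a later expunge."""
    cutoff = time.time() - max_age_days * 86400
    flagged = 0
    for name in os.listdir(cur_dir):
        try:
            # Maildir filenames start with the delivery Unix timestamp
            timestamp = float(name.split('.', 1)[0])
        except ValueError:
            continue  # not a Maildir-style filename
        if timestamp >= cutoff:
            continue  # message is recent enough to keep
        if ':2,' not in name:
            new_name = name + ':2,T'
        elif 'T' in name.split(':2,', 1)[1]:
            continue  # already flagged as trashed
        else:
            new_name = name + 'T'
        os.rename(os.path.join(cur_dir, name),
                  os.path.join(cur_dir, new_name))
        flagged += 1
    return flagged
```

This only covers the common case where 'T' sorts after the existing flags; a production version would keep the flag list in strict ASCII order.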

Throwing one big chunk at it works for me; the email server breaks it down for itself. Tailor this as you wish for non-Gmail IMAP servers:

#!/usr/bin/env python

import datetime
import imaplib

m = imaplib.IMAP4_SSL("imap.gmail.com")  # server to connect to
m.login('gmail@your_gmail.com', 'your_password')

print m.select('[Gmail]/All Mail')  
before_date = (datetime.date.today() - datetime.timedelta(365)).strftime("%d-%b-%Y")  # date string, 04-Jan-2013
typ, data = m.search(None, '(BEFORE {0})'.format(before_date))  

if data != ['']:  # messages exist
    no_msgs = data[0].split()[-1]  # last msg id in the list
    print "To be removed:\t", no_msgs, "messages found with date before", before_date
    m.store("1:{0}".format(no_msgs), '+X-GM-LABELS', '\\Trash')  # move to trash, can also set Delete Flag here instead
    print "Deleted {0} messages. Closing connection & logging out.".format(no_msgs)
else:
    print "Nothing to remove."

# This block empties the Trash; Gmail auto-purges Trash after 30 days anyway.
print("Emptying Trash & Expunge...")
m.select('[Gmail]/Trash')  # select the Trash folder
m.store("1:*", '+FLAGS', '\\Deleted')  # flag all Trash as Deleted
m.expunge()  # not needed if auto-expunge is enabled in Gmail

m.close()
m.logout()
Licensed under: CC-BY-SA with attribution