Question

I have an email client in Django. It currently supports Gmail accounts using imaplib.

My problem is: I want to obtain the attachment names without having to download the full email. Currently, in order to obtain the attachment names, or the email body, I need to download the whole email using the fetch function with the parameter (RFC822).

I know I can obtain specific fields only using HEADER.FIELDS, for the subject, from, cc for example. But is there a way to obtain the attachment names or the email body without downloading the whole email?

What I mean specifically is: let's say I have a 30 MB email that has one line of text in the body and two 15 MB attachments. I want to obtain the attachment names and that line of text without downloading the full 30 MB message.

Thank you


Solution

Assuming you're asking what I think you're asking, here's what to do:

First, fetch the BODYSTRUCTURE. Assuming Gmail's IMAP server supports this, you'll get back something like this:

(("TEXT" "PLAIN" ("CHARSET" "UTF-8") NIL NIL "QUOTED-PRINTABLE" 56 1 NIL NIL NIL NIL)
 ("TEXT" "HTML" ("CHARSET" "UTF-8") ("NAME" "") NIL NIL "BASE64" 12345 NIL
  ("attachment" ("FILENAME" "")) NIL NIL)
 ("IMAGE" "JPEG" ("NAME" "funny picture") NIL NIL "BASE64" 56789 NIL
  ("attachment" ("FILENAME" "image.jpg")) NIL NIL)
 "MIXED" ("BOUNDARY" "----_=_NextPart_001_1234ABCD.56789EF0") NIL NIL NIL)

And then fetch the (BODY ENVELOPE) if the structure has one.

If you look at RFC 3501, section 7.4.2, it explains how to deal with these.

Once you've determined that the (BODY[1]) and (BODY[2]) are the plain-text and HTML versions of the main content, and (BODY[3]) is the first real attachment, you download the plain-text body by fetching (BODY[1]), and you've got the name of the attachment from the structure.
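As a rough sketch of that last step, fetching a single part with imaplib might look like the following. The helper names `fetch_part` and `decode_part` are made up for illustration, `conn` is assumed to be a logged-in `imaplib.IMAP4_SSL` object with a mailbox selected, and the code assumes the server returns the part as a literal, which is the usual case. It uses BODY.PEEK[1] rather than BODY[1] so the message is not flagged as read.

```python
import base64
import quopri


def fetch_part(conn, uid, section):
    """Fetch one MIME part (e.g. section '1') without downloading the rest."""
    # BODY.PEEK[...] is the RFC 3501 variant of BODY[...] that leaves \Seen unset.
    typ, data = conn.uid('fetch', uid, '(BODY.PEEK[%s])' % section)
    if typ != 'OK':
        raise RuntimeError('fetch failed: %r' % (data,))
    # imaplib hands back literals as (header, payload) tuples; the closing
    # parenthesis of the response arrives as a separate bytes item.
    for item in data:
        if isinstance(item, tuple):
            return item[1]
    return b''


def decode_part(raw, encoding):
    """Undo the transfer encoding named in the part's BODYSTRUCTURE entry."""
    encoding = encoding.upper()
    if encoding == 'BASE64':
        return base64.b64decode(raw)
    if encoding == 'QUOTED-PRINTABLE':
        return quopri.decodestring(raw)
    return raw  # 7BIT, 8BIT and BINARY are passed through unchanged
```

The part comes back still transfer-encoded, so the encoding field from the structure ("QUOTED-PRINTABLE" for the plain-text part in the example above) tells you what to pass to `decode_part`.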

Sorry there's no code here. I don't think either imaplib or any of the stdlib MIME- and mail-related modules will do the hard part for you (interpreting the structure), but I haven't actually checked, so I'd look there first, and, if not, go to PyPI to see if anyone else has already written the code.
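For what it's worth, the first half of interpreting the structure (turning the parenthesized text into something Python can walk) is not much code. The sketch below is not a full IMAP parser: `parse_sexp` and `bodystructure_from_response` are hypothetical names, and it assumes the server sends every string quoted rather than as a {n} literal.

```python
import re

# One token per match: a parenthesis, a quoted string, or a bare atom
# (NIL, numbers, keywords such as UID or BODYSTRUCTURE).
_TOKEN = re.compile(r'\(|\)|"((?:[^"\\]|\\.)*)"|([^\s()"]+)')


def parse_sexp(text):
    """Parse IMAP parenthesized lists into nested Python lists.

    Quoted strings become str, NIL becomes None, digit-only atoms become int.
    Returns the list of top-level items found in `text`.
    """
    stack = [[]]
    for match in _TOKEN.finditer(text):
        token = match.group(0)
        if token == '(':
            stack.append([])
        elif token == ')':
            if len(stack) == 1:
                raise ValueError('unbalanced ")"')
            finished = stack.pop()
            stack[-1].append(finished)
        elif token.startswith('"'):
            # Undo the backslash escapes allowed inside quoted strings.
            stack[-1].append(re.sub(r'\\(.)', r'\1', match.group(1)))
        elif token.upper() == 'NIL':
            stack[-1].append(None)
        elif token.isdigit():
            stack[-1].append(int(token))
        else:
            stack[-1].append(token)
    if len(stack) != 1:
        raise ValueError('unbalanced "("')
    return stack[0]


def bodystructure_from_response(line):
    """Pull the BODYSTRUCTURE value out of one FETCH response line."""
    if isinstance(line, bytes):
        line = line.decode('ascii', 'replace')
    items = parse_sexp(line)[-1]  # the list that follows the sequence number
    for position, item in enumerate(items[:-1]):
        if isinstance(item, str) and item.upper() == 'BODYSTRUCTURE':
            return items[position + 1]
    raise ValueError('no BODYSTRUCTURE in response')
```

The result is plain nested lists, so a multipart body shows up as a list whose leading elements are themselves lists (the child parts) followed by the subtype string.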

Well, actually, first I'd just fetch BODYSTRUCTURE, (BODY ENVELOPE) and (BODY[3]) for a specific message to make sure Gmail has complete support before writing a whole mess of code…

PS, if worst comes to worst, and your use case is as simple and rigid as you described, you can just always fetch BODYSTRUCTURE and (BODY[1]), fall back to RFC822 if that fails, and get the attachment names by running a hacky regexp on the structure instead of a real parse. I wouldn't write this for anything but a one-shot script or a quick-and-dirty prototype to learn about Gmail, but for those cases, I probably would.
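The hacky regexp could be as small as this. It is a sketch only, and `attachment_names` is a made-up name: it just picks out every "FILENAME" "value" pair, so it will miss parts that only carry a NAME parameter, RFC 2231 encoded names (FILENAME*), and names sent as literals.

```python
import re

# Matches "FILENAME" "value" pairs; case-insensitive because servers differ.
_FILENAME = re.compile(r'"FILENAME"\s+"((?:[^"\\]|\\.)*)"', re.IGNORECASE)


def attachment_names(bodystructure):
    """Return the non-empty FILENAME values found in a raw BODYSTRUCTURE."""
    if isinstance(bodystructure, bytes):
        bodystructure = bodystructure.decode('utf-8', 'replace')
    # Empty names are dropped so a placeholder such as ("FILENAME" "")
    # does not show up as an attachment.
    return [name for name in _FILENAME.findall(bodystructure) if name]
```

Run against the example structure above, this returns only "image.jpg", since the other FILENAME value is empty.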

OTHER TIPS

[Edit]

Ok here we go =)

>>> import imaplib, email
>>> mail = imaplib.IMAP4_SSL('imap.gmail.com')
>>> mail.login('emailaddr@gmail.com', 'password')
('OK', ['emailaddr@gmail.com Inget Namn authenticated (Success)'])
>>> mail.select('inbox')
('OK', ['14'])
>>> result, data = mail.uid('search', None, 'ALL')
>>> uids=data[0].split()
>>> result, data = mail.uid('fetch', uids[-1], 'BODYSTRUCTURE')
>>> print data
['14 (UID 340 BODYSTRUCTURE ((("TEXT" "PLAIN" ("CHARSET" "ISO-8859-1") NIL NIL "7BIT" 17 1 NIL NIL NIL)("TEXT" "HTML" ("CHARSET" "ISO-8859-1") NIL NIL "7BIT" 17 1 NIL NIL NIL) "ALTERNATIVE" ("BOUNDARY" "20cf3071d16a5a877b04d0adcc43") NIL NIL)("APPLICATION" "PDF" ("NAME" "attiny40.pdf") NIL NIL "BASE64" 8429956 NIL ("ATTACHMENT" ("FILENAME" "attiny40.pdf")) NIL) "MIXED" ("BOUNDARY" "20cf3071d16a5a878104d0adcc45") NIL NIL))']
>>>

The attachment for this message is called "attiny40.pdf" and you can clearly see that name in the BODYSTRUCTURE. All that is left is parsing that BODYSTRUCTURE.

The code is pretty much taken straight from the last link below.

[/Edit]

You will need to change the parameter for fetch from RFC822 to BODYSTRUCTURE.

Then interpret the response as described here, for example.

For example, a two part message consisting of a text and a BASE64-encoded text attachment can have a body structure of:

(("TEXT" "PLAIN" ("CHARSET" "US-ASCII") NIL NIL "7BIT" 1152 23)("TEXT" "PLAIN" ("CHARSET" "US-ASCII" "NAME" "cc.diff") "960723163407.20117h@cac.washington.edu" "Compiler diff" "BASE64" 4554 73) "MIXED")

See also this post and this one. The last link looks like pretty much what you are trying to do.

Licensed under: CC-BY-SA with attribution