Question

While trying to address this issue, I'm trying to wrap my head around the various functions in the Python standard library aimed at supporting RFC 2231. The main aim of that RFC appears to be three-fold: allowing non-ASCII encoding in header parameters, noting the language of a given value, and allowing header parameters to span multiple lines. The email.utils module provides several functions to deal with various aspects of this. As far as I can tell, they work as follows:

decode_rfc2231 only splits the value of such a parameter into its parts, like this:

>>> email.utils.decode_rfc2231("utf-8''T%C3%A4st.txt")
['utf-8', '', 'T%C3%A4st.txt']

decode_params takes care of detecting RFC 2231-encoded parameters. It collects parts that belong together and also decodes the URL-encoded string to a byte sequence. This byte sequence, however, is then turned into a string using latin-1 rather than the declared charset. And all values are enclosed in quotation marks. Furthermore, there is some special handling for the first argument, which still has to be a tuple of two elements, but those two get passed to the result without modification.

>>> email.utils.decode_params([
...   (1,2),
...   ("foo","bar"),
...   ("name*","utf-8''T%C3%A4st.txt"),
...   ("baz*0","two"),("baz*1","-part")])
[(1, 2), ('foo', '"bar"'), ('baz', '"two-part"'), ('name', ('utf-8', '', '"Täst.txt"'))]

collapse_rfc2231_value can be used to convert this triple of encoding, language and byte sequence into a proper unicode string. What has me confused, though, is the fact that if the input was such a triple, then the quotes will be carried over to the output. If, on the other hand, the input was a single quoted string, then these quotes will be removed.

>>> [(k, email.utils.collapse_rfc2231_value(v)) for k, v in
...  email.utils.decode_params([
...   (1,2),
...   ("foo","bar"),
...   ("name*","utf-8''T%C3%A4st.txt"),
...   ("baz*0","two"),("baz*1","-part")])[1:]]
[('foo', 'bar'), ('baz', 'two-part'), ('name', '"Täst.txt"')]

So it seems that in order to use all this machinery, I'd have to add yet another step to unquote the third element of any tuple I'd encounter. Is this true, or am I missing some point here? I had to figure out a lot of the above with help from the source code, since the docs are a bit vague on the details. I cannot imagine what could be the point behind this selective unquoting. Is there a point to it?

What is the best reference on how to use these functions?

The best I have found so far is the email.message.Message implementation. There, the process seems to be roughly the one outlined above, but every field gets unquoted via _unquotevalue after the decode_params call, and only get_filename and get_boundary collapse their values; all others return a tuple instead. I hope there is something more useful.
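The decode, unquote and collapse steps described above can be sketched as follows. This is a minimal Python 3 sketch, not a documented recipe: the helper name decode_rfc2231_params and the sample parameter list are made up for illustration, and the unquoting of the third tuple element mirrors what Message appears to do internally.

```python
import email.utils


def decode_rfc2231_params(params):
    """Return {name: text} for a list of (name, value) pairs.

    Hypothetical helper. As with decode_params, the first pair is
    treated as the main header value and is not decoded.
    """
    result = {}
    for name, value in email.utils.decode_params(params)[1:]:
        if isinstance(value, tuple):
            # Extended parameter: strip the quotes from the third element
            # before collapsing, otherwise they end up in the output.
            charset, language, text = value
            value = (charset, language, email.utils.unquote(text))
        result[name] = email.utils.collapse_rfc2231_value(value)
    return result


params = [
    ("attachment", ""),
    ("foo", "bar"),
    ("name*", "utf-8''T%C3%A4st.txt"),
    ("baz*0", "two"), ("baz*1", "-part"),
]
print(decode_rfc2231_params(params))
```

With this extra step, plain and extended parameters both come out as unquoted strings.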

Solution

Currently, the functions from email.utils are rarely used outside email.message. Most users seem to prefer using email.message.Message directly. There's even a somewhat old issue report on adding unit tests (which would certainly be usable as examples) to Python, though I'm not sure how it relates to email.utils.
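For comparison, here is a minimal sketch of going through email.message.Message instead, assuming Python 3. The Content-Disposition header is a made-up example. get_filename() handles decoding, unquoting and collapsing itself, while get_param() hands back the uncollapsed triple.

```python
import email.message
import email.utils

msg = email.message.Message()
# Hypothetical header, used only for illustration
msg["Content-Disposition"] = "attachment; filename*=utf-8''T%C3%A4st.txt"

# get_filename() collapses the RFC 2231 value to a plain string
filename = msg.get_filename()
print(filename)

# get_param() returns a (charset, language, value) triple for extended
# parameters, so the caller still has to collapse it
raw = msg.get_param("filename", header="content-disposition")
collapsed = email.utils.collapse_rfc2231_value(raw)
print(collapsed)
```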

A short example I found is this blog post, which, however, contains no more than one sentence and a few lines of code about RFC 2231 parsing. The author does note that many MTAs use RFC 2047 instead. Depending on your use case, that might also be an issue.
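If RFC 2047 encoded-words do turn up, the standard library handles them separately in email.header. A minimal Python 3 sketch, with a made-up encoded-word as input:

```python
from email.header import decode_header, make_header

# Hypothetical RFC 2047 encoded-word, as it might appear in a header value
encoded = "=?utf-8?q?T=C3=A4st.txt?="

# decode_header() yields (bytes, charset) pairs; make_header() joins them
decoded = str(make_header(decode_header(encoded)))
print(decoded)
```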

Judging from the few examples I could find, I assume your way of parsing using email.utils is the only way to go, even if the long list comprehension is somewhat ugly.

Given the lack of examples, it could be wise to write a new RFC 2231 parser (if you really need a better, maybe faster or more elegant, codebase). A new implementation could be based on existing implementations like the Dovecot RFC 2231 parser for compatibility reasons (you could even use the Dovecot unit test). The C code seems quite complex to me, and I can't find any Python implementation besides email.utils and Python 2 backports of email.utils, so porting it to Python won't be easy. Note also that Dovecot is LGPL-licensed, which might be an issue in your project.

I think the email.utils RFC 2231 API was not designed for easy standalone usage but rather as a pile of utility functions for use in email.message.Message.

OTHER TIPS

Old question, but I could not find a complete, working answer to this. So this is what I ended up doing (on Python 2.7):

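import email.utils
import urllib  # Python 2; on Python 3 unquote lives in urllib.parse

def decode_rfc2231_header(header):
    """Decode an RFC 2231 header value to a unicode string."""
    # Remove any surrounding quotes
    header = email.utils.unquote(header)
    # Split into charset, language and percent-encoded value
    encoding, language, value = email.utils.decode_rfc2231(header)
    # Percent-decode to the raw bytes (a str on Python 2)
    value = urllib.unquote(value)
    # Decode the bytes using the declared charset
    return email.utils.collapse_rfc2231_value((encoding, language, value))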

For example:

>>> name = u'èéêëēėęûüùúūàáâäæãåāāîïíīįì test ôöòóœøōõssśšłžźżçćčñń'
>>> encoded_header = email.utils.encode_rfc2231(name.encode("utf8"), 'utf8', 'en')
>>> print encoded_header 
utf8'en'%C3%A8%C3%A9%C3%AA%C3%AB%C4%93%C4%97%C4%99%C3%BB%C3%BC%C3%B9%C3%BA%C5%AB%C3%A0%C3%A1%C3%A2%C3%A4%C3%A6%C3%A3%C3%A5%C4%81%C4%81%C3%AE%C3%AF%C3%AD%C4%AB%C4%AF%C3%AC%20test%20%C3%B4%C3%B6%C3%B2%C3%B3%C5%93%C3%B8%C5%8D%C3%B5ss%C5%9B%C5%A1%C5%82%C5%BE%C5%BA%C5%BC%C3%A7%C4%87%C4%8D%C3%B1%C5%84
>>> print decode_rfc2231_header(encoded_header)
èéêëēėęûüùúūàáâäæãåāāîïíīįì test ôöòóœøōõssśšłžźżçćčñń
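On Python 3 the same idea needs a small change, because collapse_rfc2231_value expects a str whose characters stand for the raw octets. The following is a sketch of such a port, not a tested drop-in replacement. The name decode_rfc2231_header_py3 is made up, and percent-decoding with latin-1 is an assumption modelled on what decode_params does.

```python
import email.utils
import urllib.parse


def decode_rfc2231_header_py3(header):
    """Hypothetical Python 3 port of decode_rfc2231_header."""
    # Remove any surrounding quotes
    header = email.utils.unquote(header)
    charset, language, value = email.utils.decode_rfc2231(header)
    # Percent-decode as latin-1 so every octet maps to exactly one character
    value = urllib.parse.unquote(value, encoding="latin-1")
    # Reinterpret those characters as bytes in the declared charset
    return email.utils.collapse_rfc2231_value((charset, language, value))


encoded = email.utils.encode_rfc2231("T\u00e4st.txt", "utf-8", "en")
print(encoded)
print(decode_rfc2231_header_py3(encoded))
```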
Licensed under: CC-BY-SA with attribution