Python email quoted-printable encoding problem

https://stackoverflow.com/questions/4040074

27-09-2019

Question

I am using the following to pull emails from Gmail:

import email
import imaplib
import sys

def getMsgs():
  try:
    conn = imaplib.IMAP4_SSL("imap.gmail.com", 993)
  except:
    print 'Failed to connect'
    print 'Is your internet connection working?'
    sys.exit()
  try:
    conn.login(username, password)
  except:
    print 'Failed to login'
    print 'Is the username and password correct?'
    sys.exit()

  conn.select('Inbox')
  # typ, data = conn.search(None, '(UNSEEN SUBJECT "%s")' % subject)
  typ, data = conn.search(None, '(SUBJECT "%s")' % subject)
  for num in data[0].split():
    typ, data = conn.fetch(num, '(RFC822)')
    msg = email.message_from_string(data[0][1])
    yield walkMsg(msg)

def walkMsg(msg):
  for part in msg.walk():
    if part.get_content_type() != "text/plain":
      continue
    return part.get_payload()

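However, for some of the emails I receive it is nearly impossible to extract dates (using a regex), because encoding-related characters such as '=' land at random in the middle of various text fields. Here is an example where this happens inside a date range I want to extract: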

Name: kirsti Email: kirsti@blah.blah Phone #: + 999 99995192 Total in party: 4 total, 0 children Arrival/Departure: Oct 9, 2010 - Oct 13, 2010, Oct 13, 2010 - Oct 13, 2010, Oct 13, 2010

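Is there a way to remove these encoding characters?

Solution

You could/should use the email.parser module to decode mail messages, for example (quick and dirty example!):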

from email.parser import FeedParser
f = FeedParser()
f.feed("<insert mail message here, including all headers>")
rootMessage = f.close()

# Now you can access the message and its submessages (if it's multipart)
print(rootMessage.is_multipart())

# Or check for errors
print(rootMessage.defects)

# If it's a multipart message, you can get the first submessage and then its payload
# (i.e. content) like so:
rootMessage.get_payload(0).get_payload(decode=True)

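With the decode parameter of Message.get_payload, the module decodes the content automatically according to its Content-Transfer-Encoding (for example quoted-printable, as in your question).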
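As a minimal sketch of that decoding step in Python 3: the message below is a made-up sample (its headers and body are assumptions, not the asker's real mail) in which a soft line break, "=" at the end of a line, splits a date. Note that in Python 3 get_payload(decode=True) returns bytes, so the declared charset still has to be applied afterwards.

```python
from email.parser import BytesFeedParser

# Hypothetical quoted-printable message; "=\r\n" is a soft line break
# that the sending mail software inserted in the middle of a date.
raw = (
    b"Subject: Booking\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: text/plain; charset="utf-8"\r\n'
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"Arrival/Departure: Oct 9=\r\n"
    b", 2010 - Oct 13, 2010\r\n"
)

parser = BytesFeedParser()
parser.feed(raw)
message = parser.close()

# decode=True undoes the Content-Transfer-Encoding and returns bytes.
payload_bytes = message.get_payload(decode=True)

# Turn the bytes into text using the charset declared in Content-Type.
charset = message.get_content_charset() or "ascii"
text = payload_bytes.decode(charset, errors="replace")
print(text)
```

In the question's walkMsg, the equivalent change would be to return part.get_payload(decode=True) instead of part.get_payload().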

Other tips

This is called quoted-printable encoding. You probably want to use something like quopri.decodestring - http://docs.python.org/library/quopri.html
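A short sketch of the quopri approach, assuming the body is already available as bytes and is encoded as UTF-8. The sample text and the date regex are illustrative assumptions, not taken from the asker's mail.

```python
import quopri
import re

# Hypothetical quoted-printable body: "=\n" is a soft line break,
# "=C3=A0" and "=C3=A9" are the UTF-8 bytes of "à" and "é".
encoded = (
    b"Arrival/Departure: Oct 9=\n"
    b", 2010 - Oct 13, 2010\n"
    b"Cela ressemble =C3=A0 un d=C3=A9jeuner\n"
)

# decodestring works on bytes and returns bytes; decode to get text.
decoded = quopri.decodestring(encoded).decode("utf-8")

# With the soft line break removed, a simple date regex matches again.
dates = re.findall(r"[A-Z][a-z]{2} \d{1,2}, \d{4}", decoded)
print(dates)
```

Unlike get_payload(decode=True), this does not look at the message headers, so it only fits when you already know the part is quoted-printable.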

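If you are using Python 3.6 or later, you can use the email.message.EmailMessage.get_content() method to decode the text automatically. This method supersedes get_payload(), although get_payload() is still available.

Say you have a string s containing this email (based on the examples in the docs):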

Subject: Ayons asperges pour le =?utf-8?q?d=C3=A9jeuner?=
From: =?utf-8?q?Pep=C3=A9?= Le Pew <pepe@example.com>
To: Penelope Pussycat <penelope@example.com>,
 Fabrette Pussycat <fabrette@example.com>
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0

    Salut!

    Cela ressemble =C3=A0 un excellent recipie[1] d=C3=A9jeuner.

    [1] http://www.yummly.com/recipe/Roasted-Asparagus-Epicurious-203718

    --Pep=C3=A9
   =20

The non-ASCII characters in the string have been encoded with the quoted-printable encoding, as declared in the Content-Transfer-Encoding header.

Create an email object:

import email
from email import policy

msg = email.message_from_string(s, policy=policy.default)

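Specifying the policy is required here; otherwise policy.compat32 is used, which returns a legacy Message instance that has no get_content method. policy.default will eventually become the default policy, but as of Python 3.7 the default is still policy.compat32.

The get_content() method handles the decoding automatically: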
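To illustrate the difference between the two policies, here is a small sketch that parses the same text twice. The sample message and the variable names legacy and modern are assumptions made up for this illustration.

```python
import email
from email import policy
from email.message import EmailMessage, Message

# Hypothetical minimal quoted-printable message.
sample = (
    'Content-Type: text/plain; charset="utf-8"\n'
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "d=C3=A9jeuner\n"
)

# No policy argument: policy.compat32 applies and a legacy Message is built.
legacy = email.message_from_string(sample)

# policy.default: the parser builds an EmailMessage, which has get_content().
modern = email.message_from_string(sample, policy=policy.default)

print(type(legacy).__name__, hasattr(legacy, "get_content"))
print(type(modern).__name__, hasattr(modern, "get_content"))
print(legacy.get_payload())   # still quoted-printable encoded
print(modern.get_content())   # decoded text
```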

print(msg.get_content())

Salut!

Cela ressemble à un excellent recipie[1] déjeuner.

[1] http://www.yummly.com/recipe/Roasted-Asparagus-Epicurious-203718

--Pepé

If you have a multipart message, get_content() needs to be called on the individual parts, for example:

for part in msg.iter_parts():
    print(part.get_content())
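When only the plain-text part matters, as in the question, EmailMessage.get_body() may be a convenient alternative to looping over the parts. The sketch below builds a made-up multipart message first so that there is something to parse; the message contents and variable names are assumptions for illustration only.

```python
import email
from email import policy
from email.message import EmailMessage

# Build a hypothetical multipart/alternative message to parse.
outgoing = EmailMessage()
outgoing["Subject"] = "Booking"
outgoing.set_content(
    "Arrival/Departure: Oct 9, 2010 - Oct 13, 2010\nGuest: Pepé\n",
    cte="quoted-printable",  # force the encoding discussed here
)
outgoing.add_alternative("<p>Guest: Pep&eacute;</p>", subtype="html")
raw_bytes = outgoing.as_bytes()

# Parse it back the way mail fetched over IMAP would be parsed (bytes in).
msg = email.message_from_bytes(raw_bytes, policy=policy.default)

# get_body() picks the best matching part; here only text/plain is accepted.
body = msg.get_body(preferencelist=("plain",))
plain_text = body.get_content() if body is not None else ""
print(plain_text)
```

get_body() returns None when no part matches the preference list, hence the guard before calling get_content().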

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow