Python email quoted-printable encoding problem

https://stackoverflow.com/questions/4040074

27-09-2019

Question

I am extracting emails from Gmail using the following:

import email
import imaplib
import sys

def getMsgs():
  try:
    conn = imaplib.IMAP4_SSL("imap.gmail.com", 993)
  except:
    print 'Failed to connect'
    print 'Is your internet connection working?'
    sys.exit()
  try:
    conn.login(username, password)
  except:
    print 'Failed to login'
    print 'Is the username and password correct?'
    sys.exit()

  conn.select('Inbox')
  # typ, data = conn.search(None, '(UNSEEN SUBJECT "%s")' % subject)
  typ, data = conn.search(None, '(SUBJECT "%s")' % subject)
  for num in data[0].split():
    typ, data = conn.fetch(num, '(RFC822)')
    msg = email.message_from_string(data[0][1])
    yield walkMsg(msg)

def walkMsg(msg):
  for part in msg.walk():
    if part.get_content_type() != "text/plain":
      continue
    return part.get_payload()

However, for some of the emails I get, it is nearly impossible to extract dates (using regex) because encoding-related characters such as '=' land at random in the middle of various text fields. Here is an example where this happens in a date range I want to extract:

Name: Kirsti Email: kirsti@blah.blah Phone #: + 999 99995192 Total in party: 4 total, 0 children Arrival/Departure: Oct 9= , 2010 - Oct 13, 2010 - Oct 13, 2010

Is there a way to remove these encoding characters?

Solution

You can (and should) use the email.parser module to decode mail messages, for example (a quick and dirty example!):

from email.parser import FeedParser
f = FeedParser()
f.feed("<insert mail message here, including all headers>")
rootMessage = f.close()

# Now you can access the message and its submessages (if it's multipart)
print(rootMessage.is_multipart())

# Or check for errors
print(rootMessage.defects)

# If it's a multipart message, you can get the first submessage and then its payload
# (i.e. content) like so:
rootMessage.get_payload(0).get_payload(decode=True)

With the "decode" parameter of Message.get_payload, the module automatically decodes the content according to its transfer encoding (for example, quoted-printable, as in your question).
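As a minimal sketch of what decode=True does under Python 3: the raw message below is made up for illustration (its headers and body are assumptions, not the asker's real mail). The "=" at the end of a line is a quoted-printable soft line break, which is what shows up as a stray '=' in the undecoded payload.

```python
import email

# Hypothetical raw message; "=\n" is a quoted-printable soft line break.
raw = (
    "Subject: Booking\n"
    'Content-Type: text/plain; charset="utf-8"\n'
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "Arrival/Departure: Oct 9=\n"
    ", 2010 - Oct 13, 2010\n"
)

msg = email.message_from_string(raw)

# Without decode=True the payload still carries the encoding artifacts.
undecoded = msg.get_payload()

# With decode=True the transfer encoding is removed and bytes are returned,
# so the charset still has to be applied to obtain text.
decoded_bytes = msg.get_payload(decode=True)
charset = msg.get_content_charset() or "utf-8"
text = decoded_bytes.decode(charset)

print(repr(undecoded))
print(repr(text))
```

Once the text is decoded, a date regex can be run against it without the soft line break getting in the way.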

Other tips

This is known as quoted-printable encoding. You probably want to use something like quopri.decodestring - http://docs.python.org/library/quopri.html
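A short sketch of that approach, assuming you already have the raw body bytes and know that the charset is UTF-8 (quopri only undoes the transfer encoding; it knows nothing about headers or charsets). The sample bytes are invented for illustration.

```python
import quopri

# Hypothetical quoted-printable body: "=C3=A9" is the UTF-8 encoding of "é",
# and "=\n" is a soft line break inserted by the sending mail client.
encoded = b"Arrival/Departure: Oct 9=\n, 2010 - Oct 13, 2010 (d=C3=A9jeuner included)"

# decodestring takes and returns bytes; decode to str with the assumed charset.
decoded = quopri.decodestring(encoded).decode("utf-8")
print(decoded)
```

This is handy for a one-off string, but parsing the whole message with the email package is usually the safer route, because it reads the transfer encoding and charset from the headers instead of assuming them.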

If you are using Python 3.6 or later, you can use the email.message.EmailMessage.get_content() method to decode the text automatically. This method supersedes get_payload(), although get_payload() is still available.

Say you have a string s containing this email message (based on the examples in the docs):

Subject: Ayons asperges pour le =?utf-8?q?d=C3=A9jeuner?=
From: =?utf-8?q?Pep=C3=A9?= Le Pew <pepe@example.com>
To: Penelope Pussycat <penelope@example.com>,
 Fabrette Pussycat <fabrette@example.com>
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0

    Salut!

    Cela ressemble =C3=A0 un excellent recipie[1] d=C3=A9jeuner.

    [1] http://www.yummly.com/recipe/Roasted-Asparagus-Epicurious-203718

    --Pep=C3=A9
   =20

The non-ASCII characters in the string have been encoded with the quoted-printable encoding, as specified in the Content-Transfer-Encoding header.

Create an email object:

import email
from email import policy

msg = email.message_from_string(s, policy=policy.default)

Specifying the policy is required here; otherwise policy.compat32 is used, which returns a legacy Message instance that does not have the get_content method. policy.default will eventually become the default policy, but as of Python 3.7 the default is still policy.compat32.
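The difference between the two policies can be checked directly. This is a small sketch with a made-up one-line message; the variable names legacy and modern are only illustrative.

```python
import email
from email import policy

# Hypothetical minimal message used only to compare the two policies.
raw = 'Subject: test\nContent-Type: text/plain; charset="utf-8"\n\nhello\n'

legacy = email.message_from_string(raw)                         # policy.compat32
modern = email.message_from_string(raw, policy=policy.default)  # modern API

# The legacy object lacks get_content(); the modern one provides it.
print(type(legacy).__name__, hasattr(legacy, "get_content"))
print(type(modern).__name__, hasattr(modern, "get_content"))
```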

The get_content() method handles decoding automatically:

print(msg.get_content())

Salut!

Cela ressemble à un excellent recipie[1] déjeuner.

[1] http://www.yummly.com/recipe/Roasted-Asparagus-Epicurious-203718

--Pepé

If you have a multipart message, get_content() needs to be called on the individual parts, like this:

for part in msg.iter_parts():
    print(part.get_content())
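For a runnable illustration of the multipart case, the sketch below first builds a hypothetical multipart/alternative message (the subject and body text are invented), serializes it, and parses it back with policy.default. It also uses get_body(), which picks a preferred part without looping by hand.

```python
import email
from email import policy
from email.message import EmailMessage

# Build a hypothetical multipart/alternative message to have something to parse.
original = EmailMessage()
original["Subject"] = "Lunch"
original.set_content("Cela ressemble à un excellent déjeuner.\n")
original.add_alternative(
    "<p>Cela ressemble à un excellent déjeuner.</p>\n", subtype="html"
)

# Round-trip through bytes, as if the message had been fetched over IMAP.
parsed = email.message_from_bytes(original.as_bytes(), policy=policy.default)

# Each direct sub-part is decoded on its own.
part_types = []
for part in parsed.iter_parts():
    part_types.append(part.get_content_type())
    print(part.get_content_type(), repr(part.get_content()))

# Alternatively, ask for the best text/plain candidate directly.
plain = parsed.get_body(preferencelist=("plain",))
plain_text = plain.get_content()
```

Note that iter_parts() only yields the direct children; for deeply nested messages, walk() visits every part, and get_content() should then be called only on the non-multipart ones.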

Licensed under: CC-BY-SA with attribution
