How to parse HTML from eMail body - Python

https://stackoverflow.com/questions/17641490

03-06-2022

Question

I'm trying to parse incoming emails in Python. The emails are part plain text and part HTML. I want to extract the HTML part and find a table in it.

I tried using BeautifulSoup, but with the code below, bs only picks up the first "" part and not the whole HTML part:

import imaplib
from bs4 import BeautifulSoup as bs

# connect to the Gmail IMAP server (user and pwd hold the account credentials)
m = imaplib.IMAP4_SSL("imap.gmail.com")
m.login(user, pwd)
# use m.list() to get all the mailboxes; "INBOX" selects only the inbox
m.select("INBOX")
# you could filter using the IMAP search rules here (check http://www.example-code.com/csharp/imap-search-critera.asp)
resp, items = m.search(None, '(UNSEEN)')
items = items[0].split()  # the message IDs

for emailid in items:
    # fetch the mail content
    resp, data = m.fetch(emailid, '(UID BODY[TEXT])')
    text = str(data[0][1])
    soup = bs(text)

How can I use 'bs' on the entire HTML part? Or is there another way to parse an HTML table out of the email body?

'bs' seems like the best fit for me, because I want to find a specific HTML body that contains a specific keyword, and a 'bs' search can retrieve the entire table and let me iterate over it.
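To illustrate the kind of lookup described here, a minimal sketch follows. The helper names (find_table_with_keyword, table_rows) and the sample HTML are made up for the example, and the parser argument is simply Python's built-in "html.parser" so the sketch needs nothing beyond bs4 itself.

```python
from bs4 import BeautifulSoup


def find_table_with_keyword(html, keyword):
    """Return the first <table> whose text contains keyword, or None."""
    soup = BeautifulSoup(html, "html.parser")
    for table in soup.find_all("table"):
        if keyword in table.get_text():
            return table
    return None


def table_rows(table):
    """Return the table as a list of rows, each a list of cell strings."""
    return [
        [cell.get_text(strip=True) for cell in row.find_all(["td", "th"])]
        for row in table.find_all("tr")
    ]


# Hypothetical HTML standing in for the HTML part of an email.
sample_html = """
<html><body>
<table><tr><td>Unrelated</td></tr></table>
<table>
  <tr><th>Item</th><th>Price</th></tr>
  <tr><td>Widget</td><td>10</td></tr>
</table>
</body></html>
"""

table = find_table_with_keyword(sample_html, "Price")
rows = table_rows(table)
for row in rows:
    print(row)
```

Note that find_all("tr") also descends into nested tables, so emails that nest tables for layout would need a stricter selection.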

Solution

Apparently, I was using the wrong parser.

Once I switched to the 'lxml' parser, it worked just fine.

The line that needs to change is:

soup = bs(text, "lxml")
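As a complementary sketch (not part of the original answer), the HTML part can also be isolated before it reaches the parser, using the standard library's email package on Python 3. The function name extract_html and the sample message below are invented for illustration; the idea is that the MIME boundaries and transfer encoding are decoded first, so bs only ever sees HTML.

```python
import email
from email import policy


def extract_html(raw_bytes):
    """Return the decoded text/html part of a raw message, or None if absent."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    # get_body() looks for the best-matching body part; here only HTML is accepted.
    part = msg.get_body(preferencelist=("html",))
    return part.get_content() if part is not None else None


# Hypothetical multipart message: a plain-text part plus a
# quoted-printable encoded HTML part ("=3D" decodes to "=").
raw = (
    b"MIME-Version: 1.0\r\n"
    b"Subject: demo\r\n"
    b'Content-Type: multipart/alternative; boundary="BOUND"\r\n'
    b"\r\n"
    b"--BOUND\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"\r\n"
    b"plain text version\r\n"
    b"--BOUND\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"<html><body><table><tr><td>a=3Db</td></tr></table></body></html>\r\n"
    b"--BOUND--\r\n"
)

html = extract_html(raw)
print(html)
```

In the loop from the question, this would presumably mean fetching the whole message with '(RFC822)' instead of '(UID BODY[TEXT])', passing data[0][1] to extract_html, and then calling bs(html, "lxml") on the result.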

Licensed under: CC-BY-SA with attribution
