Extracting HTML data fields with Python

https://stackoverflow.com/questions/15541085

29-03-2022

Question

Please forgive my lack of knowledge, but given HTML in the following format, what is the best way to extract the individual data fields? Please keep in mind that, more often than not, some or all of them will be NULL, in which case we'll keep them as NULL.

<div class="profile-section" id="a-bit-more-about">
    <dl>
        <dt>Name:</dt>
        <dd><span class="given-name">Clem</span> <span class="family-name">Kadiddlehopper</span></dd>
    </dl>
    <!-- <span class="RealName">/ <span class="fn n"><span class="given-name">Clem</span> <span class="family-name">Kadiddlehopper</span></span></span> -->
    <dl>
        <dt>Joined:</dt>
        <dd>September 1910</dd>
    </dl>
    <div class="sep"></div>
    <dl>
        <dt>Hometown:</dt>
        <dd>Quiet Rest Maximum Security Twilight Home</dd>
    </dl>
    <dl>
        <dt>Currently:</dt>
        <dd><span class="adr"><span class="locality">They won't tell me</span>, <span class="country-name">Zimbobwe</span></span></dd>
    </dl>
    <div class="sep"></div>

Solution 2

Use a third-party module such as Beautiful Soup or lxml, or the built-in html.parser module. For example:

from bs4 import BeautifulSoup

# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup('<html><body><a>bbb</a></body></html>', 'html.parser')
print(soup.find('a'))  # <a>bbb</a>

Or, if you like, you can use a regex when the target is small and the markup is predictable.
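
As a rough illustration of the built-in html.parser option mentioned in this answer, the sketch below collects the <dt>/<dd> pairs from the question's markup into a dict. The names ProfileParser and extract_fields, and the trimmed SAMPLE markup (including the empty Hometown value), are assumptions made up for this example and are not part of the original answer.

```python
from html.parser import HTMLParser

# Trimmed copy of the question's markup; the empty Hometown <dd> is an
# assumption added here to show how a missing value comes out as None.
SAMPLE = """
<div class="profile-section" id="a-bit-more-about">
    <dl>
        <dt>Name:</dt>
        <dd><span class="given-name">Clem</span> <span class="family-name">Kadiddlehopper</span></dd>
    </dl>
    <dl>
        <dt>Joined:</dt>
        <dd>September 1910</dd>
    </dl>
    <dl>
        <dt>Hometown:</dt>
        <dd></dd>
    </dl>
</div>
"""


class ProfileParser(HTMLParser):
    """Hypothetical helper: collects <dt>/<dd> pairs into self.fields."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._open = None    # "dt" or "dd" while inside one, else None
        self._label = None   # text of the most recent <dt>
        self._chunks = []    # text pieces seen inside the open tag

    def handle_starttag(self, tag, attrs):
        if tag in ("dt", "dd"):
            self._open = tag
            self._chunks = []

    def handle_data(self, data):
        if self._open:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag != self._open:
            return  # e.g. a </span> nested inside the <dd>
        # Join the pieces and collapse runs of whitespace
        text = " ".join("".join(self._chunks).split())
        if tag == "dt":
            self._label = text.rstrip(":")
        elif self._label is not None:
            self.fields[self._label] = text or None  # empty value -> None
            self._label = None
        self._open = None


def extract_fields(html):
    parser = ProfileParser()
    parser.feed(html)
    parser.close()
    return parser.fields


fields = extract_fields(SAMPLE)
print(fields)
print(fields.get("Currently"))  # field absent from SAMPLE, so .get() gives None
```

Looking fields up with .get() means a label that is absent from the page and a label with an empty value both come back as None, which matches the "keep them at NULL" requirement in the question.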

OTHER TIPS

You want an HTML parser. I recommend Beautiful Soup or lxml.
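
To make the Beautiful Soup suggestion concrete, here is a minimal sketch applied to a trimmed copy of the question's markup. The helper names clean_text and extract_profile, the PROFILE_HTML constant, and the choice of the "html.parser" backend are assumptions for illustration; lxml could be swapped in as the backend if it is installed.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (Joined and Hometown left out
# on purpose to show how missing fields behave).
PROFILE_HTML = """
<div class="profile-section" id="a-bit-more-about">
    <dl>
        <dt>Name:</dt>
        <dd><span class="given-name">Clem</span> <span class="family-name">Kadiddlehopper</span></dd>
    </dl>
    <dl>
        <dt>Currently:</dt>
        <dd><span class="adr"><span class="locality">They won't tell me</span>, <span class="country-name">Zimbobwe</span></span></dd>
    </dl>
</div>
"""


def clean_text(node):
    """Hypothetical helper: node text with whitespace collapsed, or None."""
    if node is None:
        return None
    return " ".join(node.get_text().split()) or None


def extract_profile(html):
    """Map each <dt> label (without its colon) to the text of its <dd>."""
    soup = BeautifulSoup(html, "html.parser")
    profile = {}
    for dt in soup.find_all("dt"):
        label = dt.get_text(strip=True).rstrip(":")
        profile[label] = clean_text(dt.find_next_sibling("dd"))
    return profile


profile = extract_profile(PROFILE_HTML)
print(profile)

# Individual sub-fields can also be looked up by class name;
# find() returns None when nothing matches, so missing fields stay None.
soup = BeautifulSoup(PROFILE_HTML, "html.parser")
given_name = clean_text(soup.find("span", class_="given-name"))
region = clean_text(soup.find("span", class_="region"))  # not in the markup
print(given_name, region)
```

Because find() and find_next_sibling() return None when nothing matches, routing every lookup through one small helper keeps the NULL handling in a single place.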

Licensed under: CC-BY-SA with attribution
