Вопрос

Please forgive me for my lack of knowledge, but given HTML in the following format, what is the best way to extract the individual data fields? Please keep in mind that more often than not some, or all, of them will be NULL in which case we'll keep them at NULL.

<div class="profile-section" id="a-bit-more-about">
                            <dl>
            <dt>Name:</dt>
            <dd><span class="given-name">Clem</span> <span class="family-name">Kadiddlehopper</span></dd>
        </dl>
        <!-- <span class="RealName">/ <span class="fn n"><span class="given-name">Clem</span> <span class="family-name">Kadiddlehopper</span></span></span> -->
                        <dl>
        <dt>Joined:</dt>
        <dd>September 1910</dd>
    </dl>
    <div class="sep"></div>
    <dl>
        <dt>Hometown:</dt>
        <dd>Quiet Rest Maximum Security Twilight Home</dd>
    </dl>
    <dl>
        <dt>Currently:</dt>
        <dd><span class="adr"><span class="locality">They won't tell me</span>, <span class="country-name">Zimbobwe</span></span></dd>
    </dl>
    <div class="sep"></div>
Это было полезно?

Решение 2

Use third-party modules beautiful soup, lxml or built-in module html.parser. For example:

from bs4 import BeautifulSoup
soup = BeautifulSoup('<html><body><a>bbb</a></body></html')
soup.find('a')

Or if like, you can use regex for small target.

Другие советы

You want an HTML parser. I recommend beautiful soup or lxml.

Лицензировано под: CC-BY-SA с атрибуция
Не связан с StackOverflow
scroll top