Extracting HTML data fields with Python

https://stackoverflow.com/questions/15541085

29-03-2022
|

문제

Please forgive me for my lack of knowledge, but given HTML in the following format, what is the best way to extract the individual data fields? Please keep in mind that more often than not some, or all, of them will be NULL in which case we'll keep them at NULL.

<div class="profile-section" id="a-bit-more-about">
                            <dl>
            <dt>Name:</dt>
            <dd><span class="given-name">Clem</span> <span class="family-name">Kadiddlehopper</span></dd>
        </dl>
        <!-- <span class="RealName">/ <span class="fn n"><span class="given-name">Clem</span> <span class="family-name">Kadiddlehopper</span></span></span> -->
                        <dl>
        <dt>Joined:</dt>
        <dd>September 1910</dd>
    </dl>
    <div class="sep"></div>
    <dl>
        <dt>Hometown:</dt>
        <dd>Quiet Rest Maximum Security Twilight Home</dd>
    </dl>
    <dl>
        <dt>Currently:</dt>
        <dd><span class="adr"><span class="locality">They won't tell me</span>, <span class="country-name">Zimbobwe</span></span></dd>
    </dl>
    <div class="sep"></div>

해결책 2

Use third-party modules beautiful soup, lxml or built-in module html.parser. For example:

from bs4 import BeautifulSoup
soup = BeautifulSoup('<html><body><a>bbb</a></body></html')
soup.find('a')

Or if like, you can use regex for small target.

다른 팁

You want an HTML parser. I recommend beautiful soup or lxml.

라이센스 : CC-BY-SA ~와 함께 속성

제휴하지 않습니다 StackOverflow