Question

So I have a webpage that has a large list of links. They are all contained inside <li> tags.

The <li> tags are inside an <ol> tag, which is inside a <div>, and so on, like this:

html --> body --> table --> tbody --> tr --> td --> table --> tbody --> tr --> td --> div --> ol

And then the <li> tags are inside the <ol>.

How can I use lxml in Python to print the <li> tags' html as text?


Solution

Using BeautifulSoup (which can use lxml as its underlying parser):

import bs4

text = """<html>
 <body>
  <table>
   <tbody>
    <tr>
     <td>
      <table>
       <tbody>
        <tr>
         <td>
          <div>
           <ol>
            <li>
             <a href="test.html" title="test title">Link Text</a>
             <a href="test2.html" title="test title 2">Link2 Text</a>
            </li>
           </ol>
          </div>
         </td>
        </tr>
       </tbody>
      </table>
     </td>
    </tr>
   </tbody>
  </table>
 </body>
</html>"""

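# Name the parser explicitly; omitting it triggers a warning in current bs4 releases.
soup = bs4.BeautifulSoup(text, "html.parser")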

listitems = soup.select("table > tbody > tr > td > table > tbody > tr > td > div > ol > li")
tags = [tag for tag in listitems[0] if isinstance(tag,bs4.element.Tag)]
for tag in tags:
    print(tag)

# OUTPUT
# <a href="test.html" title="test title">Link Text</a>
# <a href="test2.html" title="test title 2">Link2 Text</a>
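The code above prints the tags inside the first <li>. The question asks for the <li> elements' own HTML, so here is a minimal sketch of that variant with bs4. The shortened sample markup and the names li_html and inner_html are made up for illustration:

```python
import bs4

# Hypothetical sample markup, shortened from the example above.
html = """<div><ol>
<li><a href="test.html" title="test title">Link Text</a></li>
<li><a href="test2.html" title="test title 2">Link2 Text</a></li>
</ol></div>"""

soup = bs4.BeautifulSoup(html, "html.parser")

# str(tag) serializes the tag itself plus everything inside it.
li_html = [str(li) for li in soup.select("div > ol > li")]
for item in li_html:
    print(item)

# decode_contents() returns only the inner HTML, without the <li> wrapper.
inner_html = [li.decode_contents() for li in soup.select("div > ol > li")]
```

Use str(li) when the surrounding <li> tag should be part of the output, and decode_contents() when only the markup inside it is wanted.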

OTHER TIPS

The code below should do it with lxml. BeautifulSoup is probably the better choice, though, because it tends to cope with malformed HTML more gracefully.

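import lxml.etree as etree

# Use the HTML parser; the default XML parser rejects most real-world HTML.
parser = etree.HTMLParser()
tree = etree.parse("test.html", parser)
for li in tree.iterfind(".//td/div/ol/li"):
    # li[0] is the first child element of the <li>; pass li itself to print the whole item.
    # encoding="unicode" makes tostring() return str rather than bytes on Python 3.
    print(etree.tostring(li[0], encoding="unicode", method="html"))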

For a BeautifulSoup version, see Adam's solution.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow