Question

Take some rudimentary HTML like this as an example. How could one truncate the tree at a given depth, say two levels deep, by removing every child node below that level?

<html>
<head>
    <title></title>
    <meta />
    <meta />
    <link />
</head>
<body>
    <div>
        <div>
            <a></a>
            <a></a>
            <a></a>
        </div>
        <span>
            <h1>
                <li></li>
                <li></li>
            </h1>
        </span>
    </div>
</body>

would become something like:

<html>
<head>
    <title></title>
    <meta />
    <meta />
    <link />
</head>
<body>
    <div>
        <div></div>
        <span></span>
    </div>
</body>


Solution

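The idea is to iterate over all elements and count each element's parents (its ancestors); any element whose ancestor count equals depth is removed together with everything beneath it: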

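from bs4 import BeautifulSoup


data = """your html goes here"""

# Tags with exactly this many ancestors are removed. The count includes the
# BeautifulSoup object itself, so <html> has 1 ancestor and <body> has 2.
depth = 5
soup = BeautifulSoup(data, "html.parser")
for tag in soup.find_all():
    if len(list(tag.parents)) == depth:
        tag.extract()  # detaches the tag and all of its descendants

print(soup.prettify())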

prints:

<html>
 <head>
  <title>
  </title>
  <meta/>
  <meta/>
  <link/>
 </head>
 <body>
  <div>
   <div></div>
   <span></span>
  </div>
 </body>
</html>
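
The value depth = 5 can look surprising next to the "two levels" in the question. The following is a minimal sketch, assuming the html.parser backend, of how the ancestor count comes about: tag.parents also yields the top-level BeautifulSoup object, so every count is one higher than the number of enclosing tags. The helper name ancestor_count and the sample markup are made up for illustration.

```python
from bs4 import BeautifulSoup

# Small sample document (hypothetical, shaped like the one in the question).
html = "<html><body><div><div><a></a></div></div></body></html>"
soup = BeautifulSoup(html, "html.parser")


def ancestor_count(tag):
    """Return the number of ancestors, including the BeautifulSoup object."""
    return len(list(tag.parents))


# Pair each tag name with its ancestor count, in document order.
counts = [(tag.name, ancestor_count(tag)) for tag in soup.find_all()]
print(counts)
```

With this markup the <a> tags are the first ones to reach a count of 5, which is why they (and, in the original document, the <h1>) are the tags that get extracted.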

Other tips

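Maybe something like this, which empties every tag two levels below <body>:

# Assumes soup is the BeautifulSoup object built in the answer above.
# find_all(recursive=False) returns only direct child tags, skipping the
# whitespace text nodes that .children would also yield.
for child in soup.body.find_all(recursive=False):
    for element in child.find_all(recursive=False):
        element.clear()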
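
The nested loops above hard-code two levels. As a hedged sketch of how the same idea might be generalized to any number of levels, the function below recurses through direct child tags and empties the ones at the requested level. The names truncate_below and levels are hypothetical and not part of Beautiful Soup; the sample markup is a condensed version of the one in the question.

```python
from bs4 import BeautifulSoup


def truncate_below(root, levels):
    """Empty every tag that sits `levels` levels below `root`.

    levels=1 empties root's direct child tags, levels=2 empties their
    child tags, and so on. The emptied tags themselves are kept.
    """
    if levels < 1:
        raise ValueError("levels must be at least 1")
    # recursive=False restricts the search to direct child tags.
    for child in root.find_all(recursive=False):
        if levels == 1:
            child.clear()  # drop all contents, keep the tag itself
        else:
            truncate_below(child, levels - 1)


html = (
    "<body><div>"
    "<div><a></a><a></a></div>"
    "<span><h1><li></li></h1></span>"
    "</div></body>"
)
soup = BeautifulSoup(html, "html.parser")
truncate_below(soup.body, 2)
print(soup)
```

Unlike the ancestor-count approach, the depth here is measured relative to whichever tag is passed in, so the BeautifulSoup object and <html> do not have to be counted.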
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow