Question

I am trying to replace some elements (class: method) in a long HTML page using .replaceWith. To do that, I iterate over .descendants and check whether each dl element is the one I am looking for. But this only works for up to two adjacent elements: every third to n-th element in a row is "ignored". Running the same code twice results in 4 replaced dl elements in a row, and so on.

for elem in matches:
    for child in elem.descendants:
        if not isinstance(child, NavigableString) and child.dl is not None and 'method' in child.dl.get('class'):
            text = "<p>***removed something here***</p>"
            child.dl.replaceWith(BeautifulSoup(text))

The (very silly) workaround is to find the maximum number of dl elements in a row, divide it by two, and run the code that many times. I would like a smart (and fast) solution and, even more importantly, to understand what is going wrong here.

EDIT: The HTML page for testing is https://docs.python.org/3/library/stdtypes.html and the error can be seen in chapter 4.7.1, String Methods (a lot of methods are listed there).

EDIT_2: I do not use the whole HTML page, only parts of it. The HTML parts are stored in a list, and I only want a dl element to be "removed" if it is not the first HTML element of its part (so I want to keep the element if it is the head).

Altogether, this is what my code currently looks like:

from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup(open("/home/sven/Bachelorarbeit/python-doc-extractor-for-cado/extractor-application/index.html"))
f = open('test.html', 'w')    # mode 'w' creates the file or truncates an existing one
matches=[]

dl_elems = soup.find_all(['dl'], attrs={'class': ['class', 'method','function','describe', 'classmethod', 'staticmethod']})   # grab all possible dl-elements

sections = soup.find_all(['div'], attrs = {'class':'section'})   #grab all section-elements

matches = dl_elems + sections   #merge the lists to get all results

for elem in matches:
    for child in elem.descendants:
        if not isinstance(child, NavigableString) and child.dl is not None and 'method' in child.dl.get('class'):
            text = "<p>***removed something here***</p>"
            child.dl.replaceWith(BeautifulSoup(text))


print(matches,file=f)
f.close()

Solution

The idea is to find all dl elements that have class="method" and replace each of them with a p tag:

from urllib.request import urlopen
from bs4 import BeautifulSoup, Tag

# get the html (Python 3, matching the print(..., file=f) call in the question)
url = "https://docs.python.org/3/library/stdtypes.html"
soup = BeautifulSoup(urlopen(url), 'html.parser')

# replace all `dl` elements with `method` class
for elem in soup('dl', class_='method'):
    tag = Tag(name='p')
    tag.string = '***removed something here***'
    elem.replace_with(tag)

print(soup.prettify())
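
To try the approach without downloading the page, here is a minimal sketch on a small inline document. The HTML string and the names `remaining_methods` and `placeholders` are made up for illustration; `html.parser` is assumed as the parser. The point it shows is that `find_all` collects its matches into a list before the loop starts, so replacing elements inside the loop does not disturb the iteration.

```python
from bs4 import BeautifulSoup

# Hypothetical test document: three adjacent method entries and one function entry.
html = """
<div class="section">
  <dl class="method"><dt>a</dt></dl>
  <dl class="method"><dt>b</dt></dl>
  <dl class="method"><dt>c</dt></dl>
  <dl class="function"><dt>d</dt></dl>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all builds the full result list first, so it is safe to
# modify the tree while looping over that list.
for elem in soup.find_all("dl", class_="method"):
    tag = soup.new_tag("p")
    tag.string = "***removed something here***"
    elem.replace_with(tag)

remaining_methods = soup.find_all("dl", class_="method")
placeholders = soup.find_all("p")
print(len(remaining_methods), len(placeholders))
```

All three adjacent `dl` elements with class `method` are replaced in a single pass, and the `dl` with class `function` is left alone.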

UPD (adapted to the question edit):

dl_elems = soup.find_all(['dl'], attrs={'class': ['class', 'method','function','describe', 'classmethod', 'staticmethod']})   # grab all possible dl-elements
sections = soup.find_all(['div'], attrs={'class': 'section'})   #grab all section-elements

for parent in dl_elems + sections:
    for elem in parent.find_all('dl', {'class': 'method'}):
        tag = Tag(name='p')
        tag.string = '***removed something here***'
        elem.replace_with(tag)

print(dl_elems + sections)
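
As for what goes wrong in the original loop: it modifies the tree while `.descendants` is still walking over it, and `child.dl` only ever returns the first `dl` below `child`, so a run of adjacent elements is not fully covered in one pass. The exact traversal inside BeautifulSoup is not reproduced here; the sketch below is only an analogy with a plain Python list (the names `items`, `buggy` and `safe` are illustrative) showing how changing a container during iteration makes the loop skip neighbours, and how iterating over a snapshot avoids that.

```python
items = ["dl", "dl", "dl", "dl", "p"]

# Buggy: removing while iterating shifts the remaining items left,
# so the loop skips the element right after each removal.
buggy = list(items)
for item in buggy:
    if item == "dl":
        buggy.remove(item)

# Safe: iterate over a snapshot (a copy) and modify the original.
safe = list(items)
for item in list(safe):
    if item == "dl":
        safe.remove(item)

print(buggy)
print(safe)
```

The buggy loop leaves some "dl" entries behind, and running it again removes more of them, much like the repeated runs described in the question. `find_all` plays the role of the snapshot in the solution above. Because `find_all` searches only the descendants of `parent`, a `dl` that is itself the head of a part is kept, as requested in the question edit.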
Licensed under: CC-BY-SA with attribution