Question

I am learning Python and BeautifulSoup 4 (BS4), and I am trying to extract frames and iframes from pages like so:

#...snip...
soup_f = soup("frame")
if soup_f is not None:
    for frame in soup_f:
        try:
            t_iFrames_src.append(force_text(soup.frame.extract().get("src"), encoding='utf-8', strings_only=False, errors='strict'))
        except (AttributeError, UnicodeEncodeError):
            pass
        try:
            t_full_frame.append(force_text(soup.frame.extract(), encoding='utf-8', strings_only=False, errors='strict'))
        except (AttributeError, UnicodeEncodeError):
            pass
else:
    pass

The problem is that the first try...except gives me valid results (it fills t_iFrames_src), but for some reason the second try...except gives me nothing, i.e. t_full_frame is empty.

So, when I flip them around like so:

#...snip...
soup_f = soup("frame")
if soup_f is not None:
    for frame in soup_f:
        try:
            t_full_frame.append(force_text(soup.frame.extract(), encoding='utf-8', strings_only=False, errors='strict'))
        except (AttributeError, UnicodeEncodeError):
            pass
        try:
            t_iFrames_src.append(force_text(soup.frame.extract().get("src"), encoding='utf-8', strings_only=False, errors='strict'))
        except (AttributeError, UnicodeEncodeError):
            pass
else:
    pass

Now t_full_frame has results but t_iFrames_src is empty. I am baffled as to why this is so :(

It is probably something very stupid, but I am not able to figure out what is wrong! I would really appreciate it if someone could point me in the right direction.


Solution

When you call soup.tag.extract(), BeautifulSoup removes and returns the first instance of tag from the soup. Observe the following:

from bs4 import BeautifulSoup

soup = BeautifulSoup('''
<frame src='foo'>Spam</frame>
<frame src='bar'>Eggs</frame>
''')
print(soup)

soup.frame.extract()
print(soup)

This gives the following output:

<frame src="foo">Spam</frame>
<frame src="bar">Eggs</frame>


<frame src="bar">Eggs</frame>
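The effect is easiest to see on a page with only one frame. The following is a minimal sketch, not taken from the question: it assumes an iframe tag and an explicit "html.parser" argument, both chosen here only to keep the example predictable.

```python
from bs4 import BeautifulSoup

# Assumption: a page containing a single iframe, parsed with the stdlib parser.
soup = BeautifulSoup('<iframe src="foo"></iframe>', "html.parser")

# The first call removes the only iframe from the soup and returns it.
first = soup.iframe.extract()
print(first.get("src"))

# Nothing is left, so attribute-style lookup now returns None.
print(soup.iframe)

# A second extract() therefore fails with AttributeError,
# which an `except AttributeError: pass` clause would hide.
try:
    soup.iframe.extract()
except AttributeError as exc:
    print("second extract failed:", exc)
```

The first print shows foo and the second shows None; the final call ends up in the except branch.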

I'm guessing this isn't the behavior you want: the first try block kicks the frame out of the soup, so it isn't available to the second try block. If that was the only frame left, soup.frame is now None, the second call raises AttributeError, and your except clause silently swallows it, which is why the second list stays empty. You probably want to keep the soup intact, in which case you shouldn't use .extract(). Replace your calls to soup.frame.extract() with references to frame (the variable in your for loop).

That is, change these lines:

t_iFrames_src.append(force_text(soup.frame.extract().get("src"), encoding='utf-8', strings_only=False, errors='strict'))
t_full_frame.append(force_text(soup.frame.extract(), encoding='utf-8', strings_only=False, errors='strict'))

to these lines:

t_iFrames_src.append(force_text(frame.get("src"), encoding='utf-8', strings_only=False, errors='strict'))
                                ^^^^^
t_full_frame.append(force_text(frame, encoding='utf-8', strings_only=False, errors='strict'))
                               ^^^^^
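
Put together, the loop could look roughly like the sketch below. It is an illustration under a few assumptions rather than a drop-in replacement: the sample HTML is made up, iframe is used instead of frame, "html.parser" is passed explicitly, and plain str() stands in for force_text so the example runs without Django.

```python
from bs4 import BeautifulSoup

# Assumption: made-up sample markup with two iframes.
html = """
<iframe src="foo"></iframe>
<iframe src="bar"></iframe>
"""
soup = BeautifulSoup(html, "html.parser")

t_iFrames_src = []
t_full_frame = []

# soup("iframe") is shorthand for soup.find_all("iframe"); it returns a
# (possibly empty) list, so no None check is needed before looping.
for frame in soup("iframe"):
    src = frame.get("src")
    if src is not None:
        # str() is a stand-in for force_text here (assumption).
        t_iFrames_src.append(str(src))
    t_full_frame.append(str(frame))

print(t_iFrames_src)
print(len(t_full_frame))
```

Because nothing is extracted, both lists are filled from the same tag on each iteration, and the soup still contains both iframes afterwards.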
Licensed under: CC-BY-SA with attribution