Question

I am using Beautiful Soup to extract 'content' from web pages. I know some people have asked this question before, and they were all pointed to Beautiful Soup; that's how I got started with it.

I was able to get most of the content successfully, but I am running into some challenges with tags that are part of the content. (I am starting with a basic strategy: if a node contains more than x characters, it is content.) Let's take the HTML code below as an example:

<div id="abc">
    some long text goes <a href="/"> here </a> and hopefully it 
    will get picked up by the parser as content
</div>

results = soup.findAll(text=lambda x: len(x) > 20)

When I use the above code to get at the long text, it breaks at the tags (the identified text starts from 'and hopefully...'). So I tried to replace the tag with plain text as follows:

anchors = soup.findAll('a')

for a in anchors:
  a.replaceWith('plain text')

The above does not work because Beautiful Soup inserts the replacement as a separate NavigableString node, which causes the same problem when I use findAll with len(x) > 20. I could use regular expressions to treat the HTML as plain text first, clear out all the unwanted tags, and then call Beautiful Soup. But I would like to avoid processing the same content twice -- I am trying to parse these pages so I can show a snippet of content for a given link (very much like Facebook Share) -- and if everything is done with Beautiful Soup, I presume it will be faster.
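To make the problem concrete, here is a small snippet (just the sample above plus a dump of the parse tree) showing how the text splits into separate nodes around the tag:

from BeautifulSoup import BeautifulSoup

ht = '''
<div id="abc">
    some long text goes <a href="/"> here </a> and hopefully it 
    will get picked up by the parser as content
</div>
'''
soup = BeautifulSoup(ht)

# The div holds two NavigableStrings with the <a> Tag between them,
# so no single text node carries the whole sentence.
for node in soup.find('div').contents:
    print type(node).__name__, repr(node)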

So my question: is there a way to 'clear tags' and replace them with 'plain text' using Beautiful Soup? If not, what would be the best way to do so?

Thanks for your suggestions!

Update: Alex's code worked very well for the sample example. I also tried various edge cases and they all worked fine (with the modification below). So I gave it a shot on a real-life website and ran into issues that puzzle me.

import urllib
from BeautifulSoup import BeautifulSoup

page = urllib.urlopen('http://www.engadget.com/2010/01/12/kingston-ssdnow-v-dips-to-30gb-size-lower-price/')
soup = BeautifulSoup(page)

anchors = soup.findAll('a')
i = 0
for a in anchors:
    print str(i) + ":" + str(a)
    for a in anchors:
        if (a.string is None): a.string = ''
        if (a.previousSibling is None and a.nextSibling is None):
            a.previousSibling = a.string
        elif (a.previousSibling is None and a.nextSibling is not None):
            a.nextSibling.replaceWith(a.string + a.nextSibling)
        elif (a.previousSibling is not None and a.nextSibling is None):
            a.previousSibling.replaceWith(a.previousSibling + a.string)
        else:
            a.previousSibling.replaceWith(a.previousSibling + a.string + a.nextSibling)
            a.nextSibling.extract()
    i = i+1

When I run the above code, I get the following error:

0:<a href="http://www.switched.com/category/ces-2010">Stay up to date with 
Switched's CES 2010 coverage</a>
Traceback (most recent call last):
  File "parselink.py", line 44, in <module>
  a.previousSibling.replaceWith(a.previousSibling + a.string + a.nextSibling)
 TypeError: unsupported operand type(s) for +: 'Tag' and 'NavigableString'

When I look at the HTML code, 'Stay up to date...' does not have any previous sibling (I did not know how previousSibling worked until I saw Alex's code, and based on my testing it looks for 'text' before the tag). So, if there is no previous sibling, I am surprised that it is not going through the if logic of a.previousSibling is None and a.nextSibling is None.
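For what it's worth, here is a quick diagnostic I can append to the script above (not part of it) to dump the node type on each side of every anchor; the TypeError suggests previousSibling can be a whole Tag rather than text:

# The traceback shows a Tag being added to a NavigableString, so
# previousSibling is apparently a Tag (e.g. an <img> or <div>) --
# the if tests above only check for None, never for Tag siblings.
for a in soup.findAll('a'):
    print type(a.previousSibling).__name__, '|', type(a.nextSibling).__name__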

Could you please let me know what I am doing wrong?

-ecognium

Solution

An approach that works for your specific example is:

from BeautifulSoup import BeautifulSoup

ht = '''
<div id="abc">
    some long text goes <a href="/"> here </a> and hopefully it 
    will get picked up by the parser as content
</div>
'''
soup = BeautifulSoup(ht)

anchors = soup.findAll('a')
for a in anchors:
  a.previousSibling.replaceWith(a.previousSibling + a.string)

results = soup.findAll(text=lambda x: len(x) > 20)

print results

which emits

$ python bs.py
[u'\n    some long text goes  here ', u' and hopefully it \n    will get picked up by the parser as content\n']

Of course, you'll probably need to take a bit more care, i.e., what if there's no a.string, or if a.previousSibling is None -- you'll need suitable if statements to handle such corner cases. But I hope this general idea can help you. (In fact, you may want to also merge the next sibling if it's a string -- I'm not sure how that plays with your len(x) > 20 heuristic, but say you have two 9-character strings with an <a> containing a 5-character string in the middle; perhaps you'd want to pick up the lot as one 23-character string? I can't tell, because I don't understand the motivation for your heuristic.)
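For instance, those guards might look like this rough sketch (BeautifulSoup 3, and only exercised against the sample HTML above, so treat it as a starting point rather than a finished routine):

from BeautifulSoup import BeautifulSoup, NavigableString

def merge_anchor_text(soup):
    # Replace each <a> with its text, folding plain-string neighbors
    # into the same NavigableString; Tag siblings are left untouched.
    for a in soup.findAll('a'):
        text = a.string or u''  # a.string is None if the <a> has child tags
        prev, nxt = a.previousSibling, a.nextSibling
        if isinstance(prev, NavigableString):
            text = prev + text
            prev.extract()
        if isinstance(nxt, NavigableString):
            text = text + nxt
            nxt.extract()
        a.replaceWith(text)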

I imagine that besides <a> tags you'll also want to remove others, such as <b> or <strong>, maybe <p> and/or <br>, etc. I guess this, too, depends on what the actual idea behind your heuristics is!
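If so, the same idea generalizes; a minimal sketch (the tag list here is purely illustrative) might be:

def flatten_inline_tags(soup, tag_names):
    # Replace each listed tag with its flattened text content, e.g.
    # flatten_inline_tags(soup, ['a', 'b', 'strong']).
    for tag in soup.findAll(tag_names):
        tag.replaceWith(u''.join(tag.findAll(text=True)))

Note that the replacement remains a separate text node, so for the len(x) > 20 heuristic you would still want to merge adjacent strings as in the sketch above.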

OTHER TIPS

I once tried to flatten tags in a document in that way, so that a tag's entire content would be pulled up into its parent node in place (I wanted to reduce a p tag together with all the sub-paragraphs, lists, div and span elements inside it, but get rid of the style and font tags and some horrible word-to-html generator remnants). I found it rather complicated to do with BeautifulSoup itself, since extract() also removes the content and replaceWith() unfortunately doesn't accept None as an argument. After some wild recursion experiments, I finally decided to use regular expressions either before or after processing the document with BeautifulSoup, with the following method:

import re

def flatten_tags(s, tags):
    # Strip opening and closing occurrences of the given tag name(s),
    # attributes and all, directly from the markup string.
    pattern = re.compile(r"<(( )*|/?)(%s)(([^<>]*=\\\".*\\\")*|[^<>]*)/?>"
                         % (isinstance(tags, basestring) and tags or "|".join(tags)))
    return pattern.sub("", s)

The tags argument is either a single tag or a list of tags to be flattened.
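For example, against the question's sample markup (expected output shown as a comment; note the pattern matches the tag name as a prefix, so flattening 'a' would also catch, say, <abbr>):

ht = 'some long text goes <a href="/"> here </a> and hopefully it will get picked up'
print flatten_tags(ht, 'a')
# some long text goes  here  and hopefully it will get picked up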

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow