Question

I have a web page saved as a .htm file. Essentially, there are 6 layers of divs that I need to parse to get specific data out of, and I'm quite confused as to how to approach this. I've tried different techniques but nothing is working.

The HTM file has a bunch of tags but there is a div that looks like this:

<div id="fbbuzzresult" class.....>
   <div class="postbuzz">
      <div class="linkbuzz">...</div>
      <div class="descriptionbuzz">...</div>
      <div class="metabuzz">
         <div class="time">...</div>
      </div>
   </div>
   <div class="postbuzz"> .... </div>
   <div class="postbuzz"> .... </div>
   <div class="postbuzz"> .... </div>
</div>

I'm currently attempting BeautifulSoup. Some more context...

  1. There is only ONE fbbuzzresult in the entire file.
  2. There are multiple postbuzz divs within the fbbuzzresult
  3. There are divs as shown above, within the postbuzz

I need to extract and print each of the things shown above within each postbuzz div.

Your help and guidance towards some skeleton code is greatly appreciated! P.S - Ignore the dashes in the div class. Thanks!

Solution

A tag returned by find supports the same search methods as the parent soup, so you should be able to search within your result directly:

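from bs4 import BeautifulSoup as bs  # BeautifulSoup 4; the old `BeautifulSoup` module is version 3
soup = bs(html, "html.parser")  # html: the contents of the .htm file as a string
div = soup.find("div", {"id": "fbbuzzresult"})
post_buzz = div.find_all("div", {"class": "postbuzz"})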

I have run into errors doing it this way before, though, so as a fallback you can re-parse the div's markup into a separate sub_soup:

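from bs4 import BeautifulSoup as bs  # BeautifulSoup 4; the old `BeautifulSoup` module is version 3
soup = bs(html, "html.parser")  # html: the contents of the .htm file as a string
div = soup.find("div", {"id": "fbbuzzresult"})
sub_soup = bs(str(div), "html.parser")  # re-parse only the fbbuzzresult markup
post_buzz = sub_soup.find_all("div", {"class": "postbuzz"})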
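
Neither snippet goes on to pull the inner divs out of each postbuzz, which is what the question asks for. Here is a minimal sketch of that step, assuming BeautifulSoup 4 (bs4) and the nesting shown in the question. SAMPLE_HTML, text_of, and extract_posts are hypothetical names made up for illustration, and the sample text values are invented:

```python
from bs4 import BeautifulSoup

# Hypothetical sample mirroring the structure in the question; the text is made up.
SAMPLE_HTML = """
<div id="fbbuzzresult">
  <div class="postbuzz">
    <div class="linkbuzz">Link A</div>
    <div class="descriptionbuzz">Description A</div>
    <div class="metabuzz"><div class="time">10:00</div></div>
  </div>
  <div class="postbuzz">
    <div class="linkbuzz">Link B</div>
    <div class="descriptionbuzz">Description B</div>
    <div class="metabuzz"><div class="time">11:30</div></div>
  </div>
</div>
"""


def text_of(parent, css_class):
    """Return the stripped text of the first div with css_class under parent, or None."""
    tag = parent.find("div", class_=css_class)
    return tag.get_text(strip=True) if tag is not None else None


def extract_posts(html):
    """Collect the link, description and time text from every postbuzz div."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", id="fbbuzzresult")
    if container is None:
        return []  # no fbbuzzresult div in this document
    posts = []
    for post in container.find_all("div", class_="postbuzz"):
        posts.append({
            "link": text_of(post, "linkbuzz"),
            "description": text_of(post, "descriptionbuzz"),
            "time": text_of(post, "time"),
        })
    return posts


for post in extract_posts(SAMPLE_HTML):
    print(post)
```

For the real page, you would first read the saved .htm file into a string and pass that to extract_posts in place of SAMPLE_HTML. Returning None for a missing inner div is just one possible choice; it keeps one malformed post from stopping the loop.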

Other tips

First, read the BeautifulSoup documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/

Second, here is a small example to get you going:

from bs4 import BeautifulSoup as bs

# your_html_content: the contents of the .htm file as a string
soup = bs(your_html_content, "html.parser")

# for fbbuzzresult (only one exists, so find() returns it directly)
buzz = soup.find("div", {"id": "fbbuzzresult"})

# to get postbuzz
pbuzz = buzz.find_all("div", {"class": "postbuzz"})

# pbuzz is now a list of the postbuzz divs, so you can
# iterate through them, get the contents, keep traversing
# the DOM with BS, or do whatever you are trying to do.
#
# To get the text from an element, the_element.contents[0]
# returns only its first child. Use the_element.get_text()
# to collect the text of all of its descendants.
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow