Question

I'm parsing a document that contains a list of div tags, but it sometimes also has bare text inline between them. I need to extract the contents of both, in document order.

Say I have the following:

<div>
<div>1</div>
<div>2</div>
3
<div>4</div>
</div>

I need to extract all the text above so it reads 1234.

I have the following code, which gets all the div tags but doesn't pick up the bare text between them.

from ghost import Ghost
from BeautifulSoup import BeautifulSoup

def tagfilter(tag):
    return tag.name == 'div'

ghost = Ghost()
ghost.open("testpage.html")

page, resources = ghost.wait_for_page_loaded()

soup = BeautifulSoup(ghost.content)
maindiv = soup.find('div', {'id': 'parentdiv'})
outtext = ''
for s in maindiv.findAll(tagfilter):
    outtext += s.text
print outtext

Solution

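Use stripped_strings (or strings if you need to keep the whitespace). Both are BeautifulSoup 4 features, so import with "from bs4 import BeautifulSoup" rather than from the older BeautifulSoup 3 package used in the question: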

In [16]: soup = BeautifulSoup('''<div>
<div>1</div>
<div>2</div>
3
<div>4</div>
</div>''')


In [19]: list(soup.stripped_strings)
Out[19]: [u'1', u'2', u'3', u'4']


In [20]: ''.join(soup.stripped_strings)
Out[20]: u'1234'

http://www.crummy.com/software/BeautifulSoup/bs4/doc/#strings-and-stripped-strings
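
Applied to the code in the question, this might look like the sketch below. It assumes Python 3 with the bs4 package installed, and it assumes the outer div carries id="parentdiv" (the question's code looks for that id, but the sample markup doesn't show it). The page is given as a literal string here instead of being loaded through Ghost.

```python
from bs4 import BeautifulSoup

# Sample markup from the question; the id on the outer div is an assumption.
html = """<div id="parentdiv">
<div>1</div>
<div>2</div>
3
<div>4</div>
</div>"""

soup = BeautifulSoup(html, "html.parser")
maindiv = soup.find("div", {"id": "parentdiv"})

# .strings yields every text node under maindiv in document order,
# including the whitespace-only nodes between tags.
raw = list(maindiv.strings)

# .stripped_strings trims each text node and skips the ones that are
# empty after trimming, so the bare "3" is kept alongside the div contents.
outtext = "".join(maindiv.stripped_strings)
print(outtext)  # 1234
```

Calling stripped_strings on maindiv rather than on soup limits the result to text inside that one element, which matters once the real page has other content around it.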

Licensed under: CC-BY-SA with attribution