Can I combine two 'findAll' search blocks in beautifulsoup, into one?
22-07-2019
Question
Can I combine these two blocks into one:
Edit: I'm looking for a method other than concatenating the loops, as Yacoby did in his answer.
for tag in soup.findAll(['script', 'form']):
    tag.extract()
for tag in soup.findAll(id="footer"):
    tag.extract()
Also, can I combine multiple blocks like these into one:
for tag in soup.findAll(id="footer"):
    tag.extract()
for tag in soup.findAll(id="content"):
    tag.extract()
for tag in soup.findAll(id="links"):
    tag.extract()
Or maybe there is some lambda expression where I can check whether the id is in a list, or some other simpler method.
Also, how do I find tags by their class attribute, given that class is a reserved keyword in Python:
EDIT: this part is solved by soup.findAll(attrs={'class': 'noprint'}).
# This does not work: class is a reserved word, so it is a SyntaxError
for tag in soup.findAll(class="noprint"):
    tag.extract()
Solution
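You can pass a function to .findAll(), like this (tag.get('id') is used instead of tag['id'] so that tags without an id do not raise a KeyError):
soup.findAll(lambda tag: tag.name in ['script', 'form'] or tag.get('id') == "footer")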
But you might be better off first building a list of tags and then iterating over it:
tags = soup.findAll(['script', 'form'])
tags.extend(soup.findAll(id="footer"))
for tag in tags:
    tag.extract()
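As a rough, self-contained sketch of the function-filter idea, here is a version that assumes the newer bs4 package (where findAll is spelled find_all) and the built-in html.parser. The sample HTML and the helper name unwanted are made up for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, only for illustration.
html = """
<html><body>
<script>var x = 1;</script>
<form action="/search"><input name="q"></form>
<div id="content"><p>Keep me</p></div>
<div id="footer">Footer text</div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")


def unwanted(tag):
    # tag.get() returns None for tags without an id, so no KeyError is raised.
    return tag.name in ("script", "form") or tag.get("id") == "footer"


# find_all() returns a list-like ResultSet, so it is safe to remove
# tags from the tree while looping over the result.
for tag in soup.find_all(unwanted):
    tag.extract()

# Names of the tags still present, in document order.
remaining = [t.name for t in soup.find_all(True)]
print(remaining)
```

A named function is used here instead of a lambda only to leave room for the comment; a lambda with the same body should behave the same way.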
If you want to filter for several ids, you can use (tag.get('id') returns None when the attribute is missing, so no separate has_key check is needed):
for tag in soup.findAll(lambda tag: tag.get('id') in ['footer', 'content', 'links']):
    tag.extract()
A more specific approach would be to assign a lambda to the id parameter:
for tag in soup.findAll(id=lambda value: value in ['footer', 'content', 'links']):
    tag.extract()
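If you are on the newer bs4 package rather than BeautifulSoup 3, two shorter spellings may be worth trying. This is a sketch under that assumption: as far as I know, attribute filters accept a list of values, and select() accepts a comma-separated CSS selector group. The sample HTML and the name UNWANTED_IDS are made up for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, only for illustration.
html = """
<div id="header">Header</div>
<div id="content">Content</div>
<div id="links">Links</div>
<div id="footer">Footer</div>
"""

UNWANTED_IDS = ["footer", "content", "links"]

# Variant 1: pass a list as the id filter; a tag matches if its id
# equals any item in the list.
soup = BeautifulSoup(html, "html.parser")
for tag in soup.find_all(id=UNWANTED_IDS):
    tag.extract()
ids_after_list_filter = [t["id"] for t in soup.find_all(id=True)]

# Variant 2: a CSS selector group; each "#name" selects by id.
soup2 = BeautifulSoup(html, "html.parser")
for tag in soup2.select("#footer, #content, #links"):
    tag.extract()
ids_after_select = [t["id"] for t in soup2.find_all(id=True)]

print(ids_after_list_filter, ids_after_select)
```

The selector form also lets you mix tag names and ids in one call, for example "script, form, #footer", which covers the first part of the question.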
OTHER TIPS
I don't know if BeautifulSoup can do it more elegantly, but you could merge the two loops like so:
for tag in soup.findAll(['script', 'form']) + soup.findAll(id="footer"):
    tag.extract()
You can find classes like so (Documentation):
for tag in soup.findAll(attrs={'class': 'noprint'}):
    tag.extract()
The answer to the second part of your question is right there in the documentation:
Searching by CSS class
The attrs argument would be a pretty obscure feature were it not for one thing: CSS. It's very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, class, is also a Python reserved word.
You could search by CSS class with soup.find("tagName", { "class" : "cssClass" }), but that's a lot of code for such a common operation. Instead, you can pass a string for attrs instead of a dictionary. The string will be used to restrict the CSS class.
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup("""Bob's <b>Bold</b> Barbeque Sauce now available in <b class="hickory">Hickory</b> and <b class="lime">Lime</a>""")
soup.find("b", { "class" : "lime" })
# <b class="lime">Lime</b>
soup.find("b", "hickory")
# <b class="hickory">Hickory</b>
With bs4, you can pass class_ to find_all to filter on class values, e.g. links = soup.find_all('a', class_='external'):
from bs4 import BeautifulSoup
from urllib.request import urlopen
with urlopen('http://www.espncricinfo.com/') as f:
    raw_data = f.read()
soup = BeautifulSoup(raw_data, 'lxml')
# print(soup)
links = soup.find_all('a', class_='external')
for link in links:
    print(link)
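To try the class_ filter without a network request, a small offline sketch might look like the following. It assumes bs4 with the built-in html.parser, and the sample HTML is made up. It also shows that, as far as I can tell, a tag with several classes matches if any one of them equals the filter value, and that attrs={'class': ...} gives the same result:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, only for illustration.
html = """
<a class="external" href="https://example.com/a">A</a>
<a class="external highlight" href="https://example.com/b">B</a>
<a class="internal" href="/c">C</a>
<p class="external">Not a link</p>
"""

soup = BeautifulSoup(html, "html.parser")

# class_ has a trailing underscore because class is a Python keyword.
by_keyword = soup.find_all("a", class_="external")

# The dictionary form works as well.
by_attrs = soup.find_all("a", attrs={"class": "external"})

hrefs_keyword = [a["href"] for a in by_keyword]
hrefs_attrs = [a["href"] for a in by_attrs]
print(hrefs_keyword)
```

The p tag is skipped because the first argument restricts the search to a tags.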