findAll() in BeautifulSoup missing nodes

https://stackoverflow.com/questions/17707710

03-06-2022

Question

The findAll() method in BeautifulSoup does not return all of the elements in my XML. If you look at the code below and open the URL, you can see that the XML contains 10 PubmedArticle nodes. However, findAll() finds only 6 of them: the output shows 6 asterisks instead of 10. What am I doing wrong?

import urllib2
from bs4 import BeautifulSoup

URL = 'http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&rettype=abstract&id=23858559,23858558,23858557,23858521,23858508,23858506,23858494,23858473,23858461,23858404'
data = urllib2.urlopen(URL).read()

soup = BeautifulSoup(data)

for x in soup.findAll('pubmedarticle'):
    print '*'

Solution 2

I solved this by passing 'xml' as the parser argument, so the document is parsed as XML rather than HTML. Make sure you have lxml installed.

soup = BeautifulSoup(xmlData, 'xml')
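One thing to watch for after switching parsers: the 'xml' parser keeps tag names case-sensitive, so the lowercase 'pubmedarticle' from the question no longer matches. The following is a minimal sketch in Python 3, assuming bs4 and lxml are installed. The xml_data string is a made-up stand-in for the efetch response, not the real payload, and find_all is the current spelling of findAll.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily shortened stand-in for the efetch response.
xml_data = (
    "<PubmedArticleSet>"
    "<PubmedArticle><MedlineCitation><PMID>1</PMID></MedlineCitation></PubmedArticle>"
    "<PubmedArticle><MedlineCitation><PMID>2</PMID></MedlineCitation></PubmedArticle>"
    "<PubmedArticle><MedlineCitation><PMID>3</PMID></MedlineCitation></PubmedArticle>"
    "</PubmedArticleSet>"
)

# 'xml' selects lxml's XML parser, which preserves the case of tag names.
soup = BeautifulSoup(xml_data, 'xml')

# Search with the exact case used in the document.
articles = soup.find_all('PubmedArticle')

# The lowercase name from the question matches nothing under the XML parser.
lowercase_hits = soup.find_all('pubmedarticle')

print(len(articles), len(lowercase_hits))
```

With the real response, the same find_all('PubmedArticle') call would be the one to count.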

OTHER TIPS

Edit: I've discovered that findAll() searches relative to the current node; you can choose the starting node through soup, as in soup.pubmedarticleset.

The elements in the provided XML are named "PubmedArticle", so try the following:

for x in soup.pubmedarticleset.findAll('pubmedarticle'):
    print '*'
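To illustrate what "relative to the current node" means, here is a small sketch in Python 3. It assumes bs4 is installed and uses Python's built-in 'html.parser', which lowercases tag names as in the question's code. The markup string is a made-up miniature document, not the real efetch response.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature document with two articles.
markup = (
    "<PubmedArticleSet>"
    "<PubmedArticle><PMID>1</PMID></PubmedArticle>"
    "<PubmedArticle><PMID>2</PMID></PubmedArticle>"
    "</PubmedArticleSet>"
)

# An HTML parser lowercases tag names, hence the lowercase lookups below.
soup = BeautifulSoup(markup, 'html.parser')

# Searching from the top of the tree and from the set element.
from_root = soup.find_all('pubmedarticle')
from_set = soup.pubmedarticleset.find_all('pubmedarticle')

# find_all only looks at descendants of the node it is called on.
pmids_in_first = from_set[0].find_all('pmid')
pmids_everywhere = soup.find_all('pmid')

print(len(from_root), len(from_set), len(pmids_in_first), len(pmids_everywhere))
```

In this intact toy tree both starting points find the same articles. The starting node matters only when the elements you want are not descendants of the node you call find_all on.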

Licensed under: CC-BY-SA with attribution
