Question

I'm not sure whether the issue I'm facing is a bug or a gap in my own knowledge, so I would like to ask for your assistance.

When I try to parse this URL (http://ies.ieee-ies.org/resources/media/publications/TIEpub/1988_2013.htm) with PyQuery, it apparently loads only the title, and the rest of the body is ignored:

>>> import urllib2
>>> from pyquery import PyQuery as pq

>>> response = urllib2.urlopen('http://ies.ieee-ies.org/resources/media/publications/TIEpub/1988_2013.htm').read() # 9MB page
>>> len(response)
9835026
>>> dom = pq(response)
>>> dom.html()
u'<head><title>IEEE Transactions on Industrial Electronics</title></head><body><h1 align="center">&#13;\n   <img border="0" src="ieeelogo.gif"/><font color="#FF6600">\xa0IEEE Transactions on Industrial Electronics\xa0&#13;\n   <img border="0" src="ieslogo.gif"/></font>&#13;\n   </h1><h2 align="center">&#13;\n   Volume 35, \xa0Number 1, Feb 1988 \xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0&#13;\n   <a href="http://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=41"><font size="4">Access to the journal on IEEE XPLORE</font></a><font size="4"> </font>\xa0\xa0\xa0&#13;\n   <a href="http://tie.ieee-ies.org/"><font size="3">IE Transactions Home Page</font></a><font size="4"> </font> &#13;\n   </h2><hr/><br/><br/></body>'

Is there a size limit for HTML parsing in PyQuery that I'm not aware of?

PS: I have a workaround using different pages that lead to the same content, but I would like to know the reason for this behavior.


Solution

I'm pretty sure that the problem is not the size, but that the HTML of this page is very broken. For instance, it contains more than 2000 <html> tags (a valid document has exactly one), which is consistent with your output stopping after the first heading block. I'm surprised that a browser can make any sense of it at all, but the Mozilla devs have a lot of experience recovering from that kind of markup, and I imagine that the PyQuery devs, though undoubtedly diligent, have much less. If you can get the content from a different page, then by all means do that, especially if that page is less broken.
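If you want to check a page for this kind of breakage yourself, here is a minimal sketch that counts opening tags with the standard library's lenient `html.parser`. It is written for Python 3 (your session uses Python 2), and `TagCounter`, `count_start_tags`, and the `sample` string are names and data I made up for illustration; they are not part of PyQuery, and the sample is a tiny stand-in rather than the real page.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count opening tags by name, tolerating malformed markup."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this hook.
        self.counts[tag] += 1


def count_start_tags(markup):
    """Return a Counter of opening-tag names found in markup."""
    counter = TagCounter()
    counter.feed(markup)
    counter.close()
    return counter.counts


# Hypothetical miniature of a page built from concatenated documents.
sample = (
    "<html><head><title>Issue 1</title></head><body><p>a</p></body></html>"
    "<html><head><title>Issue 2</title></head><body><p>b</p></body></html>"
)
counts = count_start_tags(sample)
print(counts["html"], counts["body"])
```

Any count above one for `html` or `body` suggests several documents were concatenated into one file, in which case a strict tree builder may keep only part of the content. To run it on the real page, decode the downloaded bytes to text first and pass that to `count_start_tags`.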

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow