Question

I have a simple Python crawler/spider that searches for a specified text on a site that I provide. On some sites it crawls normally for 2-4 seconds, until an error occurs.

The code so far:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import print_function
import requests, pyquery, urlparse

try:
    range = xrange
except NameError:
    pass

def crawl(seed, depth, terms):

    crawled = set()
    uris = set([seed])
    for level in range(depth):
        new_uris = set()
        for uri in uris:
            if uri in crawled:
                continue
            crawled.add(uri)
            # Get URI contents
            try:
                content = requests.get(uri).content
            except:
                continue
            # Look for the terms
            found = 0
            for term in terms:
                if term in content:
                    found += 1
            if found > 0:
                yield (uri, found, level + 1)
            # Find child URIs, and add them to the new_uris set
            dom = pyquery.PyQuery(content)
            for anchor in dom('a'):
                try:
                    link = anchor.attrib['href']
                except KeyError:
                    continue
                new_uri = urlparse.urljoin(uri, link)
                new_uris.add(new_uri)
        uris = new_uris

if __name__ == '__main__':
    import sys
    if len(sys.argv) < 4:
        print('usage: ' + sys.argv[0] +
              " start_url crawl_depth term1 [term2 [...]]")
        print('       ' + sys.argv[0] +
              " http://yahoo.com 5 cute 'fluffy kitties'")
        raise SystemExit

    seed_uri = sys.argv[1]
    crawl_depth = int(sys.argv[2])
    search_terms = sys.argv[3:]

    for uri, count, depth in crawl(seed_uri, crawl_depth, search_terms):
        print(uri)

Now let's say that I want to find all the pages that have "requireLazy(" in their source. Let's try it with Facebook. If I execute this:

python crawler.py https://www.facebook.com 4 '<script>requireLazy('

It will run fine for 2-4 seconds, and then this error occurs:

https://www.facebook.com
https://www.facebook.com/badges/?ref=pf
https://www.facebook.com/appcenter/category/music/?ref=pf
https://www.facebook.com/legal/terms
https://www.facebook.com/
...

Traceback (most recent call last):
  File "crawler.py", line 61, in <module>
    for uri, count, depth in crawl(seed_uri, crawl_depth, search_terms):
  File "crawler.py", line 38, in crawl
    dom = pyquery.PyQuery(content)
  File "/usr/local/lib/python2.7/dist-packages/pyquery/pyquery.py", line 226, in __init__
    elements = fromstring(context, self.parser)
  File "/usr/local/lib/python2.7/dist-packages/pyquery/pyquery.py", line 70, in fromstring
    result = getattr(lxml.html, meth)(context)
  File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 634, in fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 532, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
  File "lxml.etree.pyx", line 2754, in lxml.etree.fromstring (src/lxml/lxml.etree.c:54631)
  File "parser.pxi", line 1578, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:82748)
  File "parser.pxi", line 1457, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:81546)
  File "parser.pxi", line 965, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:78216)
  File "parser.pxi", line 569, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:74472)
  File "parser.pxi", line 650, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:75363)
  File "parser.pxi", line 599, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:74827)
lxml.etree.XMLSyntaxError: line 21: Tag fb:like invalid

Can anyone help me fix this error? Thanks.

Solution

It seems that the page content you are trying to parse contains some invalid tags. Normally the best you can do is catch and log these kinds of errors and gracefully move on to the next pages.
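As an illustration (a sketch only, and the helper name parse_or_none is made up here, not part of any library), you could isolate the parse step in a small function that logs the failure and lets crawl() skip the page when it returns None:

import logging

import pyquery

def parse_or_none(uri, content):
    """Try to parse a page; return None and log a warning when the markup is rejected."""
    try:
        return pyquery.PyQuery(content)
    except Exception as exc:  # lxml raises XMLSyntaxError for markup it cannot handle
        logging.warning('Could not parse %s: %s', uri, exc)
        return None

Inside crawl() you would then replace dom = pyquery.PyQuery(content) with dom = parse_or_none(uri, content) and continue to the next URI whenever it returns None.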

Alternatively, you could use BeautifulSoup to extract the URLs of the next pages to be crawled; it handles most of the bad content gracefully. You can find more details about BeautifulSoup and how to use it here.
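As a rough sketch (assuming BeautifulSoup 4, installed as beautifulsoup4; extract_links is a made-up helper, not a library function), the link-extraction step could look like this:

import urlparse

from bs4 import BeautifulSoup

def extract_links(uri, content):
    """Return the absolute URLs of all anchors found in content.

    BeautifulSoup tolerates broken markup (such as the unknown <fb:like>
    tag from the traceback), so it should not blow up where lxml's
    stricter parser does.
    """
    soup = BeautifulSoup(content, 'html.parser')
    return set(urlparse.urljoin(uri, a['href'])
               for a in soup.find_all('a', href=True))

In crawl() you would then add the result of extract_links(uri, content) to new_uris instead of iterating over dom('a').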

UPDATE

Actually, after playing around with the crawler, it seems that at some point the page content is empty, so the parser fails to load the document.

I tested the crawler with BeautifulSoup and it works properly. If you need or want it, I can share my updated version with you.

You can easily add a check for empty content, but I'm not sure what other edge cases you might encounter, so switching to BeautifulSoup seems like the safer approach.
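A minimal version of that guard (again just a sketch, reusing the variable names from the question's crawl() loop) would skip the page before it ever reaches the parser:

# Inside crawl()'s per-URI loop, right after fetching the page:
if not content or not content.strip():
    continue  # empty body: nothing to search or parse
dom = pyquery.PyQuery(content)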

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow