Question

I'm building a Django app and I'm using Spynner for web crawling. I have this problem and I hope someone can help me.

I have this function in the module "crawler.py":

import spynner 

def crawling_js(url):
    br = spynner.Browser()
    br.load(url)
    text_page = br.html
    br.close  # (*)
    return text_page

(*) I also tried br.close().

In another module (e.g. "import.py"), I call the function like this:

from crawler import crawling_js    

l_url = ["https://www.google.com/", "https://www.tripadvisor.com/", ...]

for url in l_url:
    mytextpage = crawling_js(url)
    # ... parse mytextpage ...

When I pass the first URL to the function, everything works. When I pass the second URL, Python crashes at this line: br.load(url). Can anyone help me? Thanks a lot.

I have: Django 1.3, Python 2.7, Spynner 1.1.0, PyQt4 4.9.1

Solution

Why instantiate br = spynner.Browser() and close it every time you call crawling_js()? In a loop this uses a lot of resources, which I think is the reason it crashes. Think of it this way: br is a browser instance, so it can browse any number of websites without being closed and reopened. Adjust your code this way:

import spynner

br = spynner.Browser()  # open it only once

def crawling_js(url):
    br.load(url)
    text_page = br._get_html()  # _get_html() to make sure you get the updated HTML
    return text_page

Then, if you still want to close br afterwards, you can simply do:

from crawler import crawling_js, br

l_url = ["https://www.google.com/", "https://www.tripadvisor.com/", ...]

for url in l_url:
    mytextpage = crawling_js(url)
    # ... parse mytextpage ...

br.close()
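
If you would rather not keep a module-level br, the same idea (one browser for the whole list, closed once at the end) can be sketched as a single function. This is only a sketch under assumptions: crawl_all and StubBrowser are hypothetical names, and StubBrowser is a stand-in that mimics only the load(), html and close() members used above so the example runs without Qt or Spynner installed.

```python
class StubBrowser(object):
    """Hypothetical stand-in for spynner.Browser (assumption, not the real class)."""
    instances = 0  # counts how many browsers were created

    def __init__(self):
        StubBrowser.instances += 1
        self.html = u""
        self.closed = False

    def load(self, url):
        # A real browser would fetch and render the page here.
        self.html = u"<html><body>%s</body></html>" % url

    def close(self):
        self.closed = True


def crawl_all(urls, browser_factory=StubBrowser):
    """Load every URL with one browser and return a list of (url, html) pairs."""
    br = browser_factory()  # created once, outside the loop
    pages = []
    try:
        for url in urls:
            br.load(url)
            pages.append((url, br.html))
    finally:
        br.close()  # runs once, even if a load raises
    return pages


pages = crawl_all(["https://www.google.com/", "https://www.tripadvisor.com/"])
```

With Spynner installed, you would presumably pass spynner.Browser as browser_factory instead of the stub. Also note that close needs the parentheses: a bare br.close only references the method and never calls it.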
Licensed under: CC-BY-SA with attribution