Question

I use spynner to scrape data from a site. This is my code:

import spynner

br = spynner.Browser()
br.load("http://www.venere.com/it/hotel/roma/hotel-ferrari/#reviews")
text = br._get_html()

This code fails to load the entire HTML page. This is the HTML that I received:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head>

<script type="text/javascript">(function(){var d=document,m=d.cookie.match(/_abs=(([or])[a-z]*)/i)
v_abs=m?m[1].toUpperCase():'N'
if(m){d.cookie='_abs='+v_abs+'; path=/; domain=.venere.com';if(m[2]=='r')location.reload(true)}
v_abp='--OO--OOO-OO-O'
v_abu=[,,1,1,,,1,1,1,,1,1,,1]})()

My question is: how do I load the complete html?

More information:

I tried with:

import spynner
br = spynner.Browser()
respond = br.load("http://www.venere.com/it/hotel/roma/hotel-ferrari/#reviews")

if respond is None:
    br.wait_load()

but the HTML never loads completely or reliably. What is the problem? I'm going crazy.

One more detail: I'm running this inside Django 1.3. If I run the same code in plain Python (2.7), it sometimes loads all of the HTML.


Solution

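Pass a wait_callback to load() so that the page only counts as loaded once the review markup is present. After you run the code below, check the contents of test.html and you will find the p elements with id="feedback-...somenumber...":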

import spynner

def content_ready(browser):
    # Treat the page as loaded only once the review markup is present.
    return 'id="feedback-' in browser.html

br = spynner.Browser()
br.load("http://www.venere.com/it/hotel/roma/hotel-ferrari/#reviews", wait_callback=content_ready)

with open("test.html", "w") as hf:
    hf.write(br.html.encode("utf-8"))
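
The idea behind the callback can be tried out without spynner or a network connection. The following is only a minimal sketch: make_content_ready, wait_until and FakeBrowser are hypothetical names made up for illustration, and the polling loop is not how spynner implements waiting internally. It just shows the "keep checking until the marker appears or a timeout expires" pattern that the callback relies on.

```python
import time


def make_content_ready(marker):
    """Build a callback that returns True once `marker` appears in browser.html."""
    def content_ready(browser):
        return marker in browser.html
    return content_ready


def wait_until(predicate, browser, timeout=10.0, interval=0.1):
    """Poll predicate(browser) until it is true or `timeout` seconds pass.

    Illustration only: a real spynner Browser must keep its own event loop
    running, so with spynner pass the callback to load() instead.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        if predicate(browser):
            return True
        time.sleep(interval)
    # One last check after the deadline.
    return bool(predicate(browser))


class FakeBrowser(object):
    """Hypothetical stand-in whose .html changes on each read, like a page still loading."""

    def __init__(self, pages):
        self._pages = list(pages)

    @property
    def html(self):
        # Hand out successive snapshots, then keep returning the last one.
        if len(self._pages) > 1:
            return self._pages.pop(0)
        return self._pages[0]


if __name__ == "__main__":
    snapshots = ["<html>", "<html><p>", '<p id="feedback-1">ok</p>']
    ready = make_content_ready('id="feedback-')
    print(wait_until(ready, FakeBrowser(snapshots), timeout=2.0, interval=0.01))
```

With the real library, the callback produced by make_content_ready('id="feedback-') could be passed as wait_callback to br.load() in the same way as content_ready above.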
Licensed under: CC-BY-SA with attribution