Question

I am trying to write a Python-based Web Bot that can read and interpret an HTML page, then execute an onClick function and receive the resulting new HTML page. I can already read the HTML page and I can determine the functions to be called by the onClick command, but I have no idea how to execute those functions or how to receive the resulting HTML code.

Any ideas?


Solution

The only Python tool for JavaScript that I am aware of is python-spidermonkey. I have never used it, though.

With Jython you could (ab-)use HttpUnit.

Edit: I forgot that you can use Scrapy. It supports JavaScript through SpiderMonkey, and you can even use Firefox for crawling the web.
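As a rough illustration, a minimal Scrapy spider that collects the onClick handlers could look like the sketch below (the URL and selector are placeholders; actually executing the JavaScript still needs a separate engine or middleware):

import scrapy

class OnClickSpider(scrapy.Spider):
    """Minimal spider that collects onclick attributes from a page."""
    name = "onclick_spider"
    start_urls = ["https://example.com/"]  # placeholder URL

    def parse(self, response):
        # Extract the JavaScript attached to onclick handlers;
        # running it still requires a separate JS engine.
        for handler in response.css("[onclick]::attr(onclick)").getall():
            yield {"onclick": handler}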

Edit 2: Recently, I find myself using browser automation more and more for such tasks thanks to some excellent libraries. QtWebKit offers full access to a WebKit browser, which can be used from Python thanks to language bindings (PySide or PyQt). There seem to be similar libraries and bindings for Gtk+, which I haven't tried. The Selenium WebDriver API also works great and has an active community.
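For example, a click-and-read with Selenium might look roughly like the sketch below (it assumes Chrome and a matching chromedriver are available; the URL and selector are placeholders):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()            # assumes chromedriver is on PATH
driver.get("https://example.com/")     # placeholder URL

# Click the element whose onClick handler we want to trigger;
# the real browser executes the JavaScript for us.
driver.find_element(By.CSS_SELECTOR, "a#some-link").click()

# After the handler has run, the updated DOM is available as HTML.
html = driver.page_source
print(html)

driver.quit()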

OTHER TIPS

Well, obviously Python won't interpret the JS for you (though there may be modules out there that can). I suppose you need to convert the JS instructions into equivalent transformations in Python.

I suppose ElementTree or BeautifulSoup would be good starting points to interpret the HTML structure.
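For instance, pulling the onClick handlers out of the page with BeautifulSoup could look like this sketch (the sample HTML stands in for whatever your bot already fetched):

from bs4 import BeautifulSoup

# Placeholder for the page your bot has already downloaded.
html = "<a href='#' onclick=\"loadPage('news')\">News</a>"

soup = BeautifulSoup(html, "html.parser")
# Collect every element that carries an onclick attribute.
for element in soup.find_all(attrs={"onclick": True}):
    print(element.name, element["onclick"])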

To execute JavaScript, you need to do much of what a full web browser does, except for the rendering. In particular, you need a JavaScript interpreter, in addition to the Python interpreter.

One starting point might be python-spidermonkey. Depending on the specific JavaScript, you might have to expose a reasonable DOM API to SpiderMonkey, as well as an XMLHttpRequest implementation.
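As an untested sketch of what that could look like with python-spidermonkey's Runtime/Context API (the FakeDocument class is only a stand-in for the DOM you would have to provide):

import spidermonkey  # python-spidermonkey

class FakeDocument(object):
    """Stand-in for the DOM API the page's JavaScript expects."""
    def write(self, text):
        print("document.write:", text)

rt = spidermonkey.Runtime()
cx = rt.new_context()
cx.add_global("document", FakeDocument())

# Run the JavaScript extracted from the onClick attribute.
cx.execute("document.write('hello from JS');")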

You can try to leverage V8:

V8 is Google's open source, high performance JavaScript engine. It is written in C++ and is used in Google Chrome, Google's open source browser.

Calling it from Python may not be straightforward without a framework to provide the DOM. Pyjamas has an experimental project, Pyjamas Desktop, that provides V8 integration for JavaScript execution.

PyV8 is an experimental set of Python bindings for V8 and a Python-to-JavaScript compiler.
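If you go that route, usage is roughly as follows (an untested sketch based on PyV8's JSContext API; treat the details as assumptions, and note that PyV8 has not been maintained for a long time):

import PyV8

ctxt = PyV8.JSContext()   # create a JavaScript execution context
ctxt.enter()              # make it the current context

# Evaluate the JavaScript pulled out of the onClick handler.
result = ctxt.eval("var x = 2 + 3; x * 10;")
print(result)             # 50

ctxt.leave()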

For the browser part of this you might want to look into Mechanize, which is basically a web browser implemented as a Python library: http://pypi.python.org/pypi/mechanize/0.1.11. But as mentioned, the text in onClick is JavaScript, and you'll need SpiderMonkey for that.
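Without the JavaScript part, basic Mechanize usage looks roughly like this (a sketch; the URL and form field names are placeholders):

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)                      # be careful about this in real use
response = br.open("http://example.com/login")   # placeholder URL
print(br.title())

# Fill in and submit the first form on the page.
br.select_form(nr=0)
br["username"] = "myuser"                        # placeholder field names
br["password"] = "mypassword"
result = br.submit()
html = result.read()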

If you can make a generic support for spidermonkey in mechanize, I'm sure many people would be extremely happy. ;)

Mechanize may be overkill; if you just want to find specific parts of the HTML, lxml and BeautifulSoup both work well.

Why don't you just sniff the request that gets sent after the onclick event and replicate it with your bot?
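For instance, if the browser's developer tools show that the onclick handler fires a POST to some endpoint, your bot can replay that request directly. Here is a sketch using the requests library, where the URL, parameters, and headers are hypothetical and must be taken from what you actually sniff:

import requests

# Values observed in the browser's network panel (hypothetical here).
url = "https://example.com/ajax/load_page"
payload = {"page": "news", "id": "42"}
headers = {"X-Requested-With": "XMLHttpRequest"}

response = requests.post(url, data=payload, headers=headers)
html = response.text   # the HTML (or JSON) the onclick would have produced
print(html)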

For web automation, you can look into the "webbot" library. It makes automation damn simple and pain-free.

webbot works even for web pages with dynamically changing IDs and class names, and it has more methods and features than Selenium and Mechanize.

Here's a snippet using webbot:

from webbot import Browser

web = Browser()
web.go_to('google.com')
web.click('Sign in')
web.type('mymail@gmail.com', into='Email')
web.click('NEXT', tag='span')
web.type('mypassword', into='Password', id='passwordFieldId')  # specific selection
web.click('NEXT', tag='span')  # you are logged in ^_^

Docs are at: https://webbot.readthedocs.io

Licensed under: CC-BY-SA with attribution