Question

I found a project, Jaxer, which embeds Firefox's JavaScript engine on the server side, so it can parse HTML server-side very well. But this project seems to be dead. Something like it would be really helpful for crawling web pages, parsing the HTML, and extracting data.

Is there a newer technology that is useful for extracting information like this?


Solution

What I've done in the past is use Selenium RC to control a web browser (usually Firefox) from code, loading and parsing websites through a real browser.

The cool thing about this is that you're mostly coding in a language you're comfortable with, be it Perl, Ruby, or C#. But to use the full power of Selenium you still need to know and write JavaScript. A sketch of the idea is below.
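As a rough illustration, here is a minimal sketch using the current Selenium WebDriver Python bindings rather than the older Selenium RC API mentioned above (RC has since been deprecated); the URL and selectors are placeholders:

```python
# Load a page in a real Firefox and pull data out of the rendered DOM.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()  # a real browser, so the page's JS runs normally
try:
    driver.get("https://example.com/some-page")  # placeholder URL
    # In practice you'd use WebDriverWait for dynamic pages; omitted for brevity.
    for el in driver.find_elements(By.CSS_SELECTOR, "h1, h2"):
        print(el.text)
    # You can still drop down to raw JavaScript when the DOM API isn't enough:
    links = driver.execute_script(
        "return Array.from(document.links).map(a => a.href);"
    )
    print(links)
finally:
    driver.quit()
```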

OTHER TIPS

Another interesting way to do this is to use node.js in conjunction with jsdom and node-htmlparser to load a page and run the JavaScript in it. It doesn't really work out of the box yet, but Dav Glass (from Yahoo) has had success running YUI in node.js using a modified version of this combination.

This is interesting if you decide that nothing out there is good enough and you want to implement your own. If so, it would make an excellent open source project.

I've had some success writing a JS-enabled crawler in Python + pywebkitgtk + JavaScript. It's much slower than a traditional crawler, but it gets the job done and can do cool stuff like take screenshots and pick up content that has been 'obscured' by JS injection.

There's a decent article with some example code here:

http://blog.motane.lu/2009/06/18/pywebkitgtk-execute-javascript-from-python/
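A condensed version of the pattern from that article is sketched below. pywebkitgtk is GTK2-era and long unmaintained, so treat this as an illustration of the idea rather than a drop-in script; the trick is to stash the rendered DOM into `document.title` after WebKit has executed the page's JavaScript, then read it back from Python.

```python
import gtk
import webkit

def on_load_finished(view, frame):
    # Copy the post-JS DOM into the title so Python can read it back.
    view.execute_script(
        "oldtitle = document.title;"
        "document.title = document.documentElement.innerHTML;"
    )
    rendered_html = frame.get_title()  # DOM after JavaScript has run
    view.execute_script("document.title = oldtitle;")
    print(rendered_html)
    gtk.main_quit()

view = webkit.WebView()
view.connect("load-finished", on_load_finished)
view.open("http://example.com/")  # placeholder URL
gtk.main()  # WebKit does its loading and JS execution inside the GTK main loop
```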

ItsNat is similar to Jaxer; the main difference is that it is Java-based instead of JavaScript-based.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow