Question

I want to be able to get the HTML for a page which, if I were doing it interactively in a browser, would involve multiple actions and page loads:

1. Go to the homepage.
2. Enter text into a login form and submit the form (POST).
3. The POST goes through various redirections and frameset usage.

Cookies are set and updated throughout this process.

In the browser, after submitting, I just get the page.

But to do this with curl (in PHP or anything else), wget, or some other low-level tool, managing the cookies, redirections and framesets becomes quite a chore and binds my script very tightly to the website, making it susceptible to even small changes in the site I'm scraping.
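For illustration, the low-level route ends up looking something like this Python sketch with the requests library (the URLs, form field names and frame URL are hypothetical, but every one of them ties the script to the current layout of the site):

    import requests

    session = requests.Session()          # keeps cookies across requests
    session.get("https://example.com/")   # 1. homepage sets the initial cookies

    # 2. post the login form; redirects are followed, cookies carried along
    session.post(
        "https://example.com/login",
        data={"username": "me", "password": "secret"},
    )

    # 3. the landing page is a frameset, so the real content has to be
    #    fetched from the frame's own (hypothetical) URL
    html = session.get("https://example.com/frames/main").text
    print(html)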

Can anyone suggest a way to do this?

I've already looked at Crowbar, PhantomJS and Lynx (with its cmd_log/cmd_script options), but chaining everything together to mimic exactly what I'd do in Firefox or Chrome is difficult.

(As an aside, it might even be useful or necessary for the target website to think this script is Firefox or Chrome or a "real" browser.)


Solution

One way to do this is to use Selenium RC. While it's usually used for testing, at its core it's just a browser remote-control service.

Use this web site as a starting point: http://seleniumhq.org/projects/remote-control/
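As a rough illustration, here is a minimal sketch using the current Selenium WebDriver Python bindings (the RC API shown on that page is now legacy); the URL, element names and frame name are placeholders, and it assumes Firefox with geckodriver is installed:

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    # Drive a real Firefox instance; cookies, redirects and framesets are
    # handled exactly as they would be interactively.
    driver = webdriver.Firefox()
    try:
        driver.get("https://example.com/")                    # 1. homepage
        driver.find_element(By.NAME, "username").send_keys("me")
        driver.find_element(By.NAME, "password").send_keys("secret")
        driver.find_element(By.NAME, "login").submit()        # 2. submit the form

        # 3. once the redirects settle, switch into the frame (placeholder
        #    name) and grab the final HTML
        driver.switch_to.frame("main")
        print(driver.page_source)
    finally:
        driver.quit()

Because a real browser is doing the work, the target site sees a normal Firefox user agent, which also covers the aside in the question.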

OTHER TIPS

You can use irobot at irobotsoft to record a robot and replay it.

If you prefer low-level control, you can use the HTQL Python interface (see http://htql.net/htql-python-manual.pdf), which lets you drive an IE-based browser from Python.

Use a tool like Firebug to check which headers are submitted to the website during login, and then replicate them exactly in your code.

Or just log in with your browser and then reuse its cookie in your code; a rough sketch of both approaches follows.
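A hedged sketch of both tips with Python's requests (the header values, cookie name and URLs are placeholders for whatever the browser's network tools actually show):

    import requests

    # Headers copied verbatim from the browser's network panel (placeholders)
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0",
        "Referer": "https://example.com/",
    }

    session = requests.Session()

    # Either replay the login post with the same fields the browser sent...
    session.post(
        "https://example.com/login",
        headers=headers,
        data={"username": "me", "password": "secret"},
    )

    # ...or skip the login entirely and reuse the session cookie from a
    # browser where you are already logged in (cookie name is a placeholder).
    session.cookies.set("PHPSESSID", "value-copied-from-the-browser")

    html = session.get("https://example.com/members/home", headers=headers).text
    print(html)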
