Question

While HTML scraping is pretty well documented from what I can see, and I understand the concept and implementation of it, what is the best method for scraping content that is tucked away behind authentication forms? I'm referring to scraping content that I legitimately have access to, so what I'm looking for is a method for automatically submitting login data.

All I can think of is setting up a proxy, capturing the traffic from a manual login, then writing a script to replay that traffic as part of the HTML scraping run. As far as language goes, it would likely be done in Perl.

Has anyone had experience with this, or just a general thought?

Edit: This has been answered before, but with .NET. While it validates how I think it should be done, does anyone have a Perl script to do this?


Solution

Check out the Perl WWW::Mechanize library - it builds on LWP to provide tools for doing exactly the kind of interaction you refer to, and it can maintain state with cookies while you're about it!

WWW::Mechanize, or Mech for short, helps you automate interaction with a website. It supports performing a sequence of page fetches including following links and submitting forms. Each fetched page is parsed and its links and forms are extracted. A link or a form can be selected, form fields can be filled and the next page can be fetched. Mech also stores a history of the URLs you've visited, which can be queried and revisited.
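A login with Mech boils down to fetching the login page, filling in the form, and then requesting the protected pages on the same object. Here is a minimal sketch; the URLs and the `username`/`password` field names are placeholders you'd replace with the real ones from the site's login form:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

# Mech creates a cookie jar for you, so the session set up by the
# login form submission persists across later requests.
my $mech = WWW::Mechanize->new( autocheck => 0 );

# Placeholder login URL and form field names.
$mech->get('https://example.com/login');

$mech->submit_form(
    form_number => 1,
    fields      => {
        username => 'me',
        password => 'secret',
    },
);

# The session cookie is sent automatically from here on, so this
# fetch lands behind the login wall.
$mech->get('https://example.com/members/data');
print $mech->content;
```

With `autocheck => 0` failed fetches won't die mid-script; check `$mech->success` after each `get` if you want explicit error handling.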

OTHER TIPS

The LWP module in Perl should give you what you're after.

There's a good article here which talks about enabling cookies and other authentication methods to get you an authorised login and allow your screen scraper to get behind the login wall.

There are two types of authentication in regular use: HTTP-based authentication and form-based authentication.

For a site that uses HTTP-based authentication, you basically send the username and password as part of every HTTP request you make to the server.
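Concretely, HTTP Basic authentication is just an `Authorization` header containing `base64("username:password")`. A sketch of both the raw header and the more usual LWP route (the host, realm, and credentials here are placeholders):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64);
use LWP::UserAgent;

# The header LWP would send on your behalf: "Basic" plus the
# base64-encoded "username:password" pair.
my $auth_header = 'Basic ' . encode_base64('user:pass', '');

# Normally you register the credentials with the user agent and let
# it answer the 401 challenge itself. Netloc and realm are placeholders.
my $ua = LWP::UserAgent->new;
$ua->credentials('example.com:443', 'Protected Area', 'user', 'pass');

# Every request to that host/realm is now retried with the credentials:
# my $response = $ua->get('https://example.com/private/report');
```

Note that Basic auth is only obscured, not encrypted, so it should only be sent over HTTPS.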

For a site that uses form-based authentication, you usually need to visit the login page, submit your credentials, store the cookie the server sets, and then send that cookie with every subsequent HTTP request.
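With plain LWP (no Mech), that means attaching a cookie jar to the user agent before you POST the login form. A sketch, assuming placeholder URLs and form field names that you'd take from the site's actual login form:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Cookies;

# The cookie jar stores the session cookie from the login response
# and replays it on every later request automatically.
my $ua = LWP::UserAgent->new(
    cookie_jar => HTTP::Cookies->new,
);

# Placeholder login URL and field names -- inspect the site's login
# form (or capture a manual login) to find the real ones.
my $login = $ua->post(
    'https://example.com/login',
    { username => 'me', password => 'secret' },
);

# This request carries the stored cookie, so it gets past the login wall.
my $page = $ua->get('https://example.com/members/data');
print $page->decoded_content if $page->is_success;
```

If you want the cookies to survive between runs, pass `file => 'cookies.txt', autosave => 1` to `HTTP::Cookies->new`.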

Of course, there are also sites like Stack Overflow that use external authentication such as OpenID or SAML. These are more complex to deal with when scraping; usually you want to find a library to handle them.

Yes, you can use equivalent libraries in your own language if it is something other than ASP.NET.

For example, in Java you can use httpclient or httpunit (which even handles some basic JavaScript).

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow