Question

I need to automate a process involving a website that uses a login form. I need to capture some data in the pages following the login page.

I know how to screen-scrape normal pages, but not those behind a secure site.

  1. Can this be done with the .NET WebClient class?
    • How would I automatically login?
    • How would I keep logged in for the other pages?

Solution

One way would be to automate a browser, but since you mentioned WebClient, I'm guessing you're referring to WebClient in .NET.

Two main points:

  • There's nothing special about HTTPS as far as WebClient is concerned; it just works.
  • Cookies are typically used to carry authentication -- you'll need to capture and replay them.

Here are the steps I'd follow (a rough C# sketch follows the list):

  1. GET the login form and capture the cookie in the response.
  2. Using XPath and HtmlAgilityPack, find the "input type=hidden" field names and values.
  3. POST to the login form's action with the username, password, and hidden field values in the request body. Include the cookie in the request headers. Again, capture the cookie in the response.
  4. GET the pages you want, again, with the cookie in the request headers.
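Here's a minimal sketch of those four steps in C#. WebClient doesn't expose a cookie container directly, so this uses HttpWebRequest with a shared CookieContainer; the URLs and the Username/Password field names are placeholders, not anything from the actual site.

   // The URLs and the Username/Password field names below are placeholders.
   using System;
   using System.IO;
   using System.Net;
   using System.Text;
   using HtmlAgilityPack;

   class LoginScraper
   {
       static void Main()
       {
           // One CookieContainer shared across all requests carries the session.
           var cookies = new CookieContainer();

           // Step 1: GET the login page; response cookies land in the container.
           string loginHtml = Get("https://example.com/login", cookies);

           // Step 2: pull the hidden field names/values out with HtmlAgilityPack + XPath.
           var doc = new HtmlDocument();
           doc.LoadHtml(loginHtml);
           var body = new StringBuilder("Username=me&Password=secret");
           var hidden = doc.DocumentNode.SelectNodes("//input[@type='hidden']");
           if (hidden != null)
               foreach (var input in hidden)
                   body.AppendFormat("&{0}={1}",
                       Uri.EscapeDataString(input.GetAttributeValue("name", "")),
                       Uri.EscapeDataString(input.GetAttributeValue("value", "")));

           // Step 3: POST credentials plus hidden fields to the form's action,
           // sending the captured cookie and capturing any new one.
           Post("https://example.com/login", body.ToString(), cookies);

           // Step 4: GET the protected pages; the session cookie rides along.
           Console.WriteLine(Get("https://example.com/members/report", cookies));
       }

       static string Get(string url, CookieContainer cookies)
       {
           var request = (HttpWebRequest)WebRequest.Create(url);
           request.CookieContainer = cookies;
           using (var response = request.GetResponse())
           using (var reader = new StreamReader(response.GetResponseStream()))
               return reader.ReadToEnd();
       }

       static string Post(string url, string formBody, CookieContainer cookies)
       {
           var request = (HttpWebRequest)WebRequest.Create(url);
           request.CookieContainer = cookies;
           request.Method = "POST";
           request.ContentType = "application/x-www-form-urlencoded";
           byte[] bytes = Encoding.UTF8.GetBytes(formBody);
           using (var stream = request.GetRequestStream())
               stream.Write(bytes, 0, bytes.Length);
           using (var response = request.GetResponse())
           using (var reader = new StreamReader(response.GetResponseStream()))
               return reader.ReadToEnd();
       }
   }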

In steps 2 and 3 I describe a somewhat involved way of automating the login. Usually you can POST the username and password directly to the known login form action without first GETting the form or relaying the hidden fields, but some sites have form validation (distinct from field validation) that keeps this shortcut from working.

HtmlAgilityPack is a .NET library that parses ill-formed HTML into a document you can query with XPath. Quite useful.
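For instance, a toy snippet (the HTML fragment here is made up) showing that markup with unclosed tags and unquoted attributes still parses and can be queried:

   using System;
   using HtmlAgilityPack;

   class HapDemo
   {
       static void Main()
       {
           // Unclosed tags and unquoted attributes are tolerated.
           var doc = new HtmlDocument();
           doc.LoadHtml("<form><input type=hidden name=token value=abc123><p>broken");
           foreach (var input in doc.DocumentNode.SelectNodes("//input[@type='hidden']"))
               Console.WriteLine("{0} = {1}",
                   input.GetAttributeValue("name", ""),
                   input.GetAttributeValue("value", ""));
           // Prints: token = abc123
       }
   }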

Finally, you may run into a situation where the form relies on client script to alter the form values before submitting. You may need to simulate this behavior.

Using a tool to view the HTTP traffic is extremely helpful for this type of work - I recommend ieHTTPHeaders, Fiddler, or Firebug (Net tab).

OTHER TIPS

You can easily simulate user input: submit the form on the web page from your program by sending a POST/GET request to the website.
A typical login form looks like:

<form name="loginForm" method="post" Action="target_page.html">
   <input type="Text" name="Username">
   <input type="Password" name="Password">
</form>

You can send a POST request to the website providing values for the Username and Password fields. What happens after you send your request largely depends on the website; usually you will be redirected to some page. Your authorization info will be stored in the session/cookie, so if your scraping client can maintain a web session and understands cookies, you will be able to access the protected pages.
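Since the question mentions .NET's WebClient: it has no cookie handling out of the box, so one common approach (a sketch with made-up URLs, assuming the field names from the form above) is to subclass it and attach a CookieContainer, then POST the form fields and reuse the same client for the protected pages:

   using System;
   using System.Collections.Specialized;
   using System.Net;

   // WebClient subclass that carries a cookie container across requests.
   class CookieAwareWebClient : WebClient
   {
       public CookieContainer Cookies = new CookieContainer();

       protected override WebRequest GetWebRequest(Uri address)
       {
           var request = base.GetWebRequest(address);
           var http = request as HttpWebRequest;
           if (http != null)
               http.CookieContainer = Cookies;   // session cookies persist between calls
           return request;
       }
   }

   class Demo
   {
       static void Main()
       {
           using (var client = new CookieAwareWebClient())
           {
               // Submit the login form; field names match the HTML above.
               var form = new NameValueCollection();
               form.Add("Username", "me");
               form.Add("Password", "secret");
               client.UploadValues("https://example.com/target_page.html", "POST", form);

               // Later requests reuse the session cookie set at login.
               Console.WriteLine(client.DownloadString("https://example.com/members/data"));
           }
       }
   }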

It's not clear from your question what language/framework you're going to use. For example, there is a framework for screen scraping (including login functionality) written in Perl - WWW::Mechanize.

Note that you may face problems if the site you're trying to log in to uses JavaScript or some kind of CAPTCHA.

Can you please clarify? Is the WebClient class you speak of the one in HTTPUnit/Java?

If so, your session should be saved automatically.

It isn't clear from your question which WebClient class (or language) you are referring to.

If you have a Java runtime you can use the Apache HttpClient class; here's an example I wrote in Groovy that accesses the del.icio.us API over SSL:

   import org.apache.commons.httpclient.HttpClient
   import org.apache.commons.httpclient.UsernamePasswordCredentials
   import org.apache.commons.httpclient.auth.AuthScope
   import org.apache.commons.httpclient.methods.PostMethod

   def client = new HttpClient()

   // Basic-auth credentials for the del.icio.us API endpoint
   def credentials = new UsernamePasswordCredentials( "username", "password" )
   def authScope = new AuthScope( "api.del.icio.us", 443, AuthScope.ANY_REALM )
   client.getState().setCredentials( authScope, credentials )

   def url = "https://api.del.icio.us/v1/posts/get"
   def tag = "groovy"    // the tag to fetch posts for

   def method = new PostMethod( url )
   method.addParameter( "tag", tag )
   client.executeMethod( method )
   println method.getResponseBodyAsString()