Question

I'm using Scrapy to scrape some gold that's behind an authentication screen. The website uses ASP.NET, and ASP.NET has some stupid hidden fields littered all over the form (like __VIEWSTATE, __EVENTTARGET).

When I call FormRequest.from_response(response,... I'm expecting that it reads these hidden fields automatically from the response and populates them in the formdata dictionary - which is what Scrapy's FormRequest documentation says it should do.

But if that's the case, then why does the login process only work when I explicitly list these fields and populate them?

from scrapy import FormRequest, Selector, Spider


class ItsyBitsy(Spider):
    name = "itsybitsy"
    allowed_domains = ["website.com"]
    start_urls = ["http://website.com/cpanel/Default.aspx"]

    def parse(self, response):
        # Performs authentication to get past the login form
        sel = Selector(response)
        return [FormRequest.from_response(
            response,
            formdata={
                'tb_Username': 'admin',
                'tb_Password': 'password',

                # The following fields should be auto populated, right?
                # So why does removing 'em break the login (w/500 Server Error)
                '__VIEWSTATE':
                    sel.xpath("//input[@name='__VIEWSTATE']/@value").extract(),
                '__EVENTVALIDATION':
                    sel.xpath("//input[@name='__EVENTVALIDATION']/@value").extract(),
                '__EVENTTARGET': 'b_Login',
            },
            callback=self.after_login,
            clickdata={'id': 'b_Login'},
            dont_click=True)]

    def after_login(self, response):
        # Mmm, scrumptious
        pass

Edit: Adding form HTML

<form id="form1" action="Default.aspx" method="post" name="form1">
<div>
<input type="hidden" value="" id="__EVENTTARGET" name="__EVENTTARGET">
<input type="hidden" value="" id="__EVENTARGUMENT" name="__EVENTARGUMENT">
<input type="hidden" value="/wEPDwULLTE2OTg2NjA1NTAPZBYCAgMPZBYGAgMPD2QWAh4Kb25rZXlwcmVzcwUlcmV0dXJuIGNsaWNrQnV0dG9uKGV2ZW50LCAnYl9Mb2dpbicpO2QCBQ8PZBYCHwAFJXJldHVybiBjbGlja0J1dHRvbihldmVudCwgJ2JfTG9naW4nKTtkAgcPD2QWAh4Hb25jbGljawUPcmV0dXJuIGxvZ2luKCk7ZGRKt/WTOQThVTxB9Y0QcIuRqylCIw==" id="__VIEWSTATE" name="__VIEWSTATE">
</div>

<script type="text/javascript">
//<![CDATA[
var theForm = document.forms['form1'];
if (!theForm) {
theForm = document.form1;
}
function __doPostBack(eventTarget, eventArgument) {
if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
theForm.__EVENTTARGET.value = eventTarget;
theForm.__EVENTARGUMENT.value = eventArgument;
theForm.submit();
}
}
//]]>
</script>


<div>

<input type="hidden" value="/wEWBAK0o8DDCQLxz5rcDwLF8dCIDALHyYWSA+rA4VJNaEpFIycMDHQPUOz393TI" id="__EVENTVALIDATION" name="__EVENTVALIDATION">
<input type="text" onkeypress="return clickButton(event, 'b_Login');" size="28" class="textfield-text" id="tb_Username" name="tb_Username">
<input type="password" onkeypress="return clickButton(event, 'b_Login');" size="28" class="textfield-text" id="tb_Password" name="tb_Password">
<a href="javascript:__doPostBack('b_Login','')" class="button-link" id="b_Login" onclick="return login();">Login</a>
</form>

Solution

According to the source code, Scrapy uses the following XPath expression to parse the inputs out of the form:

descendant::textarea|descendant::select|descendant::input[@type!="submit" and @type!="image" and @type!="reset" and ((@type!="checkbox" and @type!="radio") or @checked)]

In other words, all of your hidden inputs are successfully parsed (and later sent with the request), with values taken from their value attributes. So Scrapy does what it should here.
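To make that concrete, here is a rough standard-library approximation of the selection rule. It is not Scrapy's code: the FormFieldCollector class is a hypothetical helper, it only looks at input elements (textarea and select are left out), and treating an input with no type attribute as a text input is an assumption of this sketch. The hidden values are placeholders.

```python
from html.parser import HTMLParser

SKIPPED_TYPES = {"submit", "image", "reset"}
TOGGLE_TYPES = {"checkbox", "radio"}


class FormFieldCollector(HTMLParser):
    """Collect name/value pairs from <input> tags, roughly like the rule above."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attrs = dict(attrs)
        name = attrs.get("name")
        input_type = (attrs.get("type") or "text").lower()
        # Buttons are never submitted as plain fields.
        if not name or input_type in SKIPPED_TYPES:
            return
        # Checkboxes and radios only count when they are checked.
        if input_type in TOGGLE_TYPES and "checked" not in attrs:
            return
        self.fields[name] = attrs.get("value") or ""


# Trimmed copy of the login form, with placeholder hidden values.
html = """
<form id="form1" action="Default.aspx" method="post" name="form1">
<input type="hidden" value="" id="__EVENTTARGET" name="__EVENTTARGET">
<input type="hidden" value="" id="__EVENTARGUMENT" name="__EVENTARGUMENT">
<input type="hidden" value="VIEWSTATE-PLACEHOLDER" id="__VIEWSTATE" name="__VIEWSTATE">
<input type="hidden" value="VALIDATION-PLACEHOLDER" id="__EVENTVALIDATION" name="__EVENTVALIDATION">
<input type="text" id="tb_Username" name="tb_Username">
<input type="password" id="tb_Password" name="tb_Password">
<a href="javascript:__doPostBack('b_Login','')" id="b_Login">Login</a>
</form>
"""

collector = FormFieldCollector()
collector.feed(html)
fields = collector.fields
print(fields)
```

The hidden fields are all picked up, but __EVENTTARGET comes through as an empty string because that is what the markup contains.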

The login through from_response() fails because __EVENTTARGET has an empty value attribute. When you log in with a real browser, the JavaScript __doPostBack() function sets __EVENTTARGET to b_Login before the form is submitted. Scrapy does not execute JavaScript, so it sends __EVENTTARGET with an empty value, and the login fails. That also explains your workaround: of the fields you list explicitly, only '__EVENTTARGET': 'b_Login' changes what is sent; __VIEWSTATE and __EVENTVALIDATION would have been copied from the form anyway.

__EVENTARGUMENT has an empty value too, but it is actually set to the empty string in the __doPostBack() function, so it doesn't make a difference here.

Hope that helps.

Licensed under: CC-BY-SA with attribution