Question

I'm writing spiders with Scrapy to get some data from a couple of applications that use ASP. Both webpages are almost identical and require logging in before scraping can start, but I only managed to scrape one of them. On the other one Scrapy waits forever and never gets past the login made with the FormRequest method.

The code of both spiders (they are almost identical, only the IPs differ) is as follows:

from scrapy.spider import BaseSpider
from scrapy.http import FormRequest
from scrapy.shell import inspect_response

class MySpider(BaseSpider):
    name = "my_very_nice_spider"
    allowed_domains = ["xxx.xxx.xxx.xxx"]
    start_urls = ['http://xxx.xxx.xxx.xxx/reporting/']

    def parse(self, response):
        # Simulate user login on (http://xxx.xxx.xxx.xxx/reporting/)
        return [FormRequest.from_response(response,
                                          formdata={'user': 'the_username',
                                                    'password': 'my_nice_password'},
                                          callback=self.after_login)]

    def after_login(self, response):
        inspect_response(response, self)  # Spider never gets here on one of the sites
        if "Bad login" in response.body:
            print "Login failed"
            return
        # Scraping code begins...

Wondering what could be different between them, I used Firefox's Live HTTP Headers to inspect the headers and found only one difference: the webpage that works runs on IIS 6.0 and the one that doesn't on IIS 5.1.

As this alone couldn't explain to me why one works and the other doesn't, I used Wireshark to capture the network traffic and found this:

Interaction using scrapy with the working webpage (IIS 6.0)

scrapy  --> webpage GET /reporting/ HTTP/1.1
scrapy  <-- webpage HTTP/1.1 200 OK
scrapy  --> webpage POST /reporting/ HTTP/1.1 (application/x-www-form-urlencoded)
scrapy  <-- webpage HTTP/1.1 302 Object moved
scrapy  --> webpage GET /reporting/htm/webpage.asp
scrapy  <-- webpage HTTP/1.1 200 OK
scrapy  --> webpage POST /reporting/asp/report1.asp
...Scraping begins

Interaction using scrapy with the non-working webpage (IIS 5.1)

scrapy  --> webpage GET /reporting/ HTTP/1.1
scrapy  <-- webpage HTTP/1.1 200 OK
scrapy  --> webpage POST /reporting/ HTTP/1.1 (application/x-www-form-urlencoded)
scrapy  <-- webpage HTTP/1.1 100 Continue # What the f...?
scrapy  <-- webpage HTTP/1.1 302 Object moved
...Scrapy waits forever...

I googled a little bit and found that IIS 5.1 indeed has a nice kind of "feature" that makes it return HTTP 100 Continue whenever someone makes a POST to it, as shown here.
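
For anyone wanting to reproduce this outside Scrapy, a minimal raw-socket probe along these lines (same placeholder IP and form fields as above) should show the unsolicited 100 Continue coming from the IIS 5.1 box:

import socket

HOST = "xxx.xxx.xxx.xxx"  # placeholder IP, as in the spider above
BODY = "user=the_username&password=my_nice_password"

request = ("POST /reporting/ HTTP/1.1\r\n"
           "Host: " + HOST + "\r\n"
           "Content-Type: application/x-www-form-urlencoded\r\n"
           "Content-Length: " + str(len(BODY)) + "\r\n"
           "Connection: close\r\n"
           "\r\n" + BODY)

sock = socket.create_connection((HOST, 80), timeout=10)
sock.sendall(request)

# Read everything the server sends back; against IIS 5.1 the reply
# should start with an interim "HTTP/1.1 100 Continue" before the
# real "HTTP/1.1 302 Object moved" response.
reply = ""
while True:
    chunk = sock.recv(4096)
    if not chunk:
        break
    reply += chunk
sock.close()

print reply.split("\r\n\r\n")[0]  # headers of the first response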

Knowing that the root of all evil is where it always is, but having to scrape that site anyway... how can I make Scrapy work in this situation? Or am I doing something wrong?

Thank you!

Edit - Console log with the non-working site:

2014-01-17 09:09:50-0300 [scrapy] INFO: Scrapy 0.20.2 started (bot: mybot)
2014-01-17 09:09:50-0300 [scrapy] DEBUG: Optional features available: ssl, http11
2014-01-17 09:09:50-0300 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'bot.spiders', 'SPIDER_MODULES': ['bot.spiders'], 'BOT_NAME': 'bot'}
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Enabled item pipelines:
2014-01-17 09:09:51-0300 [my_very_nice_spider] INFO: Spider opened
2014-01-17 09:09:51-0300 [my_very_nice_spider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-01-17 09:09:54-0300 [my_very_nice_spider] DEBUG: Crawled (200) <GET http://xxx.xxx.xxx.xxx/reporting/> (referer: None)
2014-01-17 09:10:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 1 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:11:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:12:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:12:54-0300 [my_very_nice_spider] DEBUG: Retrying <POST http://xxx.xxx.xxx.xxx/reporting/> (failed 1 times): User timeout caused connection failure: Getting http://xxx.xxx.xxx.xxx/reporting/ took longer than 180 seconds..
2014-01-17 09:13:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:14:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:15:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:15:54-0300 [my_very_nice_spider] DEBUG: Retrying <POST http://xxx.xxx.xxx.xxx/reporting/> (failed 2 times): User timeout caused connection failure: Getting http://xxx.xxx.xxx.xxx/reporting/ took longer than 180 seconds..
...

Solution

Try using the HTTP 1.0 download handler:

# settings.py
DOWNLOAD_HANDLERS = {
    'http': 'scrapy.core.downloader.handlers.http10.HTTP10DownloadHandler',
    'https': 'scrapy.core.downloader.handlers.http10.HTTP10DownloadHandler',
}
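
This should work because a server is not allowed to send an interim 100 Continue response to a client speaking HTTP/1.0 (RFC 2616, section 8.2.3), so the IIS 5.1 quirk never gets a chance to stall the downloader. The tradeoff is that every request in the project then goes over HTTP/1.0 (no persistent connections), which is usually acceptable for a small scrape like this.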