Question

I've built a crawler for a particular website using Scrapy. The crawler follows a URL if it matches one given regex, and calls the callback function if the URL matches another defined regex. The main purpose of building the crawler was to extract all the required links within the website rather than the contents of each link. Can anyone tell me how to print the list of all the crawled links? The code is:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class XyzSpider(CrawlSpider):
    name = "xyz"
    allowed_domains = ["xyz.com"]
    start_urls = ["http://www.xyz.com/Vacanciess"]
    rules = (
        # call parse_item for URLs matching regex2
        Rule(SgmlLinkExtractor(allow=[regex2]), callback='parse_item'),
        # follow URLs matching regex1
        Rule(SgmlLinkExtractor(allow=[regex1]), follow=True),
    )

    def parse_item(self, response):
        #sel = Selector(response)
        #title = sel.xpath("//h1[@class='no-bd']/text()").extract()
        #print title
        print response

The

print title

code works perfectly well. But if I try to print the actual response, as in the code above, it returns:

[xyz] DEBUG: Crawled (200)<GET http://www.xyz.com/urlmatchingregex2> (referer:  http://www.xyz.com/urlmatchingregex1)
<200 http://www.xyz.com/urlmatchingregex2>

Can anyone please help me retrieve the actual URL?


Solution

You can print response.url in the parse_item method to print the URL that was crawled. The url attribute of the Response object is documented in Scrapy's Request and Response documentation.
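For illustration, a minimal sketch of what parse_item could look like under that approach; the crawled_links accumulator is an assumption added here (it is not part of the original spider) to keep the full list of matched URLs:

crawled_links = []  # assumption: class-level list collecting every matched URL

def parse_item(self, response):
    # response.url is the actual URL of the crawled page; printing the
    # response object itself only shows its repr, e.g. <200 http://...>
    print response.url
    self.crawled_links.append(response.url)

Because Scrapy creates a single spider instance per crawl, appending to a class-level list like this collects every URL that matched regex2, and the list can then be printed from the spider's closed() method once the crawl finishes.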

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow