scrapy - Get final redirected URL

https://stackoverflow.com/questions/12769994

05-07-2021

Question

I am trying to get the final redirected URL in Scrapy. For example, suppose an anchor tag has a specific format:

<a href="http://www.example.com/index.php" class="FOO_X_Y_Z" />

Then I need to get the URL that this URL redirects to (if it redirects at all; if it returns a 200 directly, the original URL is fine). For example, I select the matching anchor tags like this:

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    # extract() turns the matched @href nodes into plain strings
    anchors = hxs.select("//a[@class='FOO_X_Y_Z']/@href").extract()

    # Let's assume each anchor contains the actual link (http://...)
    for anchor in anchors:
        final_url = get_final_url(anchor)  # << I would need something like this

        # Save final_url

So if I visited http://www.example.com/index.php and it sent me through 10 redirects before finally stopping at http://www.example.com/final.php, that final URL is what I would need get_final_url() to return.

I thought of hacking my way to a solution, but I am asking here to see whether Scrapy already provides one.

Solution

Again, assuming anchor contains an actual URL, I accomplished it with urllib2:

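import urllib2

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    # extract() turns the matched @href nodes into plain strings
    anchors = hxs.select("//a[@class='FOO_X_Y_Z']/@href").extract()

    # Let's assume each anchor contains the actual link (http://...)
    for anchor in anchors:
        # urlopen(url, data, timeout): no POST data, 1-second timeout
        final_url = urllib2.urlopen(anchor, None, 1).geturl()

        # Save final_url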

urllib2.urlopen() returns a file-like object with some additional methods, one of them being geturl(), which returns the final URL (after all redirects have been followed). It's not part of Scrapy, but it works.
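urllib2 exists only in Python 2. On Python 3 the same call lives in urllib.request, so a rough equivalent of the get_final_url() helper from the question might look like the sketch below. The function name comes from the question; the 5-second default timeout is an assumption, not something the original answer specifies.

```python
from urllib.request import urlopen


def get_final_url(url, timeout=5.0):
    """Return the URL reached after urllib has followed all redirects."""
    # urlopen() follows HTTP redirects by default. The with-block closes
    # the connection as soon as the final URL has been read.
    with urlopen(url, timeout=timeout) as response:
        return response.geturl()
```

Calling get_final_url("http://www.example.com/index.php") would issue a GET request and block until the redirect chain ends, so inside a Scrapy callback it stalls the crawl for the duration of each lookup. It also raises urllib.error.HTTPError or urllib.error.URLError when the request fails, which the caller would need to handle.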

OTHER TIPS

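I use response.headers, which returns a dictionary-like object holding the response headers. When the response is itself a redirect (a 3xx that has not been followed), the new URL is the value of the "Location" key.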

In [1]: response.headers
Out[1]: 
{'Date': 'Thu, 09 Jun 2016 00:18:18 GMT',
 'Location': 'https:/www.protiviti.com/en-US/Pages/default.aspx',
 'Server': 'nginx/1.9.1',
 'X-Ms-Invokeapp': '1; RequireReadOnly'}

It's pretty simple. Scrapy follows redirects by default, so inside parse() response.url is already the final URL:

print(response.url)  # inside parse()

Licensed under: CC-BY-SA with attribution
