Question

I'm using Mojolicious (Mojo::DOM and Mojo::UserAgent) to fetch the source of a page from Webarchive.org, parse it, and import it into a Dotclear database (using Webarchive as a backup). The source contains "Previous" and "Next" links that lead to the other posts originally published on the blog.

The Perl script I have developed is supposed to follow those links to import every page of this blog's snapshot. It first gets the source of the first post, parses it, stores the result in a local DB, then takes the link under "Next" and does the same thing for the next post, until there are no more "Next" posts.

So much for the basics.

The catch is that the link I get from the source is not the link Webarchive actually has. Webarchive's links to snapshots look like this:

http://web.archive.org/web/20131012182412/http://www.mytarget.com/post?mypost

The big number between "web" and the original URL is (I guess) the date the snapshot was made. The problem is that it changes with each snapshot: the number that appears in one post's URL does not apply to the next post, which was snapshotted on another date. So the URL won't match.
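To illustrate the URL layout described above, here is a minimal sketch that splits a snapshot URL into its numeric part and the original URL. It assumes the number is a 14-digit timestamp of the form YYYYMMDDhhmmss, which matches the example URL; the helper name split_snapshot_url is made up for this illustration and is not part of any library.

```perl
use strict;
use warnings;

# Hypothetical helper: split a snapshot URL into (timestamp, original URL).
# Assumes a 14-digit timestamp right after "/web/".
# Returns an empty list (undef in scalar context) when the URL does not match.
sub split_snapshot_url {
    my ($snapshot_url) = @_;
    my ($timestamp, $original) =
        $snapshot_url =~ m{^https?://web\.archive\.org/web/(\d{14})/(.+)$}
        or return;
    return ($timestamp, $original);
}

my ($ts, $orig) = split_snapshot_url(
    'http://web.archive.org/web/20131012182412/http://www.mytarget.com/post?mypost'
);

# Read the first eight digits as year, month, and day.
my ($year, $month, $day) = $ts =~ /^(\d{4})(\d{2})(\d{2})/;
print "Snapshot of $orig taken on $year-$month-$day\n";
```

Separating the two parts makes it easy to see that only the timestamp differs between snapshots of different posts, while the original URL is what the "Next" link in the page source actually identifies.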

When I click on the link I get from the source, it takes me to Webarchive, which automatically looks up the page I passed and redirects me to it. But when I try to get the source via Mojolicious's get() method, I just get Webarchive's "Page not found" page.

So here is my question: is there a way to make Mojolicious follow Webarchive's redirection? I set max_redirects(5) on my UserAgent, but I still get the same result.

Here is my code:

use Mojo::UserAgent;

sub main {
    my ($url) = @_;
    my $ua = Mojo::UserAgent->new;
    # Allow up to 5 redirects to be followed per request
    $ua->max_redirects(5);
    my $dom = $ua->get($url)->res->dom;

    #...Treatment and parsing of the source ...
    return $nextUrl;
}

my $nextUrl="http://web.archive.org/web/20131012182412/http://www.mytarget.com/post?mypost";
my $secondUrl;

while ($nextUrl){
    $secondUrl = main($nextUrl);
    $nextUrl = $secondUrl;
}

Thanks in advance...


Solution

I've finally found a workaround. I use this piece of code to follow the redirects and get the URL that is finally reached:

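use LWP::UserAgent qw();
my $ua = LWP::UserAgent->new;
# LWP::UserAgent follows redirects for GET requests by default
my $ret = $ua->get($url);
# The request attached to the final response holds the URL actually reached
$url = $ret->request->uri->as_string;
print "URL returned: $url\n";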

Then I use that URL to fetch the page source and parse it.

Licensed under: CC-BY-SA with attribution