Question

I have written a Perl script that checks a list of URLs by sending a GET request to each one.

Now suppose one of these URLs points to a very large file, say more than 100 MB.

The script requests the file like this:

use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
my $url  = "http://somewebsitename.com/very_big_file.txt";
$mech->get($url);

Once the GET request is sent, the file starts downloading. I want to cancel that download using WWW::Mechanize. How can I do that?

I checked the documentation of this Perl Module here:

http://metacpan.org/pod/WWW::Mechanize

However, I could not find a method which would help me do this.

Thanks.


Solution

Aborting a GET request

Using the :content_cb option, you can pass get() a callback function that is executed for each chunk of response content received from the server. You can set* the chunk size (in bytes) using the :read_size_hint option. Both options are documented in LWP::UserAgent (WWW::Mechanize is a subclass of LWP::UserAgent, and its get() is a thin wrapper around the inherited method).

The following request is aborted as soon as the first chunk of response content (roughly 1024 bytes) arrives:

use WWW::Mechanize;

sub callback {
    my ($data, $response, $protocol) = @_;

    die "Too much data";
}

my $mech = WWW::Mechanize->new;

my $url = 'http://www.example.com';

$mech->get($url, ':content_cb' => \&callback, ':read_size_hint' => 1024);

print $mech->response()->header('X-Died');

Output:

Too much data at ./mechanize line 12.

Note that the die in the callback does not cause the program itself to die; LWP catches the exception, stops reading the response, and records the message in the X-Died header of the response object. You can add the appropriate logic to your callback to determine under what conditions a request should be aborted.
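As one possible form of that logic, here is a minimal sketch of a callback that keeps a running byte count and aborts only once a limit is exceeded. The name make_size_limiter and the wording of its message are my own, not part of WWW::Mechanize or LWP. Keep in mind that when :content_cb is used, the chunks are handed to the callback rather than stored in the response, so a real callback would also need to save whatever data it wants to keep.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of WWW::Mechanize or LWP): returns a
# callback that dies once more than $max_bytes have arrived in total.
sub make_size_limiter {
    my ($max_bytes) = @_;
    my $received = 0;    # running total across all chunks

    return sub {
        my ($data, $response, $protocol) = @_;

        $received += length $data;
        die "Aborted: more than $max_bytes bytes received\n"
            if $received > $max_bytes;

        return;
    };
}

# Intended use with the $mech and $url from the example above:
#   $mech->get($url, ':content_cb' => make_size_limiter(1_000_000));
#   my $reason = $mech->response->header('X-Died');   # set only if aborted
```

Each call to make_size_limiter returns a fresh closure with its own counter, so create a new one for every request instead of reusing it across URLs.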

Don't even fetch URL if content is too large

Based on your comments, it sounds like what you really want is to never send a GET request in the first place if the content is too large. That is a different problem from aborting a GET request midway through: you can fetch the Content-Length header with a HEAD request and decide what to do based on its value:

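my @urls = qw(http://www.example.com http://www.google.com);

foreach my $url (@urls) {
    # HEAD returns only the headers, not the body
    $mech->head($url);

    if ($mech->success) {
        # Treat a missing Content-Length as 0
        my $length = $mech->response()->header('Content-Length') // 0;

        # Skip anything larger than 1024 bytes
        next if $length > 1024;

        $mech->get($url);
    }
}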

Note that according to the HTTP spec, servers should send the Content-Length header. This does not mean that they will (hence the default value of 0 in my code example). As a consequence, a URL whose response lacks the header is fetched regardless of its actual size.
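If fetching in the unknown-size case is too risky, one option is to treat a missing header as its own case instead of defaulting to 0. The sketch below is only an illustration: plan_for_length and its three return values are hypothetical names, and it assumes that a missing or malformed header should lead to a size-capped GET.

```perl
use strict;
use warnings;

# Hypothetical helper: map a Content-Length header value (possibly undef)
# to one of 'skip', 'fetch', or 'fetch_capped'.
sub plan_for_length {
    my ($length, $max_bytes) = @_;

    # Header missing or not a plain non-negative integer: size is unknown.
    return 'fetch_capped'
        unless defined $length && $length =~ /\A\d+\z/;

    return $length > $max_bytes ? 'skip' : 'fetch';
}
```

In the loop above, 'skip' would correspond to next, 'fetch' to a plain get(), and 'fetch_capped' to a get() with a :content_cb callback that dies once too much data has arrived.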


* According to the documentation, the value is passed to the "protocol module which will try to read data from the server in chunks of this size," so it is a hint and I don't think the chunk size is guaranteed.

Licensed under: CC-BY-SA with attribution