EDIT
It seems that we missed the real question, how to post forms. Oops sorry about that.
Posting forms depends on which version of Mojolicious you are using. Until recently (v3.85 -- 2013-02-13) there was a post_form
method. On reflection however, it was decided there should either be *_form
methods for every request type, or we should do something smarter, and thus the form
generator was born.
$response_vt = $ua->post(
$url,
form => {'apikey' => $key, 'resource' => $md5},
sub { ... }
);
It can be added to any request method, making it much more consistent than the old form. Also note that it should be a hashref, not an arrayref as LWP allows. BTW there is also a json
generator that works like this too, or you can even add your own!
I'm leaving my original answer, showing non-blocking usage, which you may now amend given the above.
ORIGINAL
Building off the logic from creaktive, this is how I would start. The major difference is that there isn't a monitor watching to be sure that there are works going, rather when one finishes it checks to be sure that there are no idlers.
I have also made some changes in the parsing logic, but nothing major.
#!/usr/bin/env perl
use Mojo::Base -strict;
use utf8::all;
use Mojo::URL;
use Mojo::UserAgent;
# FIFO queue
my @urls = qw(
http://sysd.org/page/1/
http://sysd.org/page/2/
http://sysd.org/page/3/
);
# User agent following up to 5 redirects
my $ua = Mojo::UserAgent
->new(max_redirects => 5)
->detect_proxy;
start_urls($ua, \@urls, \&get_callback);
sub start_urls {
my ($ua, $queue, $cb) = @_;
# Limit parallel connections to 4
state $idle = 4;
state $delay = Mojo::IOLoop->delay(sub{say @$queue ? "Loop ended before queue depleated" : "Finished"});
while ( $idle and my $url = shift @$queue ) {
$idle--;
print "Starting $url, $idle idle\n\n";
$delay->begin;
$ua->get($url => sub{
$idle++;
print "Got $url, $idle idle\n\n";
$cb->(@_, $queue);
# refresh worker pool
start_urls($ua, $queue, $cb);
$delay->end;
});
}
# Start event loop if necessary
$delay->wait unless $delay->ioloop->is_running;
}
sub get_callback {
my ($ua, $tx, $queue) = @_;
# Parse only OK HTML responses
return unless
$tx->res->is_status_class(200)
and $tx->res->headers->content_type =~ m{^text/html\b}ix;
# Request URL
my $url = $tx->req->url;
say "Processing $url";
parse_html($url, $tx, $queue);
}
sub parse_html {
my ($url, $tx, $queue) = @_;
state %visited;
my $dom = $tx->res->dom;
say $dom->at('html title')->text;
# Extract and enqueue URLs
$dom->find('a[href]')->each(sub{
# Validate href attribute
my $link = Mojo::URL->new($_->{href});
return unless eval { $link->isa('Mojo::URL') };
# "normalize" link
$link = $link->to_abs($url)->fragment(undef);
return unless grep { $link->protocol eq $_ } qw(http https);
# Don't go deeper than /a/b/c
return if @{$link->path->parts} > 3;
# Access every link only once
return if $visited{$link->to_string}++;
# Don't visit other hosts
return if $link->host ne $url->host;
push @$queue, $link;
say " -> $link";
});
say '';
return;
}