How does one -- in Perl -- stream a list of URLs from a file into an array to then recursively acquire all of their HTML data in a single file?

StackOverflow https://stackoverflow.com/questions/22160171

Question

Another laborious title... Sorry... Anyway, I've got a file called mash.txt with a bunch of URLs like this in it:

http://www...
http://www...
http://www...
.
.
.
So, at this point, I'd like to read these URLs into an array (ideally without having to declare much along the way), then loop over them, fetch the HTML from each one, and append it all to a single file, which I guess will have to be created. Anyhow, thanks in advance.


Actually, to be completely forthcoming, what I really want is to pull out just the value attribute of each option tag in each HTML page and write those to the output file, so I don't end up with all that garbage. That is, each of these

http://www...

will produce something like this

<!DOCTYPE html>
<HTML>
   <HEAD>
      <TITLE>
         DATA! 
      </TITLE>
   </HEAD>
<BODY>
.
.
.

All I want out of all of these is the value under each option tag that occurs in the HTML of each URL in mash.txt.


Solution

The following fetches the HTML content for each URL in mash.txt, retrieves the value attribute of every option, and stores the distinct values as keys of a single hash. The keys of that hash are then passed as an array to input.template, and the processed output is written to output.html:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::TreeBuilder;
use Template;

my %values;
my $input_file     = 'mash.txt';
my $input_template = 'input.template';
my $output_file    = 'output.html';

# create a new lwp user agent object (our browser).
my $ua = LWP::UserAgent->new( );

# open the input file (mash.txt) for reading.
open my $fh, '<', $input_file or die "cannot open '$input_file': $!";

# iterate through each line (url) in the input file.
while ( my $url = <$fh> )
{
    # strip the trailing newline and skip blank lines.
    chomp $url;
    next unless length $url;

    # get the html contents from url. It returns a handy response object.
    my $response = $ua->get( $url );

    # if we successfully got the html contents from url.
    if ( $response->is_success ) 
    {
        # create a new html tree builder object (our html parser) from the html content.
        my $tb = HTML::TreeBuilder->new_from_content( $response->decoded_content );

        # fetch the value attribute of every option and store each one as a key of the values hash (which de-duplicates them).
        # look_down returns a list of option node objects, which map translates to their value attributes; grep drops options that have none.
        $values{$_} = undef for grep { defined $_ } map { $_->attr( 'value' ) } $tb->look_down( _tag => 'option' );

        # free the parse tree now that we have what we need.
        $tb->delete;
    }
    # else we failed to get the html contents from url.
    else 
    {
        # warn of failure before next iteration (next url).
        warn "could not get '$url': " . $response->status_line;
    }
}

# close the input file since we have finished with it.
close $fh;

# create a new template object (our output processor).
my $tp = Template->new( ) || die Template->error( );

# process the input template (input.template), passing in the collected values (the sorted keys of the values hash), and write the result to the output file (output.html).
$tp->process( $input_template, { values => [ sort keys %values ] }, $output_file ) || die $tp->error( );

__END__
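
If you want to see what the look_down step produces without hitting the network, here is a minimal sketch that runs the same extraction on a small inline page. The HTML snippet and the option_values helper are made up for illustration. Unlike the hash keys in the script above, this variant keeps the values in document order while still skipping duplicates and options that have no value attribute.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# option_values is a hypothetical helper (not part of the script above):
# it returns the distinct value attributes of all option tags, in document order.
sub option_values
{
    my ( $html ) = @_;

    my $tb = HTML::TreeBuilder->new_from_content( $html );

    my ( %seen, @values );
    for my $option ( $tb->look_down( _tag => 'option' ) )
    {
        my $value = $option->attr( 'value' );
        next unless defined $value;    # option without a value attribute.
        next if $seen{$value}++;       # value already collected.
        push @values, $value;
    }

    # free the parse tree.
    $tb->delete;

    return @values;
}

# made-up page standing in for the content of one fetched url.
my $html = <<'HTML';
<html><body>
<select name="fruit">
  <option value="apple">Apple</option>
  <option value="pear">Pear</option>
  <option>No value here</option>
  <option value="apple">Apple again</option>
</select>
</body></html>
HTML

my @values = option_values( $html );
print "$_\n" for @values;
```

For the snippet above this should print apple and pear, one per line.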

input.template could look something like:

<ul>
[% FOREACH value IN values %]
    <li>[% value %]</li>
[% END %]
</ul>
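
Since the question asks specifically about getting the URLs into an array first, here is one way that could look, as a sketch rather than the only approach. The read_urls helper and the example.com URLs are hypothetical; the demo writes a throwaway file with File::Temp so it can run on its own, and in practice you would pass 'mash.txt' instead. It needs Perl 5.14 or later for the /r flag on s///.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw( tempfile );

# read_urls is a hypothetical helper: it returns the non-blank lines of a file,
# with newlines and surrounding whitespace removed, as a list.
sub read_urls
{
    my ( $path ) = @_;

    open my $fh, '<', $path or die "cannot open '$path': $!";
    chomp( my @lines = <$fh> );
    close $fh;

    # trim each line, then drop the ones that are empty.
    return grep { length $_ } map { s/^\s+|\s+$//gr } @lines;
}

# stand-in for mash.txt: two made-up urls, a blank line, and some stray whitespace.
my ( $tmp_fh, $tmp_path ) = tempfile( UNLINK => 1 );
print {$tmp_fh} "http://www.example.com/a\n\n  http://www.example.com/b  \n";
close $tmp_fh;

my @urls = read_urls( $tmp_path );
print "$_\n" for @urls;
```

With the array in hand, the while loop in the script above could become for my $url ( @urls ). Reading line by line from the filehandle, as the script does, avoids holding the whole list in memory, which only matters for very large files.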
Licensed under: CC-BY-SA with attribution