Script to build an HTML page from DIVs extracted from other HTML pages

https://stackoverflow.com/questions/1211718

06-07-2019

Question

I have a set of HTML reports that each contain two DIV elements with specific IDs that I need to strip out and compile into an overall summary report (again, an HTML file).

My initial thought is that this is an ideal job for a Perl script; however, we have no up-to-date in-house Perl skills (we're a .NET C# shop).

Thoughts and suggestions on recommended approaches would be welcomed...

Solution

Use a suitable HTML parser; there's HTML::Parser for Perl, and I'm sure there are several for C# as well.

OTHER TIPS

Using Perl, HTML::TokeParser and HTML::Template can help. Here is a quick example:

#!/usr/bin/perl

use strict;
use warnings;

use HTML::TokeParser;
use HTML::Template;

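my ($html_file) = @ARGV;
die "Usage: $0 report.html\n" unless defined $html_file;

# Decode the input strictly as UTF-8.
open my $html_handle, '<:encoding(UTF-8)', $html_file
    or die "Cannot open '$html_file': $!";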

my $parser = HTML::TokeParser->new( $html_handle );

my @divs;

while ( my $tag = $parser->get_tag('div') ) {
    my $attr = $tag->[1];
    next unless ref $attr eq 'HASH';
    next unless defined( my $id = $attr->{id} );
    next unless $id eq 'div1' or $id eq 'div2';

    my $div = $tag->[-1];
    my $in_wanted = 1;

    while ( $in_wanted ) {
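        my $token = $parser->get_token;
        # get_token returns undef at end of input; stop instead of looping forever.
        unless ( defined $token ) {
            warn "Warning: '$html_file' ended inside a wanted div\n";
            last;
        }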
        if ( $token->[0] eq 'T' ) {
            $div .= $token->[1];
        }
        else {
            $div .= $token->[-1];
        }
        my ($type, $name) = @$token[0, 1];
        if ( $name eq 'div' ) {
            $in_wanted += $type eq 'S' ?  1
                        : $type eq 'E' ? -1
                        : 0;
            next;
        }
        if ( $type eq 'E' and $name eq 'html' ) {
            warn "Warning: Reached the end of '$html_file'\n";
            last;
        }
    }

    push @divs, {DIV => $div};
}

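# The input was decoded from UTF-8, so encode the output the same way.
binmode STDOUT, ':encoding(UTF-8)';
print output( @divs );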

sub output {
    my $tmpl_html = <<EO_TMPL;
<html>
<body>
<TMPL_LOOP DIVS>
    <TMPL_VAR DIV>
</TMPL_LOOP>
</body>
</html>
EO_TMPL
    my $tmpl = HTML::Template->new(
        scalarref => \$tmpl_html,
    );
    $tmpl->param( DIVS => \@_ );
    return $tmpl->output;
}

Straightforward regular expressions may not be enough if your div contains nested divs. The closing tag doesn't carry the ID, so a regexp can't easily tell which </div> closes the div you want.

If your div is:

<div id="findme">
    <!-- No other divs here! -->
</div>

Then you could use a regular expression (just be careful about greediness), such as a more elegant version of this:

<div id="findme">(.*?)</div>

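Note: I haven't tested that regexp, and it has been a while! For a div that spans several lines, the dot also has to match newlines (the /s modifier in Perl, RegexOptions.Singleline in .NET).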
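As a rough illustration of the regexp idea above, here is a minimal Perl sketch. The helper name extract_flat_div and the sample strings are made up for this example. It assumes the id attribute comes first in the tag and is double-quoted. The second call shows why nesting breaks the approach: the non-greedy match stops at the first </div> it sees.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the inner HTML of the first <div id="$id">,
# or undef if there is no match. Only safe when that div has no nested divs.
sub extract_flat_div {
    my ($html, $id) = @_;
    # \Q..\E quotes the id, /s lets "." match newlines, /i ignores tag case.
    return $html =~ m{<div\s+id="\Q$id\E"[^>]*>(.*?)</div>}si ? $1 : undef;
}

# Made-up sample inputs.
my $flat   = qq{<p>intro</p>\n<div id="findme">\n  <b>Total:</b> 42\n</div>\n<p>end</p>};
my $nested = qq{<div id="findme"><div>inner</div> tail</div>};

print extract_flat_div($flat, 'findme'), "\n";
print extract_flat_div($nested, 'findme'), "\n";    # truncated: "<div>inner"
```

For anything beyond this flat case, a parser-based approach is the safer choice.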

I would look into using an HTML parser library to parse the structure and obtain character offsets for the inside of the div, and then take that range from the buffer. Using an HTML library will allow you to parse and find where the div you want ends.
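One way the offset idea could look in Perl is sketched below. It uses HTML::Parser's version 3 handler interface and asks for the offset and offset_end values in the argspec. The helper name extract_div_by_offset and the sample markup are assumptions for illustration. The sample is ASCII-only; how offsets are counted for non-ASCII input should be checked against the HTML::Parser documentation before relying on this.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::Parser;

# Hypothetical helper: return the source text of the first <div id="$id">,
# nested divs included, by recording offsets and slicing the buffer.
sub extract_div_by_offset {
    my ($html, $id) = @_;
    my ($start, $end);
    my $depth = 0;    # open divs, counting the wanted one itself

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag, $attr, $offset) = @_;
            return unless $tag eq 'div';
            if ( $depth ) {
                $depth++;    # a div nested inside the wanted one
            }
            elsif ( !defined $start
                    and defined $attr->{id} and $attr->{id} eq $id ) {
                ( $start, $depth ) = ( $offset, 1 );
            }
        }, 'tagname, attr, offset' ],
        end_h => [ sub {
            my ($tag, $offset_end) = @_;
            return unless $tag eq 'div' and $depth;
            $end = $offset_end if --$depth == 0;    # wanted div just closed
        }, 'tagname, offset_end' ],
    );
    $parser->parse($html);
    $parser->eof;

    return unless defined $start and defined $end;
    return substr $html, $start, $end - $start;
}

# Made-up sample input.
my $html = '<body><p>x</p><div id="div1">a<div>b</div>c</div>'
         . '<div id="div2">z</div></body>';
print extract_div_by_offset($html, $_), "\n" for qw(div1 div2);
```

This returns the whole element, tags included. To get only the inside of the div, request offset_end in the start handler and offset in the end handler instead.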

Something like this tutorial might be useful. These parsers will probably allow you to extract the data enclosed in a tag such as your div accurately.

You can also use a C# HTML parser; they all do a similar job. Just look through the documentation to make sure the parser doesn't only build a tree, but also lets you obtain character offsets for the enclosed div data (so you can extract it) or gives you access to that data directly.

Licensed under: CC-BY-SA with attribution