How can I extract or change links in HTML with Perl?

https://stackoverflow.com/questions/362000

21-08-2019

Question

I have this input text:

<html><head><meta http-equiv="content-type" content="text/html; charset=utf-8"></head><body><table cellspacing="0" cellpadding="0" border="0" align="center" width="603">   <tbody><tr>     <td><table cellspacing="0" cellpadding="0" border="0" width="603">       <tbody><tr>         <td width="314"><img height="61" width="330" src="/Elearning_Platform/dp_templates/dp-template-images/awards-title.jpg" alt="" /></td>         <td width="273"><img height="61" width="273" src="/Elearning_Platform/dp_templates/dp-template-images/awards.jpg" alt="" /></td>       </tr>     </tbody></table></td>   </tr>   <tr>     <td><table cellspacing="0" cellpadding="0" border="0" align="center" width="603">       <tbody><tr>         <td colspan="3"><img height="45" width="603" src="/Elearning_Platform/dp_templates/dp-template-images/top-bar.gif" alt="" /></td>       </tr>       <tr>         <td background="/Elearning_Platform/dp_templates/dp-template-images/left-bar-bg.gif" width="12"><img height="1" width="12" src="/Elearning_Platform/dp_templates/dp-template-images/left-bar-bg.gif" alt="" /></td>         <td width="580"><p>&nbsp;what y all heard?</p><p>i'm shark oysters.</p>             <p>&nbsp;</p>             <p>&nbsp;</p>             <p>&nbsp;</p>             <p>&nbsp;</p>             <p>&nbsp;</p>             <p>&nbsp;</p></td>         <td background="/Elearning_Platform/dp_templates/dp-template-images/right-bar-bg.gif" width="11"><img height="1" width="11" src="/Elearning_Platform/dp_templates/dp-template-images/right-bar-bg.gif" alt="" /></td>       </tr>       <tr>         <td colspan="3"><img height="31" width="603" src="/Elearning_Platform/dp_templates/dp-template-images/bottom-bar.gif" alt="" /></td>       </tr>     </tbody></table></td>   </tr> </tbody></table> <p>&nbsp;</p></body></html>

As you can see, there's no newline in this chunk of HTML text, and I need to look for all image links inside, copy them out to a directory, and change the line inside the text to something like ./images/file_name.

Currently, the Perl code that I'm using looks like this:

my ($old_src,$new_src,$folder_name);
    foreach my $record (@readfile) {
        ## so the if else case for the url replacement block below will be correct
        $old_src = "";
        $new_src = "";
        if ($record =~ /\<img(.+)/){
            if($1=~/src=\"((\w|_|\\|-|\/|\.|:)+)\"/){
                $old_src = $1;
                my @tmp = split(/\/Elearning/,$old_src);
                $new_src = "/media/www/vprimary/Elearning".$tmp[-1];
                push (@images, $new_src);
                $folder_name = "images";
            }## end if
        }
        elsif($record =~ /background=\"(.+\.jpg)/){
            $old_src = $1;
            my @tmp = split(/\/Elearning/,$old_src);
            $new_src = "/media/www/vprimary/Elearning".$tmp[-1];
            push (@images, $new_src);
            $folder_name = "images";
        }
        elsif($record=~/\<iframe(.+)/){
            if($1=~/src=\"((\w|_|\\|\?|=|-|\/|\.|:)+)\"/){
                $old_src = $1;
                my @tmp = split(/\/Elearning/,$old_src);
                $new_src = "/media/www/vprimary/Elearning".$tmp[-1];
                ## remove the ?rand behind the html file name
                if($new_src=~/\?rand/){
                    my ($fname,$rand) = split(/\?/,$new_src);
                    $new_src = $fname;
                    my ($fname,$rand) = split(/\?/,$old_src);
                    $old_src = $fname."\\?".$rand;
                }
        print "old_src::$old_src\n"; ##s7test
        print "new_src::$new_src\n\n"; ##s7test
                push (@iframes, $new_src);
                $folder_name = "iframes";
            }## end if
        }## end if

        my $new_record = $record;
        if($old_src && $new_src){
            $new_record =~ s/$old_src/$new_src/ ;
    print "new_record:$new_record\n"; ##s7test
            my @tmp = split(/\//,$new_src);
            $new_record =~ s/$new_src/\.\\$folder_name\\$tmp[-1]/;
##  print "new_record2:$new_record\n\n"; ##s7test
        }## end if
        print WRITEFILE $new_record;
    } # foreach

This is only sufficient for HTML text that contains newlines. I thought about simply looping over the regex match, but then I would have to change each matched line to some other text.

Do you have any idea if there is an elegant Perl way to do this? Or maybe I'm just too dumb to see the obvious way of doing it. I also know that adding the global option doesn't work.

thanks. ~steve

Solution

If you must avoid any additional module, like an HTML parser, you could try:

my (@images, @iframes);
while ($string =~ m/(?:<\s*(?:img|iframe)[^>]+src\s*=\s*"((?:\w|\\|\?|=|-|\/|\.|:)+)"|background\s*=\s*"([^">]+\.jpg))/g) {
    # $1 holds an img/iframe src, $2 a background image path
    my $old_src = defined $1 ? $1 : $2;
    my @tmp     = split(/\/Elearning/, $old_src);
    my $new_src = "/media/www/vprimary/Elearning" . $tmp[-1];
    if ($new_src =~ /\?rand/) {
        # remove the ?rand suffix and push into @iframes
        $new_src =~ s/\?.*//;
        push @iframes, $new_src;
    }
    else {
        push @images, $new_src;
    }
}

That way, you would apply this regex to the whole source (newlines included) and end up with more compact code. It also takes into account any extra whitespace between attribute names and their values.
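To rewrite the links in place rather than only collect them, the same idea can be combined with a global substitution. The following is a minimal sketch, not the original answer's code: the sample input, the relocate helper, and the ./images/ target are assumptions for illustration, and it only looks at img src and background attributes.

```perl
use strict;
use warnings;
use File::Basename qw(basename);

# Hypothetical sample input: a single line containing several tags.
my $html = '<td background="/Elearning_Platform/t/left.jpg">'
         . '<img src="/Elearning_Platform/t/a.jpg" alt="" /></td>'
         . '<img height="1" src="/Elearning_Platform/t/b.gif" />';

my @found;    # original paths, in document order

# Hypothetical helper: remember the old path, return the rewritten attribute.
sub relocate {
    my ($prefix, $path) = @_;
    push @found, $path;
    return $prefix . '="./images/' . basename($path) . '"';
}

# /g visits every match in the string, /e runs the replacement as code,
# /x allows whitespace inside the pattern for readability.
$html =~ s{ ( <\s*img\b[^>]*?\bsrc | \bbackground ) \s*=\s* "([^"]+)" }
          { relocate($1, $2) }gex;

print "$_\n" for @found;
print "$html\n";
```

Because the substitution replaces each match as it goes, there is no need to loop manually or to mark lines that were already handled. @found then holds the paths that still have to be copied.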

OTHER TIPS

There are excellent HTML parsers for Perl; learn to use them and stick with them. HTML is complex: it allows > in attributes, uses nesting heavily, etc. Using regexes to parse it, beyond very simple tasks (or machine-generated code), is prone to problems.
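As a small illustration of the > problem, here is a sketch using a regex of the same shape as the tag-scanning ones above. The two sample tags are made up for the demonstration.

```perl
use strict;
use warnings;

# Skips over the other attributes with [^>]+, as the regexes above do.
my $re = qr/<\s*img[^>]+src\s*=\s*"([^"]+)"/;

my $plain  = '<img alt="logo" src="/a/logo.jpg">';
my $tricky = '<img alt="a > b" src="/a/logo.jpg">';    # > inside an attribute value

for my $tag ($plain, $tricky) {
    my ($src) = $tag =~ $re;    # undef when the pattern does not match
    printf "%-40s => %s\n", $tag, defined $src ? $src : '(no match)';
}
```

The first tag yields its src, while the second is missed entirely because [^>]+ stops at the > inside the alt text before it ever reaches src.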

I think you want my HTML::SimpleLinkExtor module:

use HTML::SimpleLinkExtor;

my $extor = HTML::SimpleLinkExtor->new;
$extor->parse_file( $file );

my @imgs = $extor->img;

I'm not sure what exactly you're trying to do, but it surely sounds like one of the HTML parsing modules should do the trick if mine doesn't.
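Once the links are extracted, by whichever module, copying the files needs only core modules. This is a rough sketch under stated assumptions: copy_images is a hypothetical helper, $doc_root is assumed to be the directory the site-relative links resolve against, and the ./images/ prefix is taken from the question.

```perl
use strict;
use warnings;
use File::Basename qw(basename);
use File::Copy     qw(copy);
use File::Path     qw(make_path);
use File::Spec;

# Hypothetical helper: copy each linked image from under $doc_root into
# $dest_dir and return a hash mapping old link => new relative link.
sub copy_images {
    my ($doc_root, $dest_dir, @links) = @_;
    make_path($dest_dir);    # no error if the directory already exists
    my %new_for;
    for my $link (@links) {
        my $name   = basename($link);
        my $source = File::Spec->catfile($doc_root, $link);
        my $target = File::Spec->catfile($dest_dir, $name);
        if (copy($source, $target)) {
            $new_for{$link} = "./images/$name";
        }
        else {
            warn "could not copy $source: $!\n";
        }
    }
    return %new_for;
}
```

A call such as copy_images('/media/www/vprimary', './images', @imgs) would then return the old-to-new mapping needed to rewrite the HTML. Note that two images with the same file name in different directories would overwrite each other in this sketch.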

Licensed under: CC-BY-SA with attribution
