How can I extract or change links in HTML with Perl?

https://stackoverflow.com/questions/362000

21-08-2019
Question

I have this input text:

<html><head><meta http-equiv="content-type" content="text/html; charset=utf-8"></head><body><table cellspacing="0" cellpadding="0" border="0" align="center" width="603">   <tbody><tr>     <td><table cellspacing="0" cellpadding="0" border="0" width="603">       <tbody><tr>         <td width="314"><img height="61" width="330" src="/Elearning_Platform/dp_templates/dp-template-images/awards-title.jpg" alt="" /></td>         <td width="273"><img height="61" width="273" src="/Elearning_Platform/dp_templates/dp-template-images/awards.jpg" alt="" /></td>       </tr>     </tbody></table></td>   </tr>   <tr>     <td><table cellspacing="0" cellpadding="0" border="0" align="center" width="603">       <tbody><tr>         <td colspan="3"><img height="45" width="603" src="/Elearning_Platform/dp_templates/dp-template-images/top-bar.gif" alt="" /></td>       </tr>       <tr>         <td background="/Elearning_Platform/dp_templates/dp-template-images/left-bar-bg.gif" width="12"><img height="1" width="12" src="/Elearning_Platform/dp_templates/dp-template-images/left-bar-bg.gif" alt="" /></td>         <td width="580"><p>&nbsp;what y all heard?</p><p>i'm shark oysters.</p>             <p>&nbsp;</p>             <p>&nbsp;</p>             <p>&nbsp;</p>             <p>&nbsp;</p>             <p>&nbsp;</p>             <p>&nbsp;</p></td>         <td background="/Elearning_Platform/dp_templates/dp-template-images/right-bar-bg.gif" width="11"><img height="1" width="11" src="/Elearning_Platform/dp_templates/dp-template-images/right-bar-bg.gif" alt="" /></td>       </tr>       <tr>         <td colspan="3"><img height="31" width="603" src="/Elearning_Platform/dp_templates/dp-template-images/bottom-bar.gif" alt="" /></td>       </tr>     </tbody></table></td>   </tr> </tbody></table> <p>&nbsp;</p></body></html>

As you can see, there are no newlines in this chunk of HTML text, and I need to find all the image links, copy the files to a directory, and change each link inside the text to something like ./images/file_name.

Currently, the Perl code I am using looks like this:

my ($old_src,$new_src,$folder_name);
    foreach my $record (@readfile) {
        ## so the if else case for the url replacement block below will be correct
        $old_src = "";
        $new_src = "";
        if ($record =~ /\<img(.+)/){
            if($1=~/src=\"((\w|_|\\|-|\/|\.|:)+)\"/){
                $old_src = $1;
                my @tmp = split(/\/Elearning/,$old_src);
                $new_src = "/media/www/vprimary/Elearning".$tmp[-1];
                push (@images, $new_src);
                $folder_name = "images";
            }## end if
        }
        elsif($record =~ /background=\"(.+\.jpg)/){
            $old_src = $1;
            my @tmp = split(/\/Elearning/,$old_src);
            $new_src = "/media/www/vprimary/Elearning".$tmp[-1];
            push (@images, $new_src);
            $folder_name = "images";
        }
        elsif($record=~/\<iframe(.+)/){
            if($1=~/src=\"((\w|_|\\|\?|=|-|\/|\.|:)+)\"/){
                $old_src = $1;
                my @tmp = split(/\/Elearning/,$old_src);
                $new_src = "/media/www/vprimary/Elearning".$tmp[-1];
                ## remove the ?rand behind the html file name
                if($new_src=~/\?rand/){
                    my ($fname,$rand) = split(/\?/,$new_src);
                    $new_src = $fname;
                    my ($fname,$rand) = split(/\?/,$old_src);
                    $old_src = $fname."\\?".$rand;
                }
        print "old_src::$old_src\n"; ##s7test
        print "new_src::$new_src\n\n"; ##s7test
                push (@iframes, $new_src);
                $folder_name = "iframes";
            }## end if
        }## end if

        my $new_record = $record;
        if($old_src && $new_src){
            $new_record =~ s/$old_src/$new_src/ ;
    print "new_record:$new_record\n"; ##s7test
            my @tmp = split(/\//,$new_src);
            $new_record =~ s/$new_src/\.\\$folder_name\\$tmp[-1]/;
##  print "new_record2:$new_record\n\n"; ##s7test
        }## end if
        print WRITEFILE $new_record;
    } # foreach

This is only sufficient for HTML text that contains newlines. I thought of simply looping over the regex match, but then I would have to change the matched line to some other text.

Do you have any idea whether there is an elegant Perl way to do this? Or maybe I am just too dumb to see the obvious way to do it, besides knowing that adding the global option does not work.

Thanks. ~ Steve

Solution

If you need to avoid any additional module, such as an HTML parser, you could try:

while ($string =~ m{<\s*(?:img|iframe)[^>]+src\s*=\s*"([\w\\/.:?=-]+)"|background\s*=\s*"([^">]+\.jpg)"}g) {
    # $1 holds an img/iframe src, $2 a background image
    my $old_src = defined $1 ? $1 : $2;
    my @tmp     = split m{/Elearning}, $old_src;
    my $new_src = "/media/www/vprimary/Elearning" . $tmp[-1];
    if ($new_src =~ /\?rand/) {
        $new_src =~ s/\?.*//;    # remove ?rand... before storing
        push @iframes, $new_src;
    }
    else {
        push @images, $new_src;
    }
}

That way, you apply the regex to the whole source (newlines included) and get more compact code. It also takes into account any extra whitespace between attributes and their values.
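To also change the links inside the text, one possible approach is a global substitution with the /e modifier over the whole string. The following is only a minimal sketch, not part of the original answer: the function name rewrite_links and the choice to keep only the file name (and drop any query string such as ?rand) are assumptions made for illustration. Being a plain regex, it would also touch attributes such as data-src and text that merely looks like an attribute.

```perl
use strict;
use warnings;
use File::Basename qw(basename);

# Hypothetical helper: rewrite every src/background value in $html to
# ./$folder/<file name>. Returns the new HTML and a hash ref that maps
# each old link to its replacement.
sub rewrite_links {
    my ($html, $folder) = @_;
    my %new_for;
    $html =~ s{\b(src|background)(\s*=\s*)"([^"]+)"}{
        my ($attr, $eq, $old) = ($1, $2, $3);   # copy before the nested s///
        (my $path = $old) =~ s/\?.*//s;         # drop a query string such as ?rand=...
        my $new = "./$folder/" . basename($path);
        $new_for{$old} = $new;
        qq{$attr$eq"$new"};
    }gie;
    return ($html, \%new_for);
}
```

Called as my ($new_html, $map) = rewrite_links($string, 'images');, the keys of %$map are the original links, which is the list of files that still has to be copied.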

Other tips

There are excellent HTML parsers for Perl; learn to use them and stick with that. HTML is complex: it allows > in attributes, makes heavy use of nesting, etc. Using regexes to parse it, beyond very simple tasks (or machine-generated code), is prone to problems.

I think you want my HTML::SimpleLinkExtor module:
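As one illustration of the parser route, here is a minimal sketch using HTML::TokeParser, which ships with the HTML-Parser distribution on CPAN and is assumed to be installed. The function name extract_sources is hypothetical, and the answer above does not name a specific parser, so treat this as one option among several.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: collect the src of every img/iframe tag and every
# background attribute, in document order.
sub extract_sources {
    my ($html) = @_;
    # A reference to a scalar is parsed as a literal document
    my $parser = HTML::TokeParser->new(\$html)
        or die "Cannot create parser";
    my @sources;
    # get_tag() without arguments returns each start and end tag in turn
    while (my $tag = $parser->get_tag) {
        my ($name, $attr) = @$tag;
        next if $name =~ m{^/};    # end tags carry no attribute hash
        if (($name eq 'img' || $name eq 'iframe') && defined $attr->{src}) {
            push @sources, $attr->{src};
        }
        push @sources, $attr->{background} if defined $attr->{background};
    }
    return @sources;
}
```

Because the parser tokenizes the document itself, it does not matter whether the HTML arrives on one line or many.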

use HTML::SimpleLinkExtor;

my $extor = HTML::SimpleLinkExtor->new;
$extor->parse_file( $file );

my @imgs = $extor->img;

I'm not sure exactly what you are trying to do, but it certainly sounds like one of the HTML parsing modules should do the trick if mine doesn't.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow
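Once a list of links such as @imgs is available, the copying step from the question could be sketched with core modules only. The function name copy_linked_files and its parameters are hypothetical. It assumes every link is a site-absolute path that resolves under a document root on disk, such as the /media/www/vprimary prefix used in the question's code.

```perl
use strict;
use warnings;
use File::Basename qw(basename);
use File::Copy qw(copy);
use File::Path qw(make_path);

# Hypothetical helper: copy each linked file from under $doc_root into
# $dest_dir. Returns a hash ref of link => "./<dest dir name>/<file name>"
# for the files that were copied successfully.
sub copy_linked_files {
    my ($links, $doc_root, $dest_dir) = @_;
    make_path($dest_dir);
    my %new_for;
    for my $link (@$links) {
        my $name   = basename($link);
        my $source = "$doc_root$link";
        unless (copy($source, "$dest_dir/$name")) {
            warn "Cannot copy $source: $!\n";
            next;                   # leave this link unmapped
        }
        $new_for{$link} = './' . basename($dest_dir) . "/$name";
    }
    return \%new_for;
}
```

For example, copy_linked_files(\@imgs, '/media/www/vprimary', './images') would return the old-to-new mapping needed to update the links in the text. Files that share a base name would overwrite each other in this sketch.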