How do I extract an HTML title with Perl?

https://stackoverflow.com/questions/574199

05-09-2019

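Question

Is there a way to extract the HTML page title using Perl? I know the title can be passed as a hidden variable during a form submit and then retrieved in Perl, but I was wondering whether there is a way to do this without the submit.

For example, say I have an HTML page like this: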

<html><head><title>TEST</title></head></html>

And in Perl I want to do:

$q -> h1('something');

How can I replace 'something' dynamically with whatever is contained within the <title> tags?

Solution

I would use pQuery. It works just like jQuery.

You can say:

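use feature 'say';   # say() needs Perl 5.10 or later
use pQuery;

# Fetch the page and select its <title> element
my $page = pQuery("http://google.com/");
my $title = $page->find('title');
say "The title is: ", $title->html;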

Replacing things works similarly:

$title->html('New Title');
say "The entirety of google.com with my new title is: ", $page->html;

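You can also pass an HTML string to the pQuery constructor, which sounds like what you want.

Finally, if you want to use arbitrary HTML as a "template" and then "refine" it with Perl commands, use Template::Refine.

Other tips

HTML::HeadParser does this for you.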
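As a minimal sketch of that suggestion, assuming HTML::HeadParser (part of the HTML-Parser distribution on CPAN) is installed, the following parses the sample page from the question held in a string. The variable names are illustrative only.

```perl
use strict;
use warnings;
use HTML::HeadParser;

# The sample page from the question, held in a string (an assumption:
# in practice it might come from a file or an HTTP response).
my $html = '<html><head><title>TEST</title></head></html>';

my $parser = HTML::HeadParser->new;
$parser->parse($html);

# The parser exposes the <title> contents as a "Title" header.
my $title = $parser->header('Title');

print "The title is: $title\n";
```

The resulting $title could then be handed to the CGI call from the question, as in $q->h1($title).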

It's not clear to me what you are asking. You seem to be talking about something that could run in the user's browser, or at least something that already has an HTML page loaded.

If that is not the case, the answer is URI::Title.

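use strict;
use warnings;
use LWP::Simple;

# Take the URL from the command line
my $url  = shift @ARGV or die "Specify URL on the cmd line\n";
my $html = get($url);
defined $html or die "Could not fetch $url\n";

# Non-greedy, case-insensitive match that may span several lines
if ($html =~ m{<TITLE>(.*?)</TITLE>}is) {
    print "$1\n";
}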

The previous answer is wrong. If the HTML title tag is used more than once, this can easily be overcome by checking that the title tag is valid (no tags located in between).

my ($title) = $test_content =~ m/<title>([a-zA-Z\/][^>]+)<\/title>/si;
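To illustrate the difference, here is a small sketch that runs a greedy pattern and the restricted pattern above against a hypothetical input containing two <title> elements. The sample markup is invented for the demonstration.

```perl
use strict;
use warnings;

# Hypothetical input with two <title> elements (for example, one inside embedded SVG).
my $test_content = '<head><title>First</title></head>'
                 . '<body><svg><title>Second</title></svg></body>';

# Greedy dot: the capture runs from the first <title> to the last </title>.
my ($greedy) = $test_content =~ m/<title>(.+)<\/title>/si;

# Restricted class: the capture cannot contain ">", so it cannot swallow
# a closing tag and ends at the first </title>.
my ($strict) = $test_content =~ m/<title>([a-zA-Z\/][^>]+)<\/title>/si;

print "greedy: $greedy\n";
print "strict: $strict\n";
```

Note that the restricted pattern also requires the title to start with a letter or a slash, so a title beginning with a digit would not match.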

Get the title name from a file.

# Assumes $absPath, the output filehandle WFL, and the patterns
# $startstring / $endstring (for <title> and </title>) are defined elsewhere.
my $spool = 0;

open my $fh, "<", $absPath or die $!;
#open ($fh, "<$tempfile" );
# write the opening bracket
print WFL "[";
while (<$fh>) {
    # remove the newline from the line read
    chomp;
    # remove leading and trailing spaces
    $_ =~ s/^\s+|\s+$//g;

    # case where <title> and </title> occur on one line:
    # print the title and stop immediately
    if (($_ =~ /$startstring/i) && ($_ =~ /$endstring/i)) {
        print WFL "'";
        my ($title) = $_ =~ m/$startstring(.+)$endstring/si;
        print WFL "$title";
        print WFL "',";
        last;
    }
    # case where <title> is on one line and </title> is on another line:
    # the opening <title> string is found on this line
    elsif ($_ =~ /$startstring/i) {
        print WFL "'";
        # extract everything after <title> but nothing before <title>
        my ($title) = $_ =~ m/$startstring(.+)/si;
        print WFL "$title";
        $spool = 1;
    }
    # the closing </title> string is found
    elsif ($_ =~ /$endstring/i) {
        # read everything before </title> and nothing above that
        my ($title) = $_ =~ m/(.+)$endstring/si;
        print WFL " ";
        print WFL "$title";
        print WFL "',";
        $spool = 0;
        last;
    }
    # this is used to read every line between <title> and </title>
    elsif ($spool == 1) {
        print WFL " ";
        print WFL "$_";
    }
}
close $fh;
# end of getting the title name
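For comparison, a more compact variant of the same idea is sketched below: it reads the whole file at once and lets a single regex span the line breaks, instead of tracking state line by line. This is not the answer author's code. The helper name title_from_fh and the in-memory sample file are hypothetical, chosen so the sketch runs on its own.

```perl
use strict;
use warnings;

# Return the <title> text read from an open filehandle, or undef if none is found.
# (title_from_fh is a hypothetical helper name.)
sub title_from_fh {
    my ($fh) = @_;
    my $html = do { local $/; <$fh> };   # slurp the whole file
    return undef unless defined $html;

    # /s lets the dot cross newlines; /i ignores tag case.
    my ($title) = $html =~ m{<title[^>]*>(.*?)</title>}si;
    return undef unless defined $title;

    $title =~ s/^\s+|\s+$//g;   # trim both ends
    $title =~ s/\s+/ /g;        # join the lines with single spaces
    return $title;
}

# Demonstration on an in-memory "file" whose title spans several lines.
my $page = "<html><head>\n<title>\n  My\n  Page\n</title>\n</head></html>\n";
open my $in, '<', \$page or die "Cannot open in-memory file: $!";
my $found = title_from_fh($in);
close $in;

print "Title: $found\n";
```

Slurping trades memory for simplicity, which is usually acceptable for a single HTML page.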

To extract the page title you can use a regular expression. I believe it would be something like this:

my ($title) = $html =~ m/<title>(.+)<\/title>/si;

Here the HTML page is stored in the string $html. In si, the s stands for single-line mode (i.e., the dot also matches a newline) and the i for ignore case.
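The effect of the two modifiers can be seen in a short sketch. The sample string is hypothetical: it uses upper-case tags and a title split across two lines, and the second pattern uses a non-greedy .+? as a small variation on the pattern above.

```perl
use strict;
use warnings;

# Hypothetical sample page: upper-case tags and a title broken across lines.
my $html = "<HTML><HEAD><TITLE>Hello\nWorld</TITLE></HEAD></HTML>";

# With /i only, the dot stops at the newline, so this match fails.
my ($without_s) = $html =~ m/<title>(.+)<\/title>/i;

# With /s the dot also matches the newline; /i still ignores tag case.
my ($with_s) = $html =~ m/<title>(.+?)<\/title>/si;

# Flatten the embedded newline for display.
(my $flat = $with_s) =~ s/\s+/ /g;

print defined $without_s ? "without /s: $without_s\n" : "without /s: no match\n";
print "with /si: $flat\n";
```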

License: CC-BY-SA with attribution

Not affiliated with StackOverflow