Preg_match_all <a href

https://stackoverflow.com/questions/1519696

19-09-2019

Question

Hi, I want to extract links such as <a href="/portal/clients/show/entityId/2121" > and I want a regex that gives me /portal/clients/show/entityId/2121. The number at the end (2121) is different in other links.

Solution

A regex for parsing links looks something like this:

'/<a\s+(?:[^"'>]+|"[^"]*"|'[^']*')*href=("[^"]+"|'[^']+'|[^<>\s]+)/i'

Given how horrible that is, I would recommend using Simple HTML DOM, at least for getting the links. You can then check each link with a very basic regex on its href.
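As a rough sketch of that second step, assuming the hrefs have already been collected by a parser, a basic pattern can check each href and capture the trailing id. The helper name `extractEntityId` and the sample hrefs are illustrative and not part of the original answer.

```php
<?php
// Hypothetical helper: returns the numeric id from an href of the form
// /portal/clients/show/entityId/<digits>, or null when the href does not match.
function extractEntityId(string $href): ?int
{
    // '#' is the delimiter, so the slashes in the path need no escaping.
    if (preg_match('#^/portal/clients/show/entityId/(\d+)$#', $href, $m) === 1) {
        return (int) $m[1];
    }
    return null;
}

// Sample hrefs (assumed input, e.g. collected by an HTML parser).
$hrefs = ['/portal/clients/show/entityId/2121', '/about', '/portal/clients/show/entityId/77'];
foreach ($hrefs as $href) {
    $id = extractEntityId($href);
    echo $href, ' => ', $id === null ? 'no match' : $id, "\n";
}
```

Anchoring the pattern with `^` and `$` keeps it from accepting hrefs that merely contain the path somewhere in the middle.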

Other tips

Simple PHP HTML DOM Parser example:

// Create DOM from string
$html = str_get_html($links);

// or create it from a URL (the scheme is required)
$html = file_get_html('http://www.example.com/');

foreach($html->find('a') as $link) {
    echo $link->href . '<br />';
}

Don't use regular expressions to process XML/HTML. This is very easy to do with the DOM parser:

$doc = new DOMDocument();
$doc->loadHTML($htmlAsString);
$xpath = new DOMXPath($doc);
$nodeList = $xpath->query('//a/@href');
for ($i = 0; $i < $nodeList->length; $i++) {
    # Xpath query for attributes gives a NodeList containing DOMAttr objects.
    # http://php.net/manual/en/class.domattr.php
    echo $nodeList->item($i)->value . "<br/>\n";
}
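Building on the same DOM approach, the XPath query itself can narrow the result to the links from the question. The following is a minimal sketch under assumptions: the function name `findEntityLinks` and the sample markup are made up for illustration, and `libxml_use_internal_errors` is used only to silence warnings on untidy markup.

```php
<?php
// Sample markup (assumed); the second link does not have the entityId path.
$htmlAsString = '<p><a href="/portal/clients/show/entityId/2121" >A</a>'
    . '<a href="/other">B</a>'
    . '<a class="x" href="/portal/clients/show/entityId/3030">C</a></p>';

// Hypothetical helper: returns the href of every <a> whose href starts with
// /portal/clients/show/entityId/.
function findEntityLinks(string $html): array
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $query = '//a[starts-with(@href, "/portal/clients/show/entityId/")]/@href';
    $links = [];
    foreach ($xpath->query($query) as $attr) {
        // Each result is a DOMAttr; its value is the href string.
        $links[] = $attr->value;
    }
    return $links;
}

print_r(findEntityLinks($htmlAsString));
```

Because the filter lives in the XPath expression, attribute order and extra whitespace inside the tag do not matter.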

When "parsing" HTML I mostly rely on phpQuery (http://code.google.com/p/phpquery/) rather than regex.

This is my solution:

<?php
// get links
$website = file_get_contents("http://www.example.com"); // download contents of www.example.com
// capture the href value of every <a href="..."> tag (double-quoted hrefs only)
preg_match_all('/<a\s+href="(.+?)"/i', $website, $matches);

// $matches[0] holds the full matches and $matches[1] holds only the captured URLs,
// so no str_replace cleanup is needed

// output all matches
print_r($matches[1]);
?>

Since you can't always know whether a document or website is well formed, I recommend avoiding XML-based parsers.

Regards

Parsing links out of HTML can be done with an HTML parser.

Once you have all the links, simply get the index of the last forward slash and you have your number. No regex needed.
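The last-slash idea can be sketched with `strrpos` and `substr`. The helper name `lastPathSegment` is hypothetical, and the sketch assumes the href ends with the number and has no trailing slash or query string.

```php
<?php
// Hypothetical helper: returns everything after the last '/' in an href,
// or null when the href contains no slash at all.
function lastPathSegment(string $href): ?string
{
    $pos = strrpos($href, '/');
    if ($pos === false) {
        return null;
    }
    return substr($href, $pos + 1);
}

echo lastPathSegment('/portal/clients/show/entityId/2121'), "\n";
```

If other kinds of links can appear in the page, the result may not be numeric, so a check such as `ctype_digit` on the returned segment would be a sensible guard.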

License: CC-BY-SA with attribution

Not affiliated with StackOverflow