Question

I'm looking for a regular expression in PHP that matches anchors with a specific target="_parent" attribute. I would like to get anchors with text like:

preg_match_all('<a href="http://" target="_parent">Text here</a>', $subject, $matches, PREG_SET_ORDER);

HTML:

<a href="http://" target="_parent">

    <FONT style="font-size:10pt" color=#000000 face="Tahoma">
        <DIV><B>Text</B> - Text </DIV>
    </FONT>

</a>

</DIV>

Solution

To be honest, the best way would be not to use a regular expression at all. With a regex you are going to miss all kinds of links, especially if you can't be sure that the links are always generated the same way.
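To illustrate the point, here is a minimal sketch of how a hand-written pattern breaks. The pattern and the sample strings are made up for this example (they are not from the question); the pattern assumes that href comes first, that attributes use double quotes, and that the markup is lower case.

```php
<?php
// Deliberately naive pattern: href first, double quotes, lower-case tag names.
$pattern = '~<a href="[^"]*" target="_parent">(.*?)</a>~s';

// Hypothetical variations of the same link.
$samples = array(
    '<a href="http://" target="_parent">Text here</a>',  // same shape as the pattern
    '<a target="_parent" href="http://">Text here</a>',  // attributes swapped
    "<a href='http://' target='_parent'>Text here</a>",  // single quotes
    '<A HREF="http://" TARGET="_parent">Text here</A>',  // upper-case markup
);

$results = array();
foreach ($samples as $sample) {
    $matched = preg_match($pattern, $sample) === 1;
    $results[] = $matched;
    echo ($matched ? 'match:    ' : 'no match: ') . $sample . "\n";
}
```

Only the first sample matches, although a browser treats all four as the same link. Each variation can be patched into the pattern, but the pattern quickly becomes hard to maintain.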

The best way is to use an HTML/XML parser such as DOMDocument:

<?php

$html = '<a href="http://" target="_parent">Text here</a>';

// Returns an array mapping each matching anchor's text to its href.
function extractTags($html) {
    $dom = new DOMDocument;
    libxml_use_internal_errors(true); // because DOM will complain about badly formatted HTML
    $dom->loadHTML($html);
    $sxe = simplexml_import_dom($dom);
    $nodes = $sxe->xpath("//a[@target='_parent']");

    $anchors = array();
    foreach ($nodes as $node) {
        // textContent collects the text of all nested elements
        $anchor = trim((string)dom_import_simplexml($node)->textContent);
        $attribs = $node->attributes();
        $anchors[$anchor] = (string)$attribs->href;
    }

    return $anchors;
}

print_r(extractTags($html));

This will output:

Array (
    [Text here] => http://
)

Even using it on your example:

$html = '<a href="http://" target="_parent">

    <FONT style="font-size:10pt" color=#000000 face="Tahoma">
        <DIV><B>Text</B> - Text </DIV>
    </FONT>

</a>

</DIV>
';
print_r(extractTags($html));

will output:

Array (
    [Text - Text] => http://
)
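Note that extractTags() keys the result by anchor text, so two links with the same text overwrite each other. The following is a rough sketch of a variant that returns a plain list instead and queries the DOM with DOMXPath directly, without the SimpleXML round trip. The function name extractParentAnchors and the 'text'/'href' keys are made up for this example.

```php
<?php
// Hypothetical helper: same lookup as extractTags(), but returns a list of
// text/href pairs so that duplicate link texts are preserved.
function extractParentAnchors($html) {
    $dom = new DOMDocument;

    // Collect parser warnings internally instead of printing them,
    // then restore the previous libxml error setting.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $anchors = array();
    foreach ($xpath->query("//a[@target='_parent']") as $node) {
        $anchors[] = array(
            'text' => trim($node->textContent),
            'href' => $node->getAttribute('href'),
        );
    }

    return $anchors;
}

print_r(extractParentAnchors('<a href="http://" target="_parent">Text here</a>'));
```

Each entry holds the trimmed link text and the href value, in document order.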

If you feel that the HTML is still not clean enough to be used with DOMDocument, I would recommend using a project such as HTMLPurifier (see http://htmlpurifier.org/) to clean up the HTML first (and remove unneeded markup), and then load its output into DOMDocument.

Other suggestions

You should use the DOMDocument class instead of a regex. You will get a lot of false positives if you handle HTML with a regex.

<?php

$html = '<a href="http://" target="_parent">Text here</a>';
$dom = new DOMDocument;
libxml_use_internal_errors(true); // suppress warnings about badly formatted HTML
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $tag) {
    if ($tag->getAttribute('target') === '_parent') {
       echo $tag->nodeValue;
    }
}

Output:

Text here
Licensed under: CC-BY-SA with attribution