Question

For example, I have the HTML:

<strong>this one</strong> <span>test one</span>
<strong>this two</strong> <span>test two</span>
<strong>this three</strong> <span>test three</span>

How can I get all the text inside the strong and span elements with a regex?

Solution

Use a DOM parser; don't use regular expressions to parse HTML.

$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('strong') as $tag) {
    echo $tag->nodeValue."<br>";
}
foreach ($dom->getElementsByTagName('span') as $tag) {
    echo $tag->nodeValue."<br>";
}

Output:

this one
this two
this three
test one
test two
test three


Why shouldn't I use regular expressions to parse HTML content?

HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts. Even enhanced irregular regular expressions as used by Perl are not up to the task of parsing HTML.

That passage comes from an article by Jeff Atwood.
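
As a rough illustration of how this plays out in practice, the sketch below runs a simple tag pattern and DOMDocument against a hypothetical variation of the question's markup, with an attribute and a nested em element added. The input string is an assumption made up for this example; it is not from the question.

```php
<?php
// Hypothetical input: the question's markup with an attribute and a nested tag added.
$html = '<strong class="a">this <em>one</em></strong> <span>test one</span>';

// The simple pattern expects a bare <strong> tag with no child elements,
// so it finds nothing in this input.
$count = preg_match_all('/<strong>([^<]*)<\/strong>/', $html, $m);
echo "regex matches: $count", PHP_EOL;

// A DOM parser copes with both the attribute and the nested element.
$dom = new DOMDocument;
$dom->loadHTML($html);
$text = $dom->getElementsByTagName('strong')->item(0)->textContent;
echo "DOM text: $text", PHP_EOL;
```

The pattern could be extended to cover attributes and nesting, but each new variation in the markup needs another special case, which is the practical side of the argument above.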

OTHER TIPS

Use DOMDocument to load the HTML string and then use an XPath expression to get the required values:

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

foreach ($xpath->query('//strong | //span') as $node) {
    echo $node->nodeValue, PHP_EOL;
}

Output:

this one
test one
this two
test two
this three
test three

If you do want a regex, you can use capture groups. Here are some examples:

<strong>([^\<]*)<\/strong>

Demo: http://regex101.com/r/sK5uF2

And

<span>([^\<]*)<\/span>

Demo: http://regex101.com/r/vJ2kP3

In each of these, the first capture group contains your text: \1 or $1.
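
For completeness, here is a minimal sketch of how those two patterns might be applied in PHP with preg_match_all. The variable names ($strong, $span, $strongTexts, $spanTexts) are chosen for this example, and it assumes the markup is exactly as simple as in the question: no attributes and no nested tags.

```php
<?php
// Sample input copied from the question.
$html = <<<HTML
<strong>this one</strong> <span>test one</span>
<strong>this two</strong> <span>test two</span>
<strong>this three</strong> <span>test three</span>
HTML;

// Apply each pattern to the whole string; index 1 holds the first capture group.
preg_match_all('/<strong>([^\<]*)<\/strong>/', $html, $strong);
preg_match_all('/<span>([^\<]*)<\/span>/', $html, $span);

$strongTexts = $strong[1]; // text of every strong element
$spanTexts = $span[1];     // text of every span element

foreach (array_merge($strongTexts, $spanTexts) as $text) {
    echo $text, PHP_EOL;
}
```

This prints the three strong texts followed by the three span texts. It only holds up while the input stays this regular, so the DOM-based answers are the safer choice for real pages.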

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow