Question

I am new to regular expressions. I want to fetch some data from a web page source. I used file_get_contents("url") to get the page's HTML source. Now I want to capture a portion within some special tags.

I found that preg_match_all() works for this. I would like help solving my problem and, if possible, advice on how to approach similar problems.

In the example below, how would I get the data within the <ul>? (I wish this sample HTML code could be easier for me to understand.)

<div class="a_a">qqqqq<span>www</span> </div>
<ul>
<li>
    <div class="a_a"><h3>aaaa</h3> aaaa aaaaa</div>
</li>
<li>
    <div class="b_b">bbbbb <span class="s-s">bbbb</span> bbbb</div>
</li>
<li>
    <div class="c_c d-d">cccc cccc ccccc</div>
</li>
</ul>
<table>
<tr>
    <td>sdsdf</td>
    <td>hjhjhj</td>
</tr>
<tr>
    <td>yuyuy</td>
    <td>ertre</td>
</tr>   
</table>

Solution

As the comments already stated, it's generally not recommended to parse HTML with regex. In my opinion, it depends on what exactly you're going to do.


If you want to use regex and know that there are no nested tags of the same kind, the simplest pattern for getting everything between <ul> and the closest </ul> would be:

$pattern = '~<ul>(.*?)</ul>~s';

It matches <ul>, followed by as few characters of any kind as possible, up to the next </ul>. The dot is a metacharacter that matches any single character except a newline (\n). To make it match newlines too, I put the s modifier after the ending delimiter ~. The quantifier * means zero or more times.

By default, quantifiers are greedy, which means they eat up as much as possible. A question mark ? after the * makes it non-greedy (or lazy), so it matches as few characters as possible before reaching </ul>. As the pattern delimiter I chose the tilde ~.
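To make the greedy/lazy difference concrete, here is a minimal sketch. The input string is a made-up example with two separate lists, not the HTML from the question.

```php
<?php
// Hypothetical sample input with two separate lists (not from the question).
$html = "<ul><li>one</li></ul><p>gap</p><ul><li>two</li></ul>";

// Greedy: .* runs to the end of the string, then backtracks to the LAST </ul>.
preg_match('~<ul>(.*)</ul>~s', $html, $greedy);

// Lazy: .*? grows one character at a time and stops at the FIRST </ul>.
preg_match('~<ul>(.*?)</ul>~s', $html, $lazy);

echo $greedy[1], "\n"; // <li>one</li></ul><p>gap</p><ul><li>two</li>
echo $lazy[1], "\n";   // <li>one</li>
```

The greedy version swallows both lists and everything between them, which is usually not what you want when a page contains more than one <ul>.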

preg_match_all($pattern, $html, $out);

The matches can be found in the output variable that you pass to preg_match or preg_match_all, where [0] contains everything that matches the whole pattern, [1] the first captured parenthesized subpattern, and so on.
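As a rough end-to-end sketch, this runs the simple pattern against a shortened copy of the HTML from the question. The shortening is mine; in practice $html would come from file_get_contents().

```php
<?php
// Shortened copy of the sample HTML from the question.
$html = <<<'HTML'
<div class="a_a">qqqqq<span>www</span> </div>
<ul>
<li>
    <div class="a_a"><h3>aaaa</h3> aaaa aaaaa</div>
</li>
</ul>
<table>
<tr>
    <td>sdsdf</td>
</tr>
</table>
HTML;

$pattern = '~<ul>(.*?)</ul>~s';
$count = preg_match_all($pattern, $html, $out);

// $out[0][0]: the whole match, including the <ul> and </ul> tags.
// $out[1][0]: only what the parentheses captured (the list's inner HTML).
echo $count, "\n";           // 1
echo trim($out[1][0]), "\n"; // the <li>...</li> block without the <ul> tags
```

With the default flags, preg_match_all groups results by subpattern, so $out[1] is an array holding the first group of every match.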


If the tag you are searching for can contain attributes (e.g. <ul class="my_list"...), this extended pattern adds [^>]* after <ul, which matches any number of characters that are not > before the closing >:

$pattern = '~<ul[^>]*>\K.*(?=</ul>)~Uis';

Instead of the question mark, here I use the U modifier, which makes all quantifiers lazy by default. To get only the desired part, that is, what sits between <ul> and </ul>, \K resets the beginning of the reported match. Instead of matching the ending </ul>, a lookahead (?= is used, as we don't want that part in the output either.

This is basically the same as '~<ul[^>]*>(.*)</ul>~Uis', which would put whole-pattern matches in [0] and the first parenthesized group in [1].
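The two variants can be compared side by side. This is a small sketch on a made-up input whose opening tag is upper-case and carries an attribute; the variable names $a and $b are only for illustration.

```php
<?php
// Hypothetical input: upper-case tags, opening tag with an attribute.
$html = '<UL class="my_list"><li>x</li></UL><p>after</p>';

// \K drops the opening tag from the reported match; the lookahead
// checks for </ul> without making it part of the match.
preg_match_all('~<ul[^>]*>\K.*(?=</ul>)~Uis', $html, $a);

// Same idea with a capturing group instead.
preg_match_all('~<ul[^>]*>(.*)</ul>~Uis', $html, $b);

print_r($a[0]); // inner HTML only
print_r($b[0]); // whole match, tags included
print_r($b[1]); // inner HTML only, same as $a[0]
```

Which one to pick is mostly a matter of taste: the \K version leaves the result in [0], the group version leaves it in [1] and keeps the tagged match available as well.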


But if your HTML contains nested tags of the same kind, the idea of the following pattern is to catch the innermost ones. At each character inside <ul>...</ul>, it checks that no opening <ul starts there:

$pattern = '~<ul[^>]*>\K(?:(?!<ul).)*(?=</ul>)~Uis';

Get matches using preg_match_all

$html = '<div><ul><li><ul><li>.1.</li></ul>...</li></ul></div>
         <ul><li>.2.</li></ul>';

if (preg_match_all($pattern, $html, $out)) {
  echo "<pre>"; print_r(array_map('htmlspecialchars', $out[0])); echo "</pre>";
} else {
  echo "FAIL";
}

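The text between \K and (?= is what ends up in $out[0], because everything before \K is dropped from the reported match and the lookahead consumes nothing.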

  • \K resets beginning of the reported match (supported in PHP since 5.2.4)
  • (?:(?!<ul).)* in the nested-tag pattern: once <ul> has matched, the negative lookahead (?!<ul) checks at each character that no opening <ul starts there. If one does, the attempt fails and matching starts over at the next <ul, until </ul> is ahead (?=</ul>).
  • [^>]* any amount of characters, that are not > (negated character class)
  • (?: starts a non-capturing group.

Modifiers used: Uis (the part after the ending delimiter ~)

U (PCRE_UNGREEDY), i (PCRE_CASELESS), s (PCRE_DOTALL)
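To see what i and s each contribute, here is a small sketch on a made-up input with upper-case tags and line breaks inside the list. It uses the explicit lazy quantifier .*? rather than U, so only the two modifiers under test vary.

```php
<?php
// Hypothetical input: upper-case tags and line breaks inside the list.
$html = "<UL>\n<li>x</li>\n</UL>";

$tests = [
    'no modifiers' => '~<ul>(.*?)</ul>~',
    'i only'       => '~<ul>(.*?)</ul>~i',
    's only'       => '~<ul>(.*?)</ul>~s',
    'i and s'      => '~<ul>(.*?)</ul>~is',
];

$results = [];
foreach ($tests as $label => $pattern) {
    // preg_match returns 1 on a match and 0 on no match.
    $results[$label] = preg_match($pattern, $html);
    echo $label, ': ', $results[$label] ? 'match' : 'no match', "\n";
}
```

Only the last pattern matches here: without i the upper-case tags are not recognised, and without s the dot cannot cross the line breaks.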

OTHER TIPS

Consider using strpos:

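$html = "the page's html source";
$open = '<ul>';
$first = strpos($html, $open);
$last = ($first === false) ? false : strpos($html, '</ul>', $first);

if ($first !== false && $last !== false) {
    $start = $first + strlen($open); // skip the opening <ul> itself
    $ul = substr($html, $start, $last - $start); // the html between <ul> and </ul>
}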

If there is more than one pair of <ul> tags, consider using the offset parameter of strpos to grab the relevant bits.
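One possible way to do that is a loop that moves the offset past each list it finds. This is only a sketch: the input string and the variable names are made up, and it assumes plain <ul> tags without attributes and without nesting.

```php
<?php
// Hypothetical input with two lists.
$html = '<ul><li>one</li></ul><p>gap</p><ul><li>two</li></ul>';

$open = '<ul>';
$close = '</ul>';
$lists = [];
$offset = 0;

// Find each opening tag, starting the search where the previous list ended.
while (($first = strpos($html, $open, $offset)) !== false) {
    $start = $first + strlen($open);       // first character after <ul>
    $last = strpos($html, $close, $start); // next </ul> after that point
    if ($last === false) {
        break;                             // unclosed list: stop
    }
    $lists[] = substr($html, $start, $last - $start);
    $offset = $last + strlen($close);      // continue after this </ul>
}

print_r($lists); // the inner HTML of each list, in document order
```

Note the strict !== false comparison: strpos returns 0 when the tag sits at the very start of the string, which a loose comparison would treat as "not found".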

Licensed under: CC-BY-SA with attribution