Question

Does anyone with more knowledge of regular expressions than me know how to split up HTML code so that all tags and all words are separated? For example:

<p>Some content <a href="www.test.com">A link</a></p>

is separated like this:

array = { [0]=>"<p>",
          [1]=>"Some",
          [2]=>"content",
          [3]=>"<a href='www.test.com'>",
          [4]=>"A",
          [5]=>"link",
          [6]=>"</a>",
          [7]=>"</p>" }

I've been using preg_split so far and have managed either to split the string by whitespace or to split it by tags, but then all the content ends up in one array element when I need it to be split too.

Can anyone help me out?

Solution

preg_split shouldn't be used in that case. Try preg_match_all:

$text = '<p>Some content <a href="www.test.com">A link</a></p>';
preg_match_all('/<[^>]++>|[^<>\s]++/', $text, $tokens);
print_r($tokens);

output:

Array
(
    [0] => Array
        (
            [0] => <p>
            [1] => Some
            [2] => content
            [3] => <a href="www.test.com">
            [4] => A
            [5] => link
            [6] => </a>
            [7] => </p>
        )

)

I assume you forgot to include the 'A' in 'A link' in your example.

Be aware that when your HTML contains < or > characters that are not meant as the start or end of a tag (inside attribute values, comments, or scripts, for example), this regex will mess things up badly! (hence the warnings)
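To make that warning concrete, here is a small sketch using the same pattern on a made-up input (the title attribute is my own example, not from the question) where a > sits inside an attribute value:

```php
<?php
// Hypothetical input: the ">" inside the title attribute is not a tag delimiter.
$text = '<a title="a > b">x</a>';

// Same pattern as above: a tag-like run, or a run of non-space, non-bracket characters.
preg_match_all('/<[^>]++>|[^<>\s]++/', $text, $tokens);

print_r($tokens[0]);
// The opening tag is cut off at the first ">":
//   [0] => <a title="a >
//   [1] => b"
//   [2] => x
//   [3] => </a>
```

The tag is truncated at the first >, and the remainder of the attribute leaks out as a "word".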

OTHER TIPS

You could check out Simple HTML DOM Parser

Or look at the DOM parser in PHP
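As a rough sketch of the DOM route, assuming PHP's bundled DOMDocument extension is available: the helper name domTokens() is made up for illustration, and it deliberately drops attributes and makes no special case for void elements such as <br>.

```php
<?php
// Hypothetical helper: walk a DOM subtree and emit tag and word tokens.
function domTokens(DOMNode $node): array
{
    $tokens = [];
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMText) {
            // Split the text on whitespace, dropping empty pieces.
            $words = preg_split('/\s+/', $child->nodeValue, -1, PREG_SPLIT_NO_EMPTY);
            array_push($tokens, ...$words);
        } elseif ($child instanceof DOMElement) {
            $tokens[] = '<' . $child->nodeName . '>';   // attributes omitted for brevity
            array_push($tokens, ...domTokens($child));  // recurse into children
            $tokens[] = '</' . $child->nodeName . '>';
        }
    }
    return $tokens;
}

$doc = new DOMDocument();
// loadHTML() wraps a fragment in <html><body>, so start the walk at <body>.
$doc->loadHTML('<p>Some content <a href="www.test.com">A link</a></p>');
$body = $doc->getElementsByTagName('body')->item(0);

$tokens = domTokens($body);
print_r($tokens);
```

For the sample input this should give <p>, Some, content, <a>, A, link, </a>, </p>. If the attributes matter, they can be read from $child->attributes while building the opening tag.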

Give Simple HTML DOM Parser a try. HTML is too irregular for regular expressions.

I disagree with Bart about the recommendation of preg_match_all() over preg_split().

The task is literally to "split" the whole string on a variety of delimiters. My first recommendation is a DOM parser, because it is more stable than regex. If you don't require that level of stability because your input HTML is relatively predictable and simple, regex can be used as a cheaper, more concise alternative.

Code:

$html = <<<HTML
<p>Some content <a href="www.test.com">A link</a></p>
HTML;

var_export(preg_split('~\s+|(<[^>]+>)~', $html, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE));

Output:

array (
  0 => '<p>',
  1 => 'Some',
  2 => 'content',
  3 => '<a href="www.test.com">',
  4 => 'A',
  5 => 'link',
  6 => '</a>',
  7 => '</p>',
)

My pattern splits on one or more whitespace characters or on a (weak interpretation of an) HTML tag. The whitespace is simply discarded. The tags are retained in the output because they sit in a capture group and PREG_SPLIT_DELIM_CAPTURE is set.
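To see what each flag contributes, here is a small comparison on the same input. The variable names are only for illustration:

```php
<?php
$html = '<p>Some content <a href="www.test.com">A link</a></p>';
$pattern = '~\s+|(<[^>]+>)~';

// Without PREG_SPLIT_DELIM_CAPTURE the tags act as plain delimiters and vanish.
$withoutCapture = preg_split($pattern, $html, -1, PREG_SPLIT_NO_EMPTY);

// With it, whatever the capture group matched is kept as its own element.
$withCapture = preg_split($pattern, $html, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);

var_export($withoutCapture); // 'Some', 'content', 'A', 'link'
var_export($withCapture);    // '<p>', 'Some', 'content', '<a href="www.test.com">', 'A', 'link', '</a>', '</p>'
```

PREG_SPLIT_NO_EMPTY matters in both calls: it drops the empty strings that would otherwise appear between adjacent delimiters such as </a></p>.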

Beyond logical semantics, preg_split() has the additional benefit of producing a less bloated and therefore more direct output. preg_split() returns a one-dimensional array, whereas preg_match_all() populates a multidimensional array in which the full matches sit under element [0].

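Finally, preg_split() degrades more gracefully than preg_match_all() might. Imagine the unlikely fringe case where the input string doesn't contain any spaces or tags. preg_split() will return the whole input string as a single-element array (useful and consistent with more common input strings). preg_match_all() will generate an empty array (not very useful) whenever its pattern finds nothing. Note, though, that with the pattern from the answer above a plain word is still caught by the [^<>\s]++ alternative, so this only happens with stricter patterns.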
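A quick check of that fringe case, as a sketch: the tags-only pattern in the last call is a hypothetical variant added for contrast, not something from either answer.

```php
<?php
$input = 'hello'; // no whitespace, no tags

// preg_split(): the whole string comes back as a single element.
$split = preg_split('~\s+|(<[^>]+>)~', $input, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);

// preg_match_all() with the two-alternative pattern: the word alternative still matches.
preg_match_all('/<[^>]++>|[^<>\s]++/', $input, $both);

// preg_match_all() with a tags-only pattern (hypothetical): nothing matches.
$count = preg_match_all('/<[^>]++>/', $input, $tagsOnly);

var_export($split);       // array ( 0 => 'hello' )
var_export($both[0]);     // array ( 0 => 'hello' )
var_export($tagsOnly[0]); // array ( )
var_export($count);       // 0
```

So the empty result depends on the pattern rather than on preg_match_all() itself. The difference in array shape (flat versus nested under [0]) applies in every case.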

I currently use Simple HTML DOM Parser in several applications and find it to be an excellent tool, even when compared against other HTML parsers written in other languages.

Why exactly are you splitting up HTML into the list of tokens you described? Wouldn't a tree-like structure of DOM elements be a better approach for your specific application?

Licensed under: CC-BY-SA with attribution