How to identify if a text is HTML or not? (in PHP)

https://stackoverflow.com/questions/18242465

24-06-2022

Question

I want to read text entries from a database. Some of them are actual HTML entries, while others are plain text that might contain HTML markup which should be displayed as text.

Those that are plain text should then be converted to HTML, by first calling PHP's htmlspecialchars() function and then running the result through HTMLPurifier.

Or in other words, I'm looking for some tips on how to implement the isHTML() function:

$text = getTextFromDatabase();
if (!isHTML($text)) {
    $text = htmlspecialchars($text);
}
$purifier = new HTMLPurifier();
$clean_html = $purifier->purify($text);

So, for example, the following text would be run through htmlspecialchars:

The <p> tag of HTML has to be followed by a </p> tag to end the paragraph.

And the following text would not be run through htmlspecialchars:

<p>These are few lines of HTML.</p>
<div>There might be multiple independent</div>
<p>but valid HTML blocks in it.</p>

It seems like there should already be an isHTML() function out there, but I just can't seem to find it and I don't want to reinvent the wheel :-). Maybe it's even possible to do this with some kind of HTMLPurifier settings?

Note that if the HTML code is buggy, this should be handled by HTMLPurifier and the code should not be run through htmlspecialchars. :-) For example, the HTML code might have an opening <p> tag where there really should be a closing </p> tag.

Any help is appreciated, thanks already :-),
Robert.

Solution 3

You can only check whether the string contains characters specific to HTML:

function is_html($string)
{
  // True if the string contains anything shaped like a tag, e.g. <p> or </div>.
  return preg_match('/<[^<]+>/', $string) === 1;
}
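
As a quick check, the pattern above can be run against the sample texts from the question. This is only a sketch: the `$plain` and `$html` variable names and the two extra strings are made up for illustration. Both samples contain something shaped like a tag, so this check alone does not tell them apart:

```php
<?php
// Same check as in the answer, repeated so the snippet runs on its own.
function is_html($string)
{
    return preg_match('/<[^<]+>/', $string) === 1;
}

// Sample inputs taken from the question.
$plain = 'The <p> tag of HTML has to be followed by a </p> tag to end the paragraph.';
$html  = "<p>These are few lines of HTML.</p>\n<div>There might be multiple independent</div>";

var_dump(is_html($plain));              // bool(true): the sentence mentions "<p>"
var_dump(is_html($html));               // bool(true)
var_dump(is_html('5 < 6 and 7 > 3'));   // bool(true): "< 6 and 7 >" is shaped like a tag
var_dump(is_html('no markup here'));    // bool(false)
```

So the check answers "does this contain tag-like text?", not "is this meant as HTML?".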

OTHER TIPS

You can try this function:

function isHTML($string){
    // If strip_tags() removed anything, the string contained tag-like markup.
    return ($string !== strip_tags($string));
}
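
A short sketch of how this comparison behaves on the texts from the question (the variable names are illustrative only). Because strip_tags() removes anything that looks like a tag, a plain-text sentence that merely mentions a tag is reported as HTML too:

```php
<?php
// strip_tags() comparison, repeated so the snippet runs on its own.
function isHTML($string)
{
    return $string !== strip_tags($string);
}

// Sample inputs taken from the question.
$plain = 'The <p> tag of HTML has to be followed by a </p> tag to end the paragraph.';
$html  = '<p>These are few lines of HTML.</p>';

var_dump(isHTML($plain));             // bool(true): "<p>" and "</p>" get stripped
var_dump(isHTML($html));              // bool(true)
var_dump(isHTML('no markup here'));   // bool(false): nothing to strip
```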

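Consider this logic: if the text contains HTML, then the output of htmlentities differs from the input text. Note that htmlentities also changes plain text containing characters such as & or double quotes, so this check can report plain text as HTML. So: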

function isHTML($text){
   // The text counts as HTML if htmlentities() had anything to encode.
   return htmlentities($text) !== $text;
}
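
To see where this check is too eager, here is a small sketch with made-up sample strings. htmlentities() encodes `&` as well as `<` and `>`, so ordinary text with an ampersand is reported as HTML:

```php
<?php
// htmlentities() comparison, repeated so the snippet runs on its own.
function isHTML($text)
{
    return htmlentities($text) !== $text;
}

var_dump(isHTML('<p>Some markup</p>'));   // bool(true)
var_dump(isHTML('Tom & Jerry'));          // bool(true): "&" becomes "&amp;"
var_dump(isHTML('plain text'));           // bool(false)
```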

I hope this works for you

If the only purpose is to detect whether the string contains any HTML tags, no matter whether they are valid or not, then you can try this:

function is_html($string) {
  // Check if string contains any html tags.
  return preg_match('/<\s?[^\>]*\/?\s?>/i', $string) === 1;
}

You can verify this here: https://regex101.com/r/2g7Fx4/4

I was wondering whether we could compare the tag-stripped version of the string with the original: if they differ, there was something to strip. This post proposes the same thing: https://subinsb.com/php-check-if-string-is-html
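
Since both sample texts from the question contain tags, a stricter heuristic may be needed to separate them. One possibility, offered only as a sketch and not taken from any of the answers: treat the text as HTML only when it begins with a tag. The `startsWithTag()` name is hypothetical, and the heuristic will misjudge HTML fragments that begin with bare text such as `Hello <b>world</b>`.

```php
<?php
// Hypothetical helper: true when the text, ignoring leading whitespace,
// begins with something shaped like a tag, comment or doctype.
function startsWithTag($text)
{
    return preg_match('/^\s*<[a-z!][^>]*>/i', $text) === 1;
}

// Sample inputs taken from the question.
$plain = 'The <p> tag of HTML has to be followed by a </p> tag to end the paragraph.';
$html  = "<p>These are few lines of HTML.</p>\n<div>There might be multiple independent</div>";

var_dump(startsWithTag($plain));                 // bool(false): starts with "The"
var_dump(startsWithTag($html));                  // bool(true): starts with "<p>"
var_dump(startsWithTag('Hello <b>world</b>'));   // bool(false): a known limitation
```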

Licensed under: CC-BY-SA with attribution
