Question

for example.

<html>
<head></head>
<body>
<div>
<h1>-----> hello! ----< </h1>
</div>
</body>

I want to replace the > and < inside the h1 tag with the corresponding entities, &gt; and &lt;.

Which is the correct pattern?

Thanks in advance!


Solution

You could run it through tidy (see the PHP docs for the tidy extension) and see whether it can repair the errors. That is a lot better than trying to do the "right thing" on your own with a regex.

$html = <<<EOT
<html>
<head></head>
<body>
<div>
<h1>-----> hello! ----< </h1>
</div>
</body>
EOT;

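// Options passed to tidy; 'wrap' => 0 disables line wrapping.
$config = array(
  'clean'                       => true,
  'drop-proprietary-attributes' => true,
  'output-xhtml'                => false,
  'show-body-only'              => false,
  'wrap'                        => 0
);

// Parse the broken markup, then let tidy clean and repair it.
$tidy = new tidy();
$tidy->parseString($html, $config, 'utf8');
$tidy->cleanRepair();

// Print the repaired document.
echo tidy_get_output($tidy);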

You may need to enable the tidy extension in your PHP environment first.
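A minimal sketch of how you could check for the extension before calling it. It relies only on PHP's built-in extension_loaded(); the helper name tidyAvailable() is made up for this example, and how you actually enable the extension depends on your platform and PHP build.

```php
<?php
// Hypothetical helper (not part of PHP or tidy): reports whether the
// tidy extension is loaded in the running PHP process.
function tidyAvailable(): bool
{
    return extension_loaded('tidy');
}

if (!tidyAvailable()) {
    // Fall back or bail out here instead of hitting a fatal
    // "Class not found" error on `new tidy()`.
    echo "The tidy extension is not enabled.\n";
}
```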

OTHER TIPS

I agree with the commenter who asked, "Why is this broken HTML being generated in the first place?" If you represent documents like this, you will have exactly the problems you are having now. There are only two valid situations:

  • You have some data (not HTML escaped) e.g. a bunch of strings in PHP
  • You have an HTML document, containing tags, and text which is HTML escaped

So when you generate the HTML document from your source data (strings, database), that is the point at which you need to do the escaping, e.g. by using htmlspecialchars, as another answerer correctly pointed out.
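Escaping at generation time might look like the sketch below. The variable names and the page template are assumptions for illustration; only htmlspecialchars() is taken from the discussion.

```php
<?php
// Raw, unescaped data, as it might come from a database or user input.
$title = '-----> hello! ----<';

// Escape the text at the moment it is placed into the HTML document.
$escaped = htmlspecialchars($title, ENT_QUOTES, 'UTF-8');

// Build the page from trusted markup plus the escaped text.
$page = "<html>\n<head></head>\n<body>\n<div>\n"
      . "<h1>" . $escaped . "</h1>\n"
      . "</div>\n</body>\n</html>\n";

echo $page; // the h1 line reads: <h1>-----&gt; hello! ----&lt;</h1>
```

Because the stored data stays unescaped and the document is always escaped, there is never a mixed string that needs repairing afterwards.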

You need to avoid, at all costs, a situation where you have a string like you have, which has HTML tags and non-escaped text.

For example, suppose your text contained <b>text</b> and you literally wanted that to be displayed in the HTML document, i.e. you wanted the angle brackets to be visible rather than the text shown in bold (say you were writing a document about how to write HTML). Once you have a document that mixes tags and non-escaped text, you have no way to tell that literal text apart from actual HTML code.
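The ambiguity can be shown in a few lines. This is only an illustrative sketch; the variable names are made up.

```php
<?php
// Two strings with identical bytes but different intent.
$literal = '<b>text</b>'; // should be displayed with visible angle brackets
$markup  = '<b>text</b>'; // is real markup and should render in bold

// Concatenated without escaping: both paragraphs are identical, so
// nothing downstream (regex, tidy, a human) can recover the intent.
$broken = '<p>' . $literal . '</p><p>' . $markup . '</p>';

// Escaping the data when the document is generated keeps them distinct.
$correct = '<p>' . htmlspecialchars($literal, ENT_QUOTES, 'UTF-8') . '</p>'
         . '<p>' . $markup . '</p>';

echo $broken, "\n";  // <p><b>text</b></p><p><b>text</b></p>
echo $correct, "\n"; // <p>&lt;b&gt;text&lt;/b&gt;</p><p><b>text</b></p>
```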

I would pass it through tidy.

Licensed under: CC-BY-SA with attribution