Question

Modern HTML dialects and rules of good practice forbid omitting the semicolon in HTML entities (&likethat;). But my task is to parse arbitrary pages, so I have to deal with malformed HTML entities that lack the semicolon, which browsers nevertheless render perfectly. How can I decode HTML entities without semicolons into their respective UTF-8 equivalents with PHP?

Solution

You can get a list of all HTML entities and use it to replace every entity that lacks a semicolon with its UTF-8 representation:

// get all HTML entities as a map of character => entity, e.g. '<' => '&lt;'
$mapping = get_html_translation_table(HTML_ENTITIES, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// turn each entity into a regex pattern with a negative lookahead for the semicolon,
// so that only entities NOT followed by ';' are matched
array_walk($mapping, function (&$value) {
    $value = '/' . preg_quote(rtrim($value, ';'), '/') . '(?!;)/';
});

// replace all entities without a semicolon by their UTF-8 representation
$html = preg_replace(array_values($mapping), array_keys($mapping), $html);
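One caveat with the replacement above: the patterns are applied one after another, so a short name can match inside a longer, properly terminated entity (the pattern for &not also hits &notin;), and text produced by one replacement can be decoded again by a later pattern. A possible alternative is to decode in a single pass and prefer the longest known name. The following is only a sketch under stated assumptions: decode_entities_lenient is a hypothetical helper name, not a PHP built-in, and it relies on html_entity_decode with ENT_HTML5 to decide what counts as a known entity.

```php
<?php
// Hypothetical helper (not a PHP built-in): decodes named entities whether or
// not they are terminated by a semicolon, in one pass over the input.
function decode_entities_lenient(string $html): string
{
    $flags = ENT_QUOTES | ENT_HTML5;

    return preg_replace_callback(
        '/&([A-Za-z][A-Za-z0-9]*)(;?)/',
        function (array $m) use ($flags): string {
            [$whole, $name, $semicolon] = $m;

            // Properly terminated entity: let PHP decode it as usual.
            if ($semicolon !== '') {
                $decoded = html_entity_decode($whole, $flags, 'UTF-8');
                if ($decoded !== $whole) {
                    return $decoded;
                }
            }

            // Otherwise look for the longest prefix of the name that is a known
            // entity and keep the remaining characters as plain text.
            for ($len = strlen($name); $len > 0; $len--) {
                $candidate = '&' . substr($name, 0, $len) . ';';
                $decoded = html_entity_decode($candidate, $flags, 'UTF-8');
                if ($decoded !== $candidate) {
                    return $decoded . substr($name, $len) . $semicolon;
                }
            }

            // Not a known entity: leave the text untouched.
            return $whole;
        },
        $html
    );
}

echo decode_entities_lenient('Caf&eacute &copy 2014, 1 &lt 2'), "\n";
```

Note that this is more lenient than browsers, which accept a missing semicolon only for a limited set of legacy names (such as &amp, &lt or &copy). On text that contains bare ampersands, for example in URLs, it may therefore decode more than a browser would. Numeric references such as &#60 are not handled by this sketch.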

OTHER TIPS

My guess is that you could load the document with DOMDocument::loadHTML and then save it again with DOMDocument::saveHTML.

You may specify additional options using libxml constants.
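A minimal sketch of that round trip is shown here, with no claim about its result: whether entities without semicolons come out decoded depends on the HTML parser of the libxml2 version PHP is built against, so check the output on your own setup. The function name roundtrip_html is hypothetical, and the two LIBXML_HTML_* flags are just one possible choice of options.

```php
<?php
// Hypothetical helper: parses an HTML fragment with DOMDocument and
// serializes it again, leaving any entity handling to libxml.
function roundtrip_html(string $html): string
{
    // Collect parser warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    // NOIMPLIED / NODEFDTD: do not add <html>/<body> wrappers or a doctype.
    $doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $out = $doc->saveHTML();

    // saveHTML() returns false on failure; fall back to the input then.
    return $out === false ? $html : $out;
}

echo roundtrip_html('<p>Caf&eacute; &copy 2014</p>');
```

Keep in mind that saveHTML may write non-ASCII characters back as entities, so the serialized markup is not necessarily plain UTF-8 text.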

Licensed under: CC-BY-SA with attribution