The following uses DOM to find any elements that are not valid HTML4 elements and treats them as book titles. These are then whitelisted in strip_tags().
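// Collect libxml parse errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
// Turn every "Tag X invalid" parse error into an allowed tag "<X>".
$allowed = array_map(
    function ($error) {
        return '<' . sscanf($error->message, 'Tag %s invalid')[0] . '>';
    },
    libxml_get_errors()
);
// strip_tags() takes the allowed tags as one concatenated string, e.g. '<a><b>'.
echo strip_tags($html, implode('', $allowed));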
Take note that any book title starting with a valid HTML tag name will be considered valid HTML and thus stripped (for instance "Body of Evidence" or "Head First PHP"). Also note that <gone with the wind> is parsed as the element "gone" with the attributes "with", "the" and "wind". For valid elements, you could check whether they have only empty attributes and strip them only if they do not, but that would still not be 100% accurate when the title consists of just the valid element name. In addition, you could check for closing tags, but I am not aware of how to do that with DOM (XMLParser can detect them, though).
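To make the attribute behaviour concrete, here is a minimal sketch of how DOM sees such a title and how the "only empty attributes" check could look. The helper name hasOnlyEmptyAttributes() and the sample markup are made up for illustration; they are not part of the code above.

```php
<?php
// Hypothetical helper: true when no attribute on the element carries a value,
// which hints that the "element" is really a multi-word title.
// Note: an element without any attributes also returns true.
function hasOnlyEmptyAttributes(DOMElement $element): bool
{
    foreach ($element->attributes as $attribute) {
        if ($attribute->value !== '') {
            return false;
        }
    }
    return true;
}

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument;
$dom->loadHTML('<p>I read <gone with the wind> and <a href="x">this</a>.</p>');

// The title becomes an element named "gone"; the other words become attributes.
$gone = $dom->getElementsByTagName('gone')->item(0);
$attributeNames = [];
foreach ($gone->attributes as $attribute) {
    $attributeNames[] = $attribute->nodeName;
}
echo implode(', ', $attributeNames), PHP_EOL;

var_dump(hasOnlyEmptyAttributes($gone));
var_dump(hasOnlyEmptyAttributes($dom->getElementsByTagName('a')->item(0)));
```

As said, this is only a heuristic: a real element with only boolean attributes, or none at all, passes the same check.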
In any case, choosing a better format for these book titles, e.g. using namespaces or a delimiter other than angle brackets, would greatly improve your chances of doing this properly.
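As a rough sketch of the delimiter idea, assume titles were written as [[Title]] instead. That double-bracket convention is purely an assumption for illustration. strip_tags() then leaves the titles alone without any whitelist:

```php
<?php
// Assumed convention: titles are written as [[Title]] rather than <Title>.
$text = 'I read [[Gone with the Wind]] and <b>loved</b> it.';

// Only real markup is removed; the bracketed title is plain text to strip_tags().
$stripped = strip_tags($text);

// If the angle-bracket form is needed afterwards, convert the delimiters back.
$restored = preg_replace('/\[\[(.+?)\]\]/', '<$1>', $stripped);

echo $stripped, PHP_EOL;
echo $restored, PHP_EOL;
```

If the result ends up in an HTML page, you would probably want to escape the restored angle brackets rather than output them raw.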