PHP strip tags except '<>' (book name)

Question 1

The following uses DOM to find any elements that are not valid HTML4 elements and treats them as book titles. These are then whitelisted in strip_tags.

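// Collect parser errors instead of emitting warnings
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);

// Build the whitelist from the "Tag xyz invalid" errors only
$allowed = '';
foreach (libxml_get_errors() as $error) {
    if (preg_match('/^Tag (\S+) invalid/', $error->message, $match)) {
        $allowed .= '<' . $match[1] . '>';
    }
}
libxml_clear_errors();

echo strip_tags($html, $allowed);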

Note that any book title starting with a valid HTML element name will be considered valid HTML and thus stripped (for instance "Body of Evidence" or "Head First PHP"). Also note that <gone with the wind> is parsed as the element "gone" with the attributes "with", "the" and "wind". For valid elements, you could check whether they have only empty attributes and strip them only if they do not, but that would still not be 100% accurate when the title consists of just the valid element name. In addition, you could check for closing tags, but I am not aware of how to do that with DOM (XMLParser can detect them, though).

In any case, figuring out a better format for these book titles, e.g. using namespaces or a delimiter other than angle brackets, would greatly improve your chances of doing this properly.
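
As a minimal sketch of the different-delimiter idea, assume the supplier can mark titles as [[title]]. That convention is hypothetical and not part of the original data. The markup can then be stripped first and the angle brackets restored afterwards:

```php
<?php
// Assumption: titles arrive as [[title]] rather than <title>.
$html = '[[gone with the wind]] <p>a hotest book</p>';

// No title looks like a tag any more, so strip_tags is safe to use.
$text = strip_tags($html);

// Restore the angle brackets for display as plain text.
$text = preg_replace('/\[\[(.+?)\]\]/', '<$1>', $text);

echo $text, "\n"; // <gone with the wind> a hotest book
```

Any delimiter works as long as it cannot occur in a title or in the markup.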

Question 2

You should consider using &lt; (<) and &gt; (>).

Question 3

Here's a simple, although not foolproof solution for you.

PHP

$data = "<gone with the wind> <p>a hotest book</p>";
// Remove bare opening or closing tags such as <p> or </p> (no attributes, no spaces)
$out = preg_replace('/<\/?\w+>/', '', $data);

var_dump($out);

Output

string '<gone with the wind> a hotest book' (length=34)

Would Match

<p>text</p>
<anything>text</anything>

Would Not Match

<img src="url">

As others have said, there's no way for the code to know what a book title looks like.

Although, if you expect your data to contain only simple <p> tags, then this would work.

Crazy solution, thought I'd throw it out there.
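
To make the matching rules concrete, here is a small sketch that runs a pattern equivalent to the one above over a few sample strings. The samples are made up for illustration.

```php
<?php
// Bare opening or closing tags only: no attributes, no spaces.
$pattern = '/<\/?\w+>/';

$samples = [
    '<p>text</p>',
    '<img src="url"> caption',
    '<gone with the wind> <p>a hotest book</p>',
];

$results = [];
foreach ($samples as $sample) {
    $results[] = preg_replace($pattern, '', $sample);
}

foreach ($results as $result) {
    echo $result, "\n";
}
// text
// <img src="url"> caption
// <gone with the wind> a hotest book
```

The second sample shows the downside: any real tag that carries attributes is left in place.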

Question 4

You can also do it more simply, like this:

   <?php
   $string = htmlspecialchars("<gone with the wind>");
   echo strip_tags( "$string <p>a hotest book</p>");
   ?>

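This will output the following when rendered in a browser (the raw string contains &lt; and &gt; instead of the angle brackets):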

   <gone with the wind> a hotest book

Question 5

$string = '<gone with the wind> <p>a hotest book</p>';
$string = strip_tags(preg_replace('/<([\w\s]{6,})>/', '&lt;$1&gt;', $string));
$string = html_entity_decode($string);

The above will convert any 'tags' with six or more characters between < and > to &lt;...&gt;, allowing you to then use strip_tags. The final html_entity_decode call turns the entities back into angle brackets.

You may need to experiment with the value six depending on your incoming data. If you get a tag like <article> (seven letters), you will need to push it higher.
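
The effect of the threshold can be sketched as follows. The helper name stripShortTags and the sample input are made up for illustration, and the type hints assume PHP 7 or later.

```php
<?php
// Hypothetical helper wrapping the three steps above with a
// configurable minimum length for the text between < and >.
function stripShortTags(string $string, int $minLength): string
{
    // Protect long "tags" by turning their brackets into entities.
    $pattern = '/<([\w\s]{' . $minLength . ',})>/';
    $protected = preg_replace($pattern, '&lt;$1&gt;', $string);

    // Strip what still looks like a tag, then restore the brackets.
    return html_entity_decode(strip_tags($protected));
}

$input = '<gone with the wind> <strong>hot</strong> book';

echo stripShortTags($input, 6), "\n"; // <gone with the wind> <strong>hot book
echo stripShortTags($input, 7), "\n"; // <gone with the wind> hot book
```

With a minimum of six, the six-letter <strong> is mistaken for a title and survives; raising the minimum to seven removes it, at the cost of also stripping any title shorter than seven characters.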

Question 6

The best thing I could think of is something like this. Since I didn't know which tags would be used, I assumed all of them. This should remove any valid HTML tag, not just anything that looks like it could be a tag.

<?php
$tags = array("!DOCTYPE","a","abbr","acronym","address","applet","area","article","aside","audio","b","base","basefont","bdi","bdo","big","blockquote","body","br","button","canvas","caption","center","cite","code","col","colgroup","command","datalist","dd","del","details","dfn","dir","div","dl","dt","em","embed","fieldset","figcaption","figure","font","footer","form","frame","frameset","h1","h2","h3","h4","h5","h6","head","header","hgroup","hr","html","i","iframe","img","input","ins","kbd","keygen","label","legend","li","link","map","mark","menu","meta","meter","nav","noframes","noscript","object","ol","optgroup","option","output","p","param","pre","progress","q","rp","rt","ruby","s","samp","script","section","select","small","source","span","strike","strong","style","sub","summary","sup","table","tbody","td","textarea","tfoot","th","thead","time","title","tr","track","tt","u","ul","var","video","wbr");

$string = "<gone with the wind> <p>a hotest book</p>";


// \b stops a short name such as "a" or "p" from matching the start of a longer word
echo preg_replace('/<\/?(?:' . implode('|', $tags) . ')\b[^>]*>/i', '', $string);

The final output looks like this:

<gone with the wind> a hotest book

Question 7

You're going to be out of luck on this because you have no way of knowing which things in <> are HTML tags and which are book titles. You can't even write something that looks for things that look like tags but aren't actually valid HTML tags, since you might get a record for the Monkees' 1968 movie "Head", which would come across as <Head>, which certainly is a valid HTML tag.

You'll need to work this out with the supplier of your data, and then you can use the PHP strip_tags function.
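
A short sketch of the ambiguity, using a made-up record for that movie in the same format as the other examples:

```php
<?php
// Hypothetical record: the title "Head" wrapped in angle brackets.
$record = '<Head> <p>a 1968 movie</p>';

// strip_tags cannot tell the title from markup, so the title is lost.
$stripped = strip_tags($record);

echo '[', $stripped, "]\n"; // [ a 1968 movie]
```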