PHP function to turn arbitrary “description” into valid xml data for podcast feed

https://stackoverflow.com/questions/3163142

02-10-2019

Question

I am reading the documentation for creating a podcast feed suitable for iTunes, and the Common Mistakes section says:

Using HTML Named Character Entities.

<!-- illegal xml -->
<copyright>&copy; 2005 John Doe</copyright>

<!-- valid xml -->
<copyright>&#xA9; 2005 John Doe</copyright>

Unlike HTML, XML supports only five "named character entities":

character   name               xml
&           ampersand          &amp;
<           less-than sign     &lt;
>           greater-than sign  &gt;
'           apostrophe         &apos;
"           quotation          &quot;

The five characters above are the only characters that require escaping in XML. All other characters can be entered directly in an editor that supports UTF-8. You can also use numeric character references that specify the Unicode for the character, for example:

character   name                       xml
©           copyright sign             &#xA9;
℗           sound recording copyright  &#x2117;
™           trade mark sign            &#x2122;

For further reference see XML Character and Entity References.

Right now I'm using htmlentities() under PHP5 and the feed is validating and working. But from what I gather, some things that could get put into the content might become entities that would make the feed no longer valid. What's the best function to use to ensure I'm not passing along bad data? I'm paranoid that something will get entered, get entity-ized, and break the feed -- should I just use str_replace() to replace the five characters with their named entities and leave the rest alone? Or can I use htmlspecialchars() somehow?

So in short, what's a drop-in replacement for htmlentities() that will make sure input is safe for descriptions, titles, etc. in a podcast RSS feed?

Solution

You can either:

Use a CDATA block instead (just make sure you're using the correct encoding, i.e., that the encoding of the XML file matches the encoding of the data). The only thing you have to look out for is ]]>, which cannot appear literally in a CDATA block.
Use mb_encode_numericentity instead of htmlentities (possibly combined with htmlspecialchars and a prior decoding of HTML entities with mb_convert_encoding).
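
As a rough illustration of the first option, here is a minimal sketch of a CDATA wrapper. The function name cdata_wrap() is made up for this example, and it assumes the text is already in the feed's encoding. The one special case, ]]>, is handled by closing the section after ]] and opening a new one before >.

```php
<?php
// Hypothetical helper (not a built-in): wrap text in a CDATA section.
function cdata_wrap(string $text): string
{
    // A literal "]]>" would end the section early, so split it across
    // two adjacent CDATA sections: "...]]" + "]]><![CDATA[" + ">...".
    $safe = str_replace(']]>', ']]]]><![CDATA[>', $text);
    return '<![CDATA[' . $safe . ']]>';
}

// Example: markup characters pass through untouched inside CDATA.
echo '<description>' . cdata_wrap('Tips & tricks for <b>XML</b>') . '</description>', "\n";
```

Keep in mind that inside CDATA nothing is treated as an entity reference, so a string like &copy; reaches the reader as those six literal characters unless the client chooses to interpret the content as HTML.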

If the encoding of the XML file is UTF-8, you can simply decode the entities into the literal characters they stand for. Suppose you have the following HTML fragment:

&copy; 2005 John Doe

Then, you could just do:

$data = "&copy; 2005 John Doe";
$data = mb_convert_encoding($data, "UTF-8", "HTML-ENTITIES");
$data = htmlspecialchars($data, ENT_NOQUOTES, "UTF-8");
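
The two steps above can be folded into a single drop-in function. What follows is only a sketch under some assumptions: PHP 7 or later, UTF-8 input, and a made-up function name xml_safe_text(). It swaps in html_entity_decode() for the decoding step, because the "HTML-ENTITIES" pseudo-encoding of mb_convert_encoding() is deprecated as of PHP 8.2. The optional second argument applies mb_encode_numericentity() for feeds that should stay pure ASCII.

```php
<?php
// Hypothetical drop-in helper; the name and the $asciiOnly flag are
// illustrative, not part of PHP or of the original answer.
function xml_safe_text(string $text, bool $asciiOnly = false): string
{
    // 1. Turn HTML entities (&copy;, &eacute;, ...) into literal UTF-8 characters.
    $decoded = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // 2. Escape only the five characters XML treats specially.
    //    ENT_XML1 makes the apostrophe come out as &apos;;
    //    ENT_SUBSTITUTE replaces invalid byte sequences instead of
    //    returning an empty string.
    $escaped = htmlspecialchars($decoded, ENT_QUOTES | ENT_XML1 | ENT_SUBSTITUTE, 'UTF-8');

    if ($asciiOnly) {
        // 3. Optionally rewrite every non-ASCII character as a numeric
        //    character reference. Map: [first, last, offset, mask].
        $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
        $escaped = mb_encode_numericentity($escaped, $map, 'UTF-8');
    }

    return $escaped;
}

// Example usage with the fragment from above.
echo '<copyright>' . xml_safe_text('&copy; 2005 John Doe') . '</copyright>', "\n";
echo '<copyright>' . xml_safe_text('&copy; 2005 John Doe', true) . '</copyright>', "\n";
```

The order matters: decoding first and escaping second means a stray & or < in the input is still escaped, while named entities never reach the feed. This sketch does not strip control characters that XML 1.0 forbids, so input from untrusted sources may need an extra filtering step.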

Licensed under: CC-BY-SA with attribution
