Question
I'm trying to find a way to reliably locate and replace the < and > symbols that do not belong to tags within an HTML/XML formatted string.
Basically I start with an HTML string and convert it into something usable by PDFLib, which uses a form of XML to describe documents to be written as PDFs. However, if there is a < within the content, it sees it as the opening of a tag and throws a parse exception.
<p>This is a test where 6 < 9</p>
<p>This is part of <strong>The same test</strong></p>
<p>This should also work 6<99999</p>
The text surrounding the < is not always numbers; it is user-entered and could be anything, such as Grade<C, Blue<Red<Green, Test < Test2 ... just about anything really.
This is a test where 6 <charref fontname=Helvetica encoding=unicode>&lt;<resetfont> 9\n
This is part of <fontname=Helvetica fontstyle=bold encoding=unicode>The same test<resetfont>\n
This should also work 6<charref fontname=Helvetica encoding=unicode>&lt;<resetfont>99999\n
I've tried str_replace and preg_replace, but can't find a solution that will reliably leave the tags alone and replace just the < in context.
Parsing the DOM also seems to fail, as DOMDocument sees the < as an opening tag as well.
Using htmlspecialchars on the string converts the < and > of the tags into &lt; and &gt; as well, which is no good.
Does anyone have any ideas?
Solution
Try using the answer from this question:
I tried to add this as it stands, but StackOverflow requires me to add some description to the answer, or it automatically gets converted into a comment, which can't be accepted as an answer.
OTHER TIPS
Try reading the string character by character from the start. When you encounter a <, push it into a buffer. If a > is then found with no space in between, the buffered text is a tag. If you encounter another < first, mark the previous one as &lt; and start a new buffer with the new one. Repeat until the end of the string.
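The scan described above could be sketched roughly as follows. This is only one reading of the idea, not a tested solution: the function name escapeStrayAngleBrackets is made up for illustration, it assumes tags carry no attributes (as in the question's samples), and it also escapes any > that does not close a tag.

```php
<?php
// Hypothetical helper: escapes "<" and ">" that do not look like part of a tag.
// Assumption: a tag is "<", then one or more characters that are not
// "<", ">" or whitespace, then ">". Tags with attributes are NOT recognized.
function escapeStrayAngleBrackets(string $input): string
{
    $out = '';
    $len = strlen($input);
    $i = 0;

    while ($i < $len) {
        $ch = $input[$i];

        if ($ch === '<') {
            // Length of the run after "<" that has no "<", ">" or whitespace.
            $run = strcspn($input, "<> \t\r\n", $i + 1);
            $end = $i + 1 + $run;

            if ($run > 0 && $end < $len && $input[$end] === '>') {
                // Reached ">" with no space or second "<": copy the tag verbatim.
                $out .= substr($input, $i, $run + 2);
                $i = $end + 1;
                continue;
            }

            // Hit a space, another "<", or the end of the string: literal "<".
            $out .= '&lt;';
        } elseif ($ch === '>') {
            // A ">" seen outside a tag is literal too.
            $out .= '&gt;';
        } else {
            $out .= $ch;
        }

        $i++;
    }

    return $out;
}

echo escapeStrayAngleBrackets('<p>This should also work 6<99999</p>'), "\n";
```

Content that looks exactly like a tag, such as the b in a<b>c, is still copied through as a tag, so this heuristic cannot handle every possible user input.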
While it's no longer maintained, I think the PHP port of html5lib is probably your best bet for parsing bad markup.
A simple call like this:
require_once 'your-path-to-html5lib/Parser.php';
$dom = HTML5_Parser::parse($input);
will take bad markup in $input and return a valid PHP DOMDocument.
From there you can save it back to a string with $dom->saveHTML() or $dom->saveXML(), or extract the bits you want with the DOM API.
Note that this will produce a full HTML document with head and body etc., even if your original data didn't include that.
If you just want to parse an HTML fragment, you can do:
$dom = HTML5_Parser::parseFragment($input);
which will return a DOMNodeList.
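To see why a round trip through the DOM helps, here is a small sketch that uses only PHP's built-in DOM extension. It builds the paragraph by hand as a stand-in for whatever a lenient parser would return (html5lib itself is not called here), then shows that the text node holds the raw < while serializing writes it as an entity.

```php
<?php
// Stand-in for a parsed document: one <p> whose text contains a literal "<".
$dom = new DOMDocument();
$p = $dom->createElement('p');
$p->appendChild($dom->createTextNode('This is a test where 6 < 9'));
$dom->appendChild($p);

// The DOM API exposes the text exactly as the user typed it.
$text = $p->textContent;

// Serializing the node escapes the "<" held in the text node.
$html = $dom->saveHTML($p);

echo $text, "\n";
echo $html, "\n";
```

The serialized string should contain 6 &lt; 9, so once the markup has been parsed into a tree, the stray < no longer looks like the start of a tag.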
HTML entities are the best way to do such things. &lt; and &gt; are the entities used to replace < and > in HTML, even inside the <code> tag. You can use these entities, then replace them with < and > again in your HTML tags. See www.w3schools.com/html/html_entities.asp
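One way to read this suggestion is: escape everything first, then turn the entities back into < and > only for tags you expect. A minimal sketch under that assumption follows. The function name escapeExceptTags and the whitelist are made up for illustration, the whitelist is assumed to be non-empty, and only tags without attributes are restored.

```php
<?php
// Hypothetical helper: escape every "<" and ">", then restore known tags.
// Assumption: $allowedTags is a non-empty list of plain tag names.
function escapeExceptTags(string $input, array $allowedTags): string
{
    // Escape &, < and > everywhere; the final "false" leaves existing
    // entities such as &amp; untouched instead of double-encoding them.
    $escaped = htmlspecialchars($input, ENT_NOQUOTES, 'UTF-8', false);

    // Build an alternation such as "p|strong" from the whitelist.
    $names = implode('|', array_map(function ($tag) {
        return preg_quote($tag, '~');
    }, $allowedTags));

    // Turn "&lt;p&gt;" and "&lt;/p&gt;" back into "<p>" and "</p>".
    // The i flag lets tag names match case-insensitively.
    return preg_replace('~&lt;(/?(?:' . $names . '))&gt;~i', '<$1>', $escaped);
}

echo escapeExceptTags('<p>This is a test where 6 < 9</p>', ['p', 'strong']), "\n";
```

User text that happens to spell out a whitelisted tag exactly is restored as a tag, so this only works when the set of real tags is small and known in advance.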