HTML Tidy with PHP code: XHTML is not valid XML afterwards
Question
I'm using HTML Tidy (http://tidy.sourceforge.net/) to convert HTML to XHTML, and I want to transform that XHTML later with XSLT.
Unfortunately, when I tried to parse a TechCrunch page (just for testing), it turned out that the page contains PHP code, and HTML Tidy produces an XML file that is not valid because of this PHP code.
Simplified input file dirty.htm:
<html>
<head>
</head>
<body>
<a href="http://www.crunchbase.com/company/google" onclick="<?php tc_set_omniture_attr("post_widget_crunchbase") ?>Google</a>
</body>
</html>
And my output file from HTML Tidy, cleaned.htm:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title></title>
</head>
<body>
<p><a href="http://www.crunchbase.com/company/google" onclick="<?php tc_set_omniture_attr(">Google</a></p>
</body>
</html>
The main problem is the < in the onclick attribute, which is not allowed inside an XML attribute value! XSLTProc refuses to open this file because it is not well-formed XML.
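The failure can be reproduced without XSLTProc. The following is a minimal sketch, assuming Python's standard-library XML parser as a stand-in for XSLTProc's parser; the helper name `is_well_formed` is hypothetical, and the `bad` string is the anchor line from cleaned.htm above.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as XML (hypothetical helper)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# The anchor line as emitted by Tidy: a raw '<' inside the onclick value.
bad = ('<p><a href="http://www.crunchbase.com/company/google" '
       'onclick="<?php tc_set_omniture_attr(">Google</a></p>')

# Same line with the '<' escaped as '&lt;', which XML does allow.
good = bad.replace('onclick="<?php', 'onclick="&lt;?php')

print(is_well_formed(bad))
print(is_well_formed(good))
```

The only difference between the two strings is the escaped `<`, which suggests that the raw `<` in the attribute value is what the XML parser rejects.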
My HTML Tidy options, tidyconfig.cfg:
output-xhtml: 1
indent: 0
tidy-mark: 0
wrap: 0
alt-text:
doctype: strict
force-output: 1
numeric-entities: 1
clean: 1
bare: 1
word-2000: 1
drop-proprietary-attributes: 1
enclose-text: 1
logical-emphasis: 1
HTML Tidy command line:
tidy -quiet -config tidyconfig.cfg -output cleaned.htm dirty.htm
Did I miss any HTML Tidy option? All Tidy options: http://tidy.sourceforge.net/docs/quickref.html
Solution
Tidy has only limited support for PHP code. I suspect it is getting confused because the PHP block is inside an attribute whose closing quote is missing.
It might have a better chance with the attribute closed:
<a href="..." onclick="<?php tc_set_omniture_attr("post_widget_crunchbase") ?>">Google</a>
Sorry, not sure there's much else that can be done. Hope that helps.
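If the PHP blocks are not needed in the XSLT step, one possible workaround (not something Tidy itself provides) is to strip them before running Tidy. This is a minimal sketch under that assumption; `PHP_BLOCK` and `strip_php` are hypothetical names, and the regular expression only handles `<?php ... ?>` blocks that do not contain `?>` inside a PHP string.

```python
import re

# Assumption: every PHP block starts with '<?php' and ends at the next '?>'.
PHP_BLOCK = re.compile(r"<\?php.*?\?>", re.DOTALL)


def strip_php(html: str) -> str:
    """Remove <?php ... ?> blocks from html (hypothetical pre-processing step)."""
    return PHP_BLOCK.sub("", html)


# The anchor from the answer above, with the onclick attribute properly closed.
fixed = ('<a href="..." onclick="<?php tc_set_omniture_attr('
         '"post_widget_crunchbase") ?>">Google</a>')

print(strip_php(fixed))  # prints: <a href="..." onclick="">Google</a>
```

Note that this does not repair the original dirty.htm line on its own: there the closing quote of onclick is missing, so the attribute is still unterminated after the PHP block is removed.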
Other tips
Do you have the option to remove onclick from the link, and instead move the onclick script to between some script tags?