Question

I'm using HTML Tidy (http://tidy.sourceforge.net/) to convert HTML to XHTML, and I want to transform that XHTML later with XSLT.

Unfortunately, I ran into a problem when I tried to parse a TechCrunch page (just for testing). The page contains PHP code, and with this PHP code HTML Tidy produces a file that is not well-formed XML.

Simplified input file dirty.htm:

<html>
<head>
</head>
<body>
  <a href="http://www.crunchbase.com/company/google" onclick="<?php tc_set_omniture_attr("post_widget_crunchbase") ?>Google</a>
</body>
</html>

and the output file produced by HTML Tidy, cleaned.htm:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title></title>
</head>
<body>
<p><a href="http://www.crunchbase.com/company/google" onclick="<?php tc_set_omniture_attr(">Google</a></p>
</body>
</html>

The main problem is the < in the onclick attribute, which is not allowed in an XML attribute value. xsltproc refuses to open this malformed XML.
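
To illustrate why the parser rejects the file, here is a minimal sketch in Python. It uses the standard library's xml.etree.ElementTree rather than xsltproc, which is an assumption made for illustration only; the helper name is_well_formed is hypothetical. It checks the link from cleaned.htm as-is and then with the < escaped as &lt;.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# The link as Tidy emitted it: a raw '<' inside the onclick attribute value.
bad = ('<a href="http://www.crunchbase.com/company/google" '
       'onclick="<?php tc_set_omniture_attr(">Google</a>')

# The same link with the '<' escaped as '&lt;' (illustration only).
good = ('<a href="http://www.crunchbase.com/company/google" '
        'onclick="&lt;?php tc_set_omniture_attr(">Google</a>')

print(is_well_formed(bad))
print(is_well_formed(good))
```

Only the raw < makes the difference here: an XML attribute value may contain it only in escaped form.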

My HTML Tidy options (tidyconfig.cfg):

output-xhtml: 1
indent: 0
tidy-mark: 0
wrap: 0
alt-text:
doctype: strict
force-output: 1
numeric-entities: 1
clean: 1
bare: 1
word-2000: 1
drop-proprietary-attributes: 1
enclose-text: 1
logical-emphasis: 1

HTML Tidy command line:

tidy -quiet -config tidyconfig.cfg -output cleaned.htm dirty.htm

Did I miss an HTML Tidy option? All Tidy options: http://tidy.sourceforge.net/docs/quickref.html


Solution

Tidy has only limited support for PHP code. I suspect it is getting confused because the PHP block sits inside an attribute whose closing quote is missing.

It might have a better chance with:

<a href="..." onclick="<?php tc_set_omniture_attr("post_widget_crunchbase") ?>">Google</a>

Sorry, not sure there's much else that can be done. Hope that helps.
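
If fixing the source markup is not an option, one possible workaround is to strip the PHP blocks before handing the file to Tidy. The following is a rough sketch in Python, not a Tidy feature; the names PHP_BLOCK and strip_php are hypothetical. It assumes every PHP block opens with <?php and closes with ?>, that no PHP string literal contains ?>, and that the attribute quotes around the block are balanced, as in the corrected link above.

```python
import re

# Non-greedy match from '<?php' to the first following '?>'.
# Assumption: no PHP string literal inside the block contains '?>'.
PHP_BLOCK = re.compile(r"<\?php.*?\?>", re.DOTALL)


def strip_php(html: str) -> str:
    """Remove every <?php ... ?> block from the given markup."""
    return PHP_BLOCK.sub("", html)


sample = ('<a href="..." onclick="<?php tc_set_omniture_attr('
          '"post_widget_crunchbase") ?>">Google</a>')

print(strip_php(sample))  # prints: <a href="..." onclick="">Google</a>
```

On the original dirty.htm, where the closing quote of onclick is missing, removing the PHP block alone would still leave an unterminated attribute, so the quote has to be repaired as well.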

Other tips

Do you have the option of removing onclick from the link and moving the onclick script into a script element instead?

License: CC-BY-SA with attribution
Not affiliated with StackOverflow