Question

I'm using http://tidy.sourceforge.net/ to convert HTML to XHTML and I want to transform this XHTML later with XSLT.

Unfortunately, when I tried to parse a TechCrunch page (just for testing), it turned out that the page contains PHP code, and with this PHP code in place HTML Tidy produces a file that is not well-formed XML.

Simplified input file dirty.htm:

<html>
<head>
</head>
<body>
  <a href="http://www.crunchbase.com/company/google" onclick="<?php tc_set_omniture_attr("post_widget_crunchbase") ?>Google</a>
</body>
</html>

and my output file from HTML Tidy, cleaned.htm:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title></title>
</head>
<body>
<p><a href="http://www.crunchbase.com/company/google" onclick="<?php tc_set_omniture_attr(">Google</a></p>
</body>
</html>

The main problem is the < in onclick, which is not allowed inside an XML attribute value! xsltproc refuses to open this malformed XML.

My HTML Tidy Options tidyconfig.cfg:

output-xhtml: 1
indent: 0
tidy-mark: 0
wrap: 0
alt-text:
doctype: strict
force-output: 1
numeric-entities: 1
clean: 1
bare: 1
word-2000: 1
drop-proprietary-attributes: 1
enclose-text: 1
logical-emphasis: 1

HTML Tidy command line:

tidy -quiet -config tidyconfig.cfg -output cleaned.htm dirty.htm

Did I miss any HTML Tidy option? All Tidy options: http://tidy.sourceforge.net/docs/quickref.html

Solution

Tidy has only limited support for PHP code. I suspect it is getting confused here because the PHP block sits inside an attribute whose closing quote is missing.

It might have a better chance with the attribute properly closed:

<a href="..." onclick="<?php tc_set_omniture_attr("post_widget_crunchbase") ?>">Google</a>

Sorry, I'm not sure there's much else that can be done. Hope that helps.
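
If the PHP blocks are not needed in the output, one possible workaround is to strip them before the file ever reaches Tidy. The following is only a minimal sketch of that idea in Python, not something Tidy does itself: the helper name strip_php_blocks is hypothetical, and the regular expression assumes every block starts with <?php and ends at the first ?> that follows it, so PHP string literals that contain ?> would be cut short.

```python
import re

# Assumption: every PHP block opens with "<?php" and ends at the first "?>".
# re.DOTALL lets a block span several lines.
PHP_BLOCK = re.compile(r"<\?php.*?\?>", re.DOTALL)


def strip_php_blocks(html: str) -> str:
    """Return html with every <?php ... ?> block removed (hypothetical helper)."""
    return PHP_BLOCK.sub("", html)


# The properly closed variant of the link from the question.
sample = (
    '<a href="..." onclick="<?php tc_set_omniture_attr('
    '"post_widget_crunchbase") ?>">Google</a>'
)
print(strip_php_blocks(sample))
```

The result can then be written to a temporary file and passed to tidy with the same command line as before. Note that this does not repair the missing closing quote in the simplified dirty.htm; it only removes the PHP.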

Other tips

Do you have the option to remove onclick from the link, and instead move the onclick script to between some script tags?
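
If you cannot edit the page itself, the removal could be done as a pre-processing step before Tidy runs. This is a rough sketch under a few assumptions: the helper name drop_php_onclick is made up for illustration, the attribute is double-quoted and properly closed, and its whole value is a single <?php ... ?> block. Moving the script into script tags would still have to be done separately.

```python
import re

# Assumption: the onclick value is exactly one PHP block inside closed double quotes.
PHP_ONCLICK = re.compile(r'\s+onclick="<\?php.*?\?>"', re.DOTALL)


def drop_php_onclick(html: str) -> str:
    """Remove onclick attributes whose value is a PHP block (hypothetical helper)."""
    return PHP_ONCLICK.sub("", html)


link = (
    '<a href="http://www.crunchbase.com/company/google" '
    'onclick="<?php tc_set_omniture_attr("post_widget_crunchbase") ?>">Google</a>'
)
print(drop_php_onclick(link))
```

Ordinary onclick handlers without PHP are left untouched, so only the attributes that confuse Tidy are dropped.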

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow