Question

Please consider this kind of XHTML document:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head></head>
<body>
<!--- Some comment with 3 dashes that causes parsing error --->
<!-- Rest of XHTML -->
</body>
</html>

and this partial VBScript code that I'm using to parse it:

With CreateObject("MSXML2.DOMDocument.6.0")
    .async = False
    .setProperty "ProhibitDTD", False
    .validateOnParse = False
    .setProperty "SelectionLanguage", "XPath"
    .setProperty "SelectionNamespaces", "xmlns:xhtml='http://www.w3.org/1999/xhtml'"
    .load(url)
End With

I get this error:

Incorrect syntax was used in a comment

apparently because the comment uses three dashes.

Does anyone know how to resolve this without fetching the document via an HTTP request and correcting the XHTML source myself?

Solution

As the XML standard clearly states:

For compatibility, the string "--" (double-hyphen) MUST NOT occur within comments.

so no conforming parser should accept your 'XML' as well-formed. You could look for a lenient parser (some version of BeautifulSoup, 3.08, reportedly accepts -- in comments), but the real solution is either to clean the data before calling .loadXML or, better, to take a big stick to the author.
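As a rough illustration of the "clean the data first" option, here is a minimal sketch in VBScript. The helper name FixComments and the sample string are made up for this example, and it assumes you already have the document text in a string (how you obtain it is up to you). It collapses runs of hyphens inside each comment and pads a trailing hyphen, so that the comment no longer ends in --->. It is a plain regular-expression pass, so it would also rewrite anything that merely looks like a comment inside a CDATA section.

```vbscript
Option Explicit

' Hypothetical helper: returns xmlText with each comment body made well-formed.
' Runs of two or more hyphens are collapsed to one, and a body that ends
' in a hyphen gets a trailing space so the comment does not end in "--->".
Function FixComments(xmlText)
    Dim reComment, reDashes, m, body, result, pos

    Set reComment = New RegExp
    reComment.Global = True
    reComment.Pattern = "<!--([\s\S]*?)-->"

    Set reDashes = New RegExp
    reDashes.Global = True
    reDashes.Pattern = "-{2,}"

    result = ""
    pos = 1   ' 1-based index of the first character not yet copied
    For Each m In reComment.Execute(xmlText)
        body = reDashes.Replace(m.SubMatches(0), "-")
        If Right(body, 1) = "-" Then body = body & " "
        ' FirstIndex is 0-based, Mid is 1-based: copy the text before the match
        result = result & Mid(xmlText, pos, m.FirstIndex + 1 - pos)
        result = result & "<!--" & body & "-->"
        pos = m.FirstIndex + m.Length + 1
    Next
    FixComments = result & Mid(xmlText, pos)
End Function

' Usage sketch with a made-up sample string
Dim sample
sample = "<root><!--- bad comment ---></root>"

With CreateObject("MSXML2.DOMDocument.6.0")
    .async = False
    If .loadXML(FixComments(sample)) Then
        WScript.Echo .documentElement.xml
    Else
        WScript.Echo .parseError.reason
    End If
End With
```

Note that this uses .loadXML on a string rather than .load(url), so the text has to be read by some other means before it is cleaned.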

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow