Question

I'm currently working on an XML/HTML parser for node.js (if you're interested: link). Let me get right to the point: I need to know how I should handle leading whitespace inside processing instructions. Should these be treated as equal?

  1. <?asdf ?>
  2. < ?asdf ?>
  3. <? asdf ?>
  4. < ? asdf ?>

I guess that strict XML only allows the first one. But what's the expected behavior for the others? I don't want to validate; I want to accept as many constructs as I can, so it's more of a philosophical question.

Thanks in advance!


Solution

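According to the XML specification, only the first representation is allowed: a processing instruction must open with the literal <? followed immediately by the target name, with no whitespace in between. I'd say the other representations should result in an error.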

You could add some pre-processing to clean up the invalid constructs (remove the whitespace) and then read the data as XML.

This pre-processor would clean your data before it reaches your XML parser; it could even be a separate program. That way, as long as the input is halfway valid, your XML parser only ever sees valid XML and has fewer special cases to handle. If the parser still encounters an error, you can assume that the input was not XML-ish at all.

For example, the data would pass through a chain of pre-processing steps before finally being parsed as XML: remove bogus whitespace (one pre-processor) → guess closing tags (another pre-processor) → parse as XML
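A minimal sketch of such a chain in node.js, assuming each pre-processor is a plain string-to-string function. The names `removeBogusWhitespace`, `guessClosingTags`, and `preprocess` are made up for illustration, and the closing-tag stage is only a placeholder:

```javascript
// Hypothetical stage: drop whitespace between "<" and "?" (e.g. "< ?asdf ?>").
const removeBogusWhitespace = (text) => text.replace(/<\s+\?/g, "<?");

// Placeholder stage: a real one would try to repair unclosed tags.
const guessClosingTags = (text) => text;

// Run the input through each stage in order, feeding each result to the next.
function preprocess(text, stages) {
  return stages.reduce((current, stage) => stage(current), text);
}

const cleaned = preprocess("< ?asdf ?><root/>", [
  removeBogusWhitespace,
  guessClosingTags,
]);
// `cleaned` would then be handed to the actual XML parser.
```

Keeping each stage as a separate function means a stage can be added, removed, or reordered without touching the parser itself.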

Which constructs to allow follows from your stated goal of accepting as much as you can. In that case, remove all whitespace after a <; if a ? follows, also remove the whitespace up to the next word; then parse the result as XML.
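As a rough sketch, that rule could be written as a single regular-expression replacement. The function name `normalizePiOpeners` is hypothetical, and the sketch assumes the target name starts with an ASCII letter, underscore, or colon. It also rewrites matching text inside comments, CDATA sections, and attribute values, which a more careful pre-processor would skip:

```javascript
// Hypothetical helper: collapses "<", optional whitespace, "?", optional
// whitespace into "<?" when a name-like character follows.
function normalizePiOpeners(input) {
  // The lookahead keeps the first character of the target name in place.
  return input.replace(/<\s*\?\s*(?=[A-Za-z_:])/g, "<?");
}

// All four variants from the question normalize to the same string.
const variants = ["<?asdf ?>", "< ?asdf ?>", "<? asdf ?>", "< ? asdf ?>"];
const normalized = variants.map(normalizePiOpeners);
```

With this in place, all four forms from the question reach the parser as the first, well-formed one.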

Personally, I don't think accepting most constructs is desirable. If your data contains errors, they should be handled as such.

Licensed under: CC-BY-SA with attribution