After parsing an HTML or XML file, we get a DOM tree.

After parsing C, C++, or JavaScript, we get a syntax tree.

Note that the syntax tree is constructed based on the context-free grammar which specifies a valid C/C++/JS program.

But it seems the DOM tree is a purely hierarchical structure determined only by the HTML/XML file itself. Is that true? Is that why schema validation is done after parsing? What is the fundamental difference between these two kinds of parse trees?


Solution

Like any other language, XML is described by a grammar. XML's grammar is rather simple (start-tags, end-tags, correct nesting), so the syntax tree may seem simple as well (just a hierarchy of elements). An XML schema is a second grammar that describes the content of an XML file.

So basically two parsers are invoked one after the other. The first verifies that every start-tag has a matching end-tag and that the nesting is correct.

The second parser verifies that the XML file's content is structured according to the schema (grammar), for example that an element named "B" may only appear inside an element named "A".
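The two phases can be sketched roughly as follows. This is only an illustration using Python's standard `xml.etree.ElementTree`; the function names and the hard-coded "B inside A" rule are made up for the example, and a real setup would use a DTD or XML Schema validator instead of a hand-written check.

```python
import xml.etree.ElementTree as ET

def parse_well_formed(text):
    """Phase 1: return the root element, or None if the text is not well-formed XML."""
    try:
        return ET.fromstring(text)
    except ET.ParseError:
        return None

def b_only_inside_a(root):
    """Phase 2 (hypothetical rule): every <B> must be a direct child of an <A>."""
    if root.tag == "B":
        return False  # a root <B> has no <A> parent
    for parent in root.iter():
        for child in parent:
            if child.tag == "B" and parent.tag != "A":
                return False
    return True

good = "<A><B/></A>"         # well-formed and follows the rule
misplaced = "<C><B/></C>"    # well-formed, but breaks the rule
broken = "<A><B></A>"        # mismatched tags: fails phase 1 already
```

Here `broken` never reaches the second phase, while `misplaced` passes the first phase and is only rejected by the second.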

This shouldn't be compared to parsing a programming language like C, since you cannot change a programming language's syntax. If-statements can only appear within function bodies, not outside them, and you cannot change that. In XML, however, you can specify that "B" elements may only appear within "A" elements, or that "A" elements may only appear within "B" elements, all by specifying the grammar of your XML file's content in the form of a schema.
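To make the "swappable grammar" point concrete, here is a small sketch in which the content rules are plain data passed to the checker. The `allowed_parents` dictionary is a toy stand-in for a schema, not a real schema language, and all names are invented for the example.

```python
import xml.etree.ElementTree as ET

def violations(root, allowed_parents):
    """List (child, parent) tag pairs that the given rules forbid.

    allowed_parents maps a tag to the set of tags it may appear directly
    inside; tags missing from the map are unconstrained.
    """
    found = []
    for parent in root.iter():
        for child in parent:
            allowed = allowed_parents.get(child.tag)
            if allowed is not None and parent.tag not in allowed:
                found.append((child.tag, parent.tag))
    return found

doc = ET.fromstring("<root><A><B/></A></root>")
schema_b_in_a = {"B": {"A"}}   # "B" may only appear inside "A"
schema_a_in_b = {"A": {"B"}}   # "A" may only appear inside "B"
```

The same document is accepted under `schema_b_in_a` and rejected under `schema_a_in_b`, without touching the XML parser itself. There is no comparable knob for where an if-statement may appear in C.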

Other tips

Thanks to Ira Baxter and Guy Coder for their interest.

I researched this for a while and compared the two cases. My impression is this:

XML parsing can be either "validating" or "non-validating". In the latter case, the parser does not check the document against the Document Type Definition (DTD); it only checks well-formedness and produces the hierarchy of elements in the XML file, so it is lighter than validating parsing.
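As a rough illustration of non-validating parsing, Python's standard `xml.etree.ElementTree` builds the element hierarchy for any well-formed input without consulting a DTD. The `hierarchy` helper below is just something I wrote to show the resulting tree shape.

```python
import xml.etree.ElementTree as ET

def hierarchy(element):
    """Return the element tree as nested (tag, [children]) tuples."""
    return (element.tag, [hierarchy(child) for child in element])

# No DTD or schema is involved: any well-formed nesting is accepted.
tree = hierarchy(ET.fromstring("<B><A><anything/></A><A/></B>"))
# tree == ("B", [("A", [("anything", [])]), ("A", [])])
```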

Parsing C/C++/Java generates the syntax tree based on the language's context-free grammar. So, informally, it is more like validating parsing.
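I don't have a C parser at hand, so as a stand-in here is the same idea with Python's own `ast` module: the parser rejects text that does not fit the language grammar, and the node types in the tree come from that grammar. The `parses` helper is just for the example.

```python
import ast

def parses(source):
    """Return True if the source conforms to Python's grammar."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

tree = ast.parse("x = 1 + 2")
assign = tree.body[0]   # an ast.Assign node
value = assign.value    # an ast.BinOp node for "1 + 2"
```

So `parses("x = (1 +")` is false, much like a validating XML parser refusing a document that does not match its DTD.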

PS: I am not an expert, so comments are welcome if you find that my understanding is incorrect.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow