What is an XML infoset and in what ways is it different to an XML document?

https://stackoverflow.com/questions/839229

22-07-2019

Question

I've tried to read http://www.w3.org/TR/xml-infoset/ and the Wikipedia entry, but frankly I'm still not sure what the difference is.

This quote from the Wikipedia entry does not seem to make sense:

An XML document has an information set if it is well-formed and satisfies the namespace constraints. There is no requirement for an XML document to be valid in order to have an information set.

How can a non-valid document have any semantics, and thus how can it have an 'information' set?

What is this 'infoset' that XML which is "well-formed and satisfies the namespace constraints" has? And in what way is it useful in itself? In other words, why is it, semantically speaking, necessary to define the XML Infoset? Is there any information that cannot be represented in XML? If so, I can see the limiting set of the XML Infoset, but if not, surely the XML Infoset is as meaningless as the term 'information'?

Thank you for the interesting answers. I still cannot grasp why the XML Infoset has any purpose as opposed to the term infoset, but you guys have given me the direct answer to the question.

Solution

A useful way of thinking of the distinction between XML text and the XML infoset is to consider the Fast Infoset. This is a binary representation of the XML infoset.

So you have an abstract "infoset", which is a conceptual model representing XML data (nodes, elements, attributes, etc). This can be physically represented as a text XML document, or as a Fast Infoset stream. Both represent the same data, but in radically different ways.

OTHER TIPS

XML is not text. XML "is" the XML infoset. This may then be serialized into text in an XML document, but it is the XML infoset that is the reality.

The infoset may exist in memory as a DOM tree, for instance. It exists in memory as the implementation of an abstract object model.

What if I serialized it as UTF-8 and then as UTF-16? The results would be two different sets of bits, but the same infoset.
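A minimal sketch of the encoding point in Python, using the standard library's xml.etree.ElementTree as a stand-in for an infoset-like object model (it exposes only part of the infoset). The summary helper and the sample element are made up for illustration.

```python
import xml.etree.ElementTree as ET

def summary(elem):
    """Reduce an element to plain data: (tag, attributes, text, children)."""
    return (elem.tag, sorted(elem.attrib.items()), elem.text or "",
            [summary(child) for child in elem])

# Build the data in memory first; no text document exists yet.
root = ET.Element("greeting", {"lang": "fr"})
root.text = "café"

# Serialize the same tree twice, with two different encodings.
utf8_bytes = ET.tostring(root, encoding="utf-8")
utf16_bytes = ET.tostring(root, encoding="utf-16")

# The byte streams differ, but parsing either one gives back the same items.
same_bytes = utf8_bytes == utf16_bytes
same_tree = summary(ET.fromstring(utf8_bytes)) == summary(ET.fromstring(utf16_bytes))
```

Here same_bytes is False while same_tree is True: the comparison that matters to an application is the one on the parsed items, not on the bytes.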

Consider also that with text it makes sense to do things like string concatenation. You don't want to concatenate a "<" into the middle of an XML element. You have to encode it first. Why would you have to do this if it were just text? If you used the DOM, for instance, you'd just say element.InnerText = "<"; When serialized, the "<" would be encoded into "&lt;". Yet it's the same infoset.
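The same idea can be sketched in Python with xml.etree.ElementTree instead of the DOM API mentioned above. The element name expr and the sample content are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

content = "a < b"

# Treating XML as plain text: concatenation produces something a parser rejects.
naive = "<expr>" + content + "</expr>"
try:
    ET.fromstring(naive)
    naive_ok = True
except ET.ParseError:
    naive_ok = False

# Working on the object model: set the character data, let the serializer escape it.
elem = ET.Element("expr")
elem.text = content
serialized = ET.tostring(elem, encoding="unicode")

# Parsing the serialized text gives the original character data back.
reparsed = ET.fromstring(serialized)
```

The serialized string contains "&lt;", but reparsed.text is "a < b" again: the escaping belongs to the text representation, not to the data.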

A valid XML document fulfills the requirements of a DTD or XSD (or other standards). If it is well-formed, it still can be 'invalid', if it violates the rules in the given DTD or XSD.

Edit: I am new to this area of XML, but it looks like the infoset is the 'abstract level' description of the parts of an XML document, independent of the actual technical implementation, which could be, for example, a Document Object Model implementation.

An XML infoset is an abstract set of concepts, such as attributes and entities, that can be used to describe a well-formed XML document. According to the specification, "An XML document's information set consists of a number of information items; the information set for any well-formed XML document will contain at least a document information item and several others."

Just because an XML document has an infoset does not mean it conforms to an XSD and is a valid XML document.
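A rough illustration in Python, assuming xml.etree.ElementTree as the parser (it checks well-formedness and namespace prefixes but does not validate against a DTD or XSD). The three sample documents are invented for this sketch.

```python
import xml.etree.ElementTree as ET

# Well-formed, with no DTD or XSD involved at all.
well_formed = '<order id="42"><item>tea</item><surprise/></order>'
# Not well-formed: the <item> element is never closed.
malformed = '<order id="42"><item>tea</order>'
# Breaks the namespace constraints: the prefix "x" is never declared.
unbound_prefix = '<x:order/>'

def parses(text):
    """Return True if the parser can build a tree from the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# The well-formed document yields elements and attributes without any schema.
root = ET.fromstring(well_formed)
items = [(child.tag, child.text) for child in root]
order_id = root.get("id")

results = {
    "well_formed": parses(well_formed),
    "malformed": parses(malformed),
    "unbound_prefix": parses(unbound_prefix),
}
```

Only the first document produces a tree. Whether an order is allowed to contain a surprise element is a validity question for a schema, and it plays no part in whether the items exist.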

Please see this link from MSDN. http://msdn.microsoft.com/en-us/library/aa468561.aspx

It is a really good explanation of the concepts and will hopefully make it clear to you.

A good example I've just come across is in David Chappell's WCF PDF. This is how it works when using TCP for example:

To allow optimal performance when both parties in a communication are built on WCF, the wire encoding used in this case is an optimized binary version of SOAP. Messages still conform to the data structure of a SOAP message, referred to as its Infoset, but their encoding uses a binary representation of that Infoset rather than the standard angle-brackets-and-text format of XML. Using this option would make sense for communicating with the call center client application, since it’s also built on WCF, and performance is a paramount concern.

XML is a language, and therefore has a syntax; the XML Infoset is a specification of the data model. The Infoset exists because applications have needs that are based on the data model rather than on the syntax. XML came before the XML Infoset. Reference: protocol considerations for Web Linkbase Access

The XML Infoset is a requirement on how you should structure a serialised XML document.

Serialized XML can have different forms, like some binary format (Fast Infoset) or text (most popular form).

Basically, for the XML document format (text), each element and attribute should be defined in an XSD through the corresponding namespace.

Here you will find an example.

Licensed under: CC-BY-SA with attribution
