Is there a way to restrict an XML document from containing external entity references via XSD schema?

StackOverflow https://stackoverflow.com/questions/18211532

  •  24-06-2022

Question

XML files are used as a data exchange format between REST web services. The services are designed such that there is never a need for external entity references within the XML files. What I want is an XSD schema that prevents such references.

My idea is to create a schema file that uses a regular expression like:

<xs:simpleType name="stringValue">
    <xs:restriction base="xs:string">
        <!-- XSD patterns are implicitly anchored; ^ and $ would be matched as literal characters -->
        <xs:pattern value="[a-zA-Z -]{2,32}" />
    </xs:restriction>
</xs:simpleType>

If a reference like <foo>&externalRef;</foo> gets validated, it will fail because the ampersand character isn't part of the regular expression!

What other measures are there to achieve this?

Solution

No, the constraint you want is not expressible with XSD.

XSD operates on an XML information set (typically one generated by an XML parser), and the information items which constitute its input do not preserve information about the original entity structure of the XML document.
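To make this concrete, here is a minimal Python sketch using only the standard library. The two sample documents and the entity name `greeting` are made up for illustration. One document uses an internal entity and the other spells the text out literally, yet after parsing both yield the same element content, which is all a schema validator working on the parsed result gets to see.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample documents: the first declares and uses an internal entity,
# the second contains the same text written out directly.
WITH_ENTITY = '<!DOCTYPE foo [<!ENTITY greeting "hello">]><foo>&greeting;</foo>'
WITHOUT_ENTITY = "<foo>hello</foo>"


def parsed_text(document: str) -> str:
    """Return the character content of the root element after parsing."""
    return ET.fromstring(document).text


# Both calls return "hello": the entity reference is already expanded,
# so nothing downstream of the parser can tell the two inputs apart.
print(parsed_text(WITH_ENTITY))
print(parsed_text(WITHOUT_ENTITY))
```

The pattern facet in the question would therefore be checked against `hello`, never against `&greeting;`.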

Those who (like you) wish to forbid entity references in XML input typically do so either by an ad hoc rule to that effect or, if they are operating at the infoset level, by a rule forbidding the infoset to reflect the existence of a document type declaration.

[Later addition] That last bit appears to be hard to digest. Consider a specification (such as the SOAP spec) which wants to define its operations and constraints at the level of the XML infoset, and not at the level of XML character streams. It wants, that is, to talk about elements and attributes, and not about angle brackets. At the same time, it wants to forbid entity references. Such a specification cannot forbid entity references using constraints written in XSD (or in DTDs, or Relax NG, or Schematron, ...), because all of these also operate at the infoset level, and there are no entity references at the level of abstraction on which they operate. But any spec can constrain the set of infosets which are acceptable as input to conforming processes. (After all, that's what we're doing when we say 'the root element must be named foo:bar and have a baz attribute', right? We're defining the set of input infosets a conforming processor is expected to support.) So specs which want, like SOAP, to define things at the infoset level, and which also want, like SOAP, to forbid entity references, typically (at least, in my experience) say "The document information item must not contain any document type declaration information item among its children." This technique may be of importance only to standards nerds and language lawyers, but for those who care about XML and XML standards, it is important to understand both how SOAP goes about forbidding entity references and why it strikes some, including me, as a bad idea. [End addition]

Since all entity references in a well-formed XML document refer to entities declared in the document type definition, the absence of a document type declaration suffices to make all entities other than the predefined ones (lt, gt, amp, apos, and quot) impossible.
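An ad hoc rule of this kind is usually enforced in the parser rather than in a schema. The following is a minimal sketch, assuming Python's standard `xml.parsers.expat` module; the names `DoctypeForbidden` and `parse_without_doctype` are invented for this example. It aborts parsing as soon as a document type declaration starts.

```python
import xml.parsers.expat


class DoctypeForbidden(ValueError):
    """Raised when the input contains a document type declaration."""


def parse_without_doctype(data: bytes) -> None:
    """Parse `data`, rejecting any document that carries a DOCTYPE."""
    parser = xml.parsers.expat.ParserCreate()

    def reject_doctype(name, system_id, public_id, has_internal_subset):
        # Called by Expat at the start of a <!DOCTYPE ...> declaration.
        raise DoctypeForbidden(f"document type declaration not allowed: {name}")

    parser.StartDoctypeDeclHandler = reject_doctype
    parser.Parse(data, True)


# Predefined entities still work without a DOCTYPE.
parse_without_doctype(b"<foo>a &amp; b</foo>")

try:
    parse_without_doctype(b'<!DOCTYPE foo [<!ENTITY e "x">]><foo>&e;</foo>')
except DoctypeForbidden as error:
    print("rejected:", error)
```

With no document type declaration present, a reference such as `&externalRef;` has nothing to refer to, and the parser reports it as an ordinary well-formedness error (`xml.parsers.expat.ExpatError`).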

Your technical question is now answered, but I would be remiss if I didn't point out that, by every measure I can imagine, your goal is an unwise one.

The world would (it says here) be a better place if you and others (like the inventors of SOAP) allowed the creators of XML documents to use XML as it is specified. You may see no need for entity references, whereas someone managing a system for which you have no responsibility may see just such a need. Do you really think you know their system and workflow better than they do? Why make work for yourself and others by defining an ad hoc subset of XML instead of using it as it is specified and implemented? The usual response to this is fear of the billion-laughs attack, which is (it says here) a completely bogus argument. Resource attacks are better handled by having the XML parser impose a limit on the maximum length of entity replacement text, or by running the XML parser (and other processes which handle untrusted input) in a process with resource limits imposed by the operating system.
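One way to picture the first suggestion is a rough sketch on top of Python's standard `xml.parsers.expat` module. The limit of 1000 characters, the helper name `parse_with_entity_limit`, and the exception `EntityTooLarge` are assumptions made for this example, not part of any library. It estimates the fully expanded length of each internal general entity when it is declared and stops parsing once that length exceeds the limit, so entities remain usable while runaway expansion does not.

```python
import re
import xml.parsers.expat

# Expanded sizes of the five predefined entities, which need no declaration.
PREDEFINED_SIZES = {"lt": 1, "gt": 1, "amp": 1, "apos": 1, "quot": 1}
# Matches general entity references such as &name; inside a declared value.
ENTITY_REF = re.compile(r"&([A-Za-z_:][\w.:-]*);")


class EntityTooLarge(ValueError):
    """Raised when an entity would expand beyond the configured limit."""


def parse_with_entity_limit(data: bytes, limit: int = 1000) -> str:
    """Parse `data` and return its character content, rejecting any internal
    entity whose estimated expansion exceeds `limit` characters."""
    sizes = dict(PREDEFINED_SIZES)
    chunks = []

    def on_entity_decl(name, is_parameter, value, base,
                       system_id, public_id, notation):
        if is_parameter or value is None:
            return  # parameter and external entities are outside this sketch
        literal_length = len(ENTITY_REF.sub("", value))
        # A reference to an entity not declared yet is conservatively
        # treated as already exceeding the limit.
        referenced_length = sum(
            sizes.get(ref, limit + 1) for ref in ENTITY_REF.findall(value)
        )
        expanded_length = literal_length + referenced_length
        if expanded_length > limit:
            raise EntityTooLarge(
                f"entity {name!r} expands to more than {limit} characters"
            )
        sizes.setdefault(name, expanded_length)  # the first declaration wins

    parser = xml.parsers.expat.ParserCreate()
    parser.EntityDeclHandler = on_entity_decl
    parser.CharacterDataHandler = chunks.append
    parser.Parse(data, True)
    return "".join(chunks)


# Hypothetical input: "b" nests six copies of the ten-character entity "a".
NESTED = (b'<!DOCTYPE r [<!ENTITY a "0123456789">'
          b'<!ENTITY b "&a;&a;&a;&a;&a;&a;">]><r>&b;</r>')

print(len(parse_with_entity_limit(NESTED)))  # accepted under the default limit
try:
    parse_with_entity_limit(NESTED, limit=50)
except EntityTooLarge as error:
    print("rejected:", error)
```

The check runs at declaration time, before any expansion happens, so a nested "laughs" document is refused without its replacement text ever being materialised. The second suggestion, operating-system resource limits on the parsing process, is complementary and needs no cooperation from the parser.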

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow