Is MSXML with xslt and substring-before newline / linebreak handling inconsistent?

StackOverflow https://stackoverflow.com/questions/22210631

  •  09-06-2023

Question

Note: Actual question at the very end.

I'm thoroughly confused by what I see while trying to juggle newline/linebreaks in a source XML file via xslt when comparing MSXML (IE11) with libxml2 / Firefox.

Essentially, both libxml2 and Firefox implement XML End-of-Line Handling, which the XML specification describes as follows:

XML parsed entities are often stored in computer files which, for editing convenience, are organized into lines. These lines are typically separated by some combination of the characters CARRIAGE RETURN (#xD) and LINE FEED (#xA).

To simplify the tasks of applications, the XML processor MUST behave as if it normalized all line breaks in external parsed entities (including the document entity) on input, before parsing, by translating both the two-character sequence #xD #xA and any #xD that is not followed by #xA to a single #xA character.
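The rule quoted above can be sketched in a few lines of JavaScript. This is only an illustration of the rule as worded in the spec, not how any particular parser implements it; the names `normalizeLineBreaks` and `charCodes` are made up for this example.

```javascript
// Illustrative sketch of XML end-of-line handling (XML 1.0, section 2.11).
// normalizeLineBreaks is a hypothetical helper, not part of any parser API.
function normalizeLineBreaks(rawEntityText) {
  // "\r\n" and any "\r" not followed by "\n" both become a single "\n".
  return rawEntityText.replace(/\r\n?/g, "\n");
}

// Lists the UTF-16 code units of a string, separated by spaces.
function charCodes(str) {
  const codes = [];
  for (let i = 0; i < str.length; i += 1) {
    codes.push(str.charCodeAt(i));
  }
  return codes.join(" ");
}

console.log(charCodes("\r\n"));                                  // 13 10
console.log(charCodes(normalizeLineBreaks("\r\n")));             // 10
console.log(charCodes(normalizeLineBreaks("\r")));               // 10
console.log(charCodes(normalizeLineBreaks("a\rb\r\nc\nd")));     // 97 10 98 10 99 10 100
```

Note that the rule applies to the literal characters of the entity before parsing, so a character reference such as `&#xD;` is not covered by it.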

Now, it seems I can easily establish that IE11's MSXML does not implement this properly.

Given an xml file

<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="test.xsl"?>
<root> 
  <text>We would like:
* Free icecream
* Free beer
* Free linebreaks</text>
</root>

that contains Windows CRLF line endings in a text node, and using this xsl:

<?xml version="1.0" encoding="utf-8"?>

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html" encoding="UTF-8" indent="yes"/>

  <xsl:template match="/">
    <html>
      <body>
        <xsl:if test="contains(//text, '&#xD;&#xA;')">
          <p>The text contains CR+LF (0x0D+0x0A).</p>
        </xsl:if>
        <xsl:if test="contains(//text, '&#xD;')">
          <p>The text contains CR (0x0D).</p>
        </xsl:if>
        <xsl:if test="contains(//text, '&#xA;')">
          <p>The text contains LF (0x0A).</p>
        </xsl:if>
      </body>
    </html>
  </xsl:template>

</xsl:stylesheet>

MSXML will print

The text contains CR+LF (0x0D+0x0A).

The text contains CR (0x0D).

The text contains LF (0x0A).

whereas both FF and libxml2 (xsltproc.exe) will only print:

The text contains LF (0x0A).
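For reference, this is the result one would expect from a processor that normalizes line breaks in the parsed document but leaves the character references in the stylesheet as written. A small JavaScript sketch under that assumption (all names are made up for illustration, and `includes` stands in for XPath's `contains`):

```javascript
// Assumption: only the parsed document text is normalized (CRLF / CR -> LF).
const normalizeEol = (s) => s.replace(/\r\n?/g, "\n");

// The <text> node as stored on disk with Windows line endings.
const sourceText =
  "We would like:\r\n* Free icecream\r\n* Free beer\r\n* Free linebreaks";
const textNode = normalizeEol(sourceText);

// The stylesheet literals '&#xD;&#xA;', '&#xD;' and '&#xA;', taken as written.
const literals = [
  ["CR+LF", "\r\n"],
  ["CR", "\r"],
  ["LF", "\n"],
];

const results = literals.map(([label, literal]) => [
  label,
  textNode.includes(literal),
]);

for (const [label, found] of results) {
  console.log(`contains ${label}: ${found}`);
}
// contains CR+LF: false
// contains CR: false
// contains LF: true
```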

So far so bad. The real question now is when I use substring-before and substring-after to isolate the newlines.

Adding this xsl:

<xsl:value-of select="'before-xA:{'"/>
<xsl:value-of select="substring-before(//text, '&#xA;')" />
<xsl:value-of select="'}='"/>
<xsl:value-of select="contains(substring-before(//text, '&#xA;'), '&#xD;')" />
<xsl:value-of select="' / after-xD:{'"/>
<xsl:value-of select="substring-after(//text, '&#xD;')" />
<xsl:value-of select="'}='"/>
<xsl:value-of select="contains(substring(substring-after(//text, '&#xD;'), 1, 2), '&#xA;')" />

IE11 prints:

before-xA:{We would like:}=false / after-xD:{* Free icecream * Free beer * Free linebreaks}=false

That is, even though MSXML appears to see both the CR and the LF in the source XML, the substrings returned by substring-before / substring-after contain neither, although as far as I can tell they should.

So, what's going on here? Have I missed something about the substring-* functions? Is MSXML inconsistent?


Solution

It looks like IE is applying XML end-of-line handling not just to the input XML but also to the XSLT, apparently even to character references such as &#xD; in the stylesheet's string literals. Just try executing this in IE (with any input XML):

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:msxsl="urn:schemas-microsoft-com:xslt"
                xmlns:fn="fn"
                exclude-result-prefixes="fn msxsl">
  <xsl:output method="xml" indent="yes"/>

  <msxsl:script implements-prefix="fn">
    function charCodes(str) {
    var result = '';
    for(var i = 0; i &lt; str.length; i += 1) {
    result += str.charCodeAt(i) + " ";
    }
    return result;
    }
  </msxsl:script>

  <xsl:template match="/">
    <html>
      <body>
        <xsl:if test="function-available('fn:charCodes')">
          <div>
            <xsl:text>Char code for xA: </xsl:text>
            <xsl:value-of select="fn:charCodes('&#xA;')"/>
          </div>
          <div>
            <xsl:text>Char code for xD: </xsl:text>
            <xsl:value-of select="fn:charCodes('&#xD;')"/>
          </div>
          <div>
            <xsl:text>Char code for xDxA: </xsl:text>
            <xsl:value-of select="fn:charCodes('&#xD;&#xA;')"/>
          </div>
        </xsl:if>
        <div>
          <xsl:text>String length of xDxA: </xsl:text>
          <xsl:value-of select="string-length('&#xD;&#xA;')"/>
        </div>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>

The result this produces in IE 10 when I try it is:

Char code for xA: 10
Char code for xD: 10
Char code for xDxA: 10
String length of xDxA: 1

So all xDxAs and xDs are being replaced with xA, and I think that explains perfectly the behavior you have been witnessing.
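To spell out why that accounts for the output in the question, here is a rough JavaScript simulation. It assumes that both the text node and the string literals in the stylesheet end up with LF only; `substringBefore` and `substringAfter` are hand-written stand-ins for the XPath 1.0 functions, not MSXML code, and all names are made up for illustration.

```javascript
// Hand-written stand-ins for XPath 1.0 substring-before / substring-after.
function substringBefore(s, sep) {
  const i = s.indexOf(sep);
  return i < 0 ? "" : s.slice(0, i);
}
function substringAfter(s, sep) {
  const i = s.indexOf(sep);
  return i < 0 ? "" : s.slice(i + sep.length);
}

// Assumed IE behaviour: CRLF and lone CR become LF everywhere.
const toLf = (s) => s.replace(/\r\n?/g, "\n");

const ieText = toLf(
  "We would like:\r\n* Free icecream\r\n* Free beer\r\n* Free linebreaks"
);
const CR = toLf("\r");     // '&#xD;' in the stylesheet, now "\n"
const LF = toLf("\n");     // '&#xA;' stays "\n"
const CRLF = toLf("\r\n"); // '&#xD;&#xA;' collapses to "\n"

// All three contains() tests from the question succeed.
console.log(ieText.includes(CRLF), ieText.includes(CR), ieText.includes(LF));
// true true true

// substring-before(//text, '&#xA;') and the contains() check on it.
const before = substringBefore(ieText, LF);
console.log(JSON.stringify(before)); // "We would like:"
console.log(before.includes(CR));    // false

// substring-after(//text, '&#xD;') and the check on its first two characters.
const after = substringAfter(ieText, CR);
console.log(after.slice(0, 2).includes(LF)); // false
```

Under that assumption the literal '&#xD;' is just another LF, so the search string is consumed as the separator and can never show up in the returned substring.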

Incidentally, executing the same stylesheet in Firefox (where the msxsl:script extension is not available, so only the last line is printed) produces:

String length of xDxA: 2

And that explains what you saw in Firefox.

One final thing to note is that I can reproduce the above behavior in IE, but not in Visual Studio's XSLT functionality, so it seems that this behavior is present in some implementations of MSXSL, but not all of them.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow