What causes extremely poor XSLT performance: text fragments or priorities?

https://stackoverflow.com/questions/23611464

20-07-2023

Question

I am trying to do some cleanup using XSLT. I want to change some text fragments and leave all the other nodes in peace. However, my current implementation runs very slowly and consumes a lot of memory. Removing one small template cuts the run time from a minute to a fraction of a second.

This is the XSLT:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
    <xsl:import href="../common/identity.xsl"/>

    <xsl:template match="text()" priority="100">
        <xsl:variable name="pass1" select="replace(., '(_|~)', ' ')"/>
        <xsl:variable name="pass2" select="replace($pass1, ' , ', ', ')"/>

        <xsl:variable name="final" select="$pass2"/>

        <xsl:value-of select="$final"/>
    </xsl:template>

    <xsl:template match="body/text()[1][. = ' '] | body/text()[last()][. = ' ']"
                  priority="200"/>
</xsl:stylesheet>

The first template replaces some characters, the second template removes the first and last text fragments, but only if they contain exactly one space (sadly normalize-space does not fit my needs).

This XSLT runs very slowly and consumes a lot of memory. If I remove the last template, the same XSLT runs fast and uses a normal amount of memory.

The XSLT is run using Saxon-(HE|EE) 9.5.1.3 inside oXygen 15.2.

What is causing this big loss of performance? Is it the use of text fragments in general? The use of priorities? The use of [1] and [last()]?

Solution

Using not(following-sibling::text()) instead of last() fixed it. Could you explain why, or give some pointers to the problems with last()?

There are two ways of evaluating patterns: left-to-right, and right-to-left, corresponding to the "formal" and "informal" semantics given in section 5.5.3 of the specification. The right-to-left method is much more efficient, but it cannot be used for all patterns; in particular, patterns that use positional predicates are tricky. Saxon will handle a number of cases efficiently, including match="para[last()]", but for some others, including match="para[last()-1]" and (it seems) match="section/para[last()]", it takes the slow-but-methodical route. I'll take a look at the code and see if this can be improved.
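For reference, the fix mentioned above might look roughly like the stylesheet below. This is only a sketch: it assumes the rest of the original stylesheet stays unchanged and that only the last() predicate is replaced. The [1] predicate is left as it was, since the question does not say whether it contributed to the slowdown.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
    <xsl:import href="../common/identity.xsl"/>

    <!-- Unchanged: character replacements applied to every text node. -->
    <xsl:template match="text()" priority="100">
        <xsl:variable name="pass1" select="replace(., '(_|~)', ' ')"/>
        <xsl:variable name="pass2" select="replace($pass1, ' , ', ', ')"/>

        <xsl:variable name="final" select="$pass2"/>

        <xsl:value-of select="$final"/>
    </xsl:template>

    <!-- Changed: "is the last text child of body" is now written as
         "has no following text sibling" instead of [last()]. -->
    <xsl:template match="body/text()[1][. = ' ']
                       | body/text()[not(following-sibling::text())][. = ' ']"
                  priority="200"/>
</xsl:stylesheet>
```

For a text node that is a child of body, having no following text sibling is the same condition as being the last text() child, so this pattern should select the same nodes as the original one.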

Licensed under: CC-BY-SA with attribution
